Data Mining Methods for Quantitative Structure-Activity Relationships

Posted by Sintjago on Nov 15, 2010 in Fall 2010

CSCI 8970 – Colloquium Series – Fall 2010 – Tenth Event

Cray Distinguished Speaker Series

Monday, November 15, 2010

Presenter: Stefan Kramer, Technische Universität München

Dr. Kramer’s lecture focused on data mining methods for quantitative structure-activity relationships (QSAR). He began with graph mining, in which frequent or class-correlated substructures are searched for using different search strategies. These substructures are then used as features in statistical learning models, such as (Q)SAR models. While the technique is useful, it is also limited. The limitations include long running times and an excessive number of frequent or class-correlated substructures.

Consider the following task: given a database of tens of thousands (or more) of structures, how can one find all the substructures that are over-represented in one activity class and under-represented in another? Existing approaches produce poor results on this task, and it is one of the problems his data mining team has worked to improve upon. One goal is scalability, pursued through a new, practical class of substructures: backbone refinement classes (BBRC), i.e., trees sharing a common backbone. The most significant representative is then picked from each class. The latest work in structural pattern mining included automatically discovering structural alerts in three steps: align, stack, and compress. Results were presented for blood-brain barrier and bioavailability data.
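To make "over-represented in one class" concrete, here is a minimal sketch of scoring a single substructure against two activity classes. It assumes a chi-square test on a 2x2 contingency table as the significance measure; the summary above does not say which statistic was used, and the function name and the counts are hypothetical.

```python
def chi_square_2x2(active_with, active_total, inactive_with, inactive_total):
    """Chi-square statistic for one substructure vs. two activity classes.

    Assumed measure of class correlation (not stated in the lecture summary).
    """
    a = active_with                    # actives containing the substructure
    b = active_total - active_with     # actives without it
    c = inactive_with                  # inactives containing the substructure
    d = inactive_total - inactive_with # inactives without it
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: substructure occurs everywhere or nowhere
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts: the substructure occurs in 40 of 100 actives
# and in 10 of 100 inactives.
score = chi_square_2x2(40, 100, 10, 100)
print(score)  # 24.0, above 3.841 (the 5% critical value for 1 degree of freedom)
```

A miner would compute such a score for every candidate substructure, which is why the sheer number of candidates drives the running time.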

When attempting to improve the reliability of (Q)SAR models, they concluded that the chemical space is messy. To address this, they now use structural clustering with local models. They create a training set through a preprocessing step and then test on the chemical space, which diminishes the messiness. Evaluating the algorithms has shown that this improves on previous models. Fast conditional density estimation (CDE) for QSARs predicts a distribution of activities rather than point estimates, which quantifies uncertainty. To do it faster, they use a general-purpose machine learning method as a plug-in and then use a histogram estimator. They perform CDE via class probabilities, using equal-frequency binning. Any method that provides class probability estimates can be plugged in. Training is roughly 100 times faster.
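The conditional density estimation idea can be sketched as follows: bin the activity values with equal-frequency binning, let a classifier estimate the probability of each bin, and turn those probabilities into a histogram density by dividing by the bin width. This is only an illustration under stated assumptions: a simple k-nearest-neighbour estimator on a single numeric descriptor stands in for the plug-in learner, and all function names and data are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_edges(values, n_bins):
    """Bin edges chosen so that each bin holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[0]]
    edges += [ordered[(i * n) // n_bins] for i in range(1, n_bins)]
    edges.append(ordered[-1])
    return sorted(set(edges))  # drop duplicate edges caused by tied values


def bin_index(value, edges):
    """Index of the bin containing value; the last bin includes its upper edge."""
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


def knn_class_probs(train_x, train_bins, x, n_bins, k=3):
    """Stand-in plug-in learner: bin probabilities from the k nearest neighbours."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    probs = [0.0] * n_bins
    for i in nearest:
        probs[train_bins[i]] += 1.0 / len(nearest)
    return probs


def conditional_density(train_x, train_y, x, n_bins=4, k=3):
    """Histogram estimate of p(activity | x): bin probability / bin width."""
    edges = equal_frequency_edges(train_y, n_bins)
    bins = [bin_index(y, edges) for y in train_y]
    probs = knn_class_probs(train_x, bins, x, len(edges) - 1, k)
    density = [p / (edges[j + 1] - edges[j]) for j, p in enumerate(probs)]
    return edges, density


# Hypothetical data: one descriptor value and one activity per compound.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
train_y = [0.1, 0.2, 0.4, 0.5, 1.0, 1.5, 3.0, 6.0]
edges, density = conditional_density(train_x, train_y, x=0.0)
print(edges)
print(density)
```

The result is a piecewise-constant density over the activity range that integrates to one, so it conveys uncertainty instead of a single predicted value.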

When taking 3D information into account, they address two problems, one theoretical (numerical refinement in inductive logic programming, ILP) and one practical (pharmacophore discovery). Their OpenTox REST web service approach comprises an API, web services, domain-dependent and domain-independent algorithms, ontologies, and use cases and demos. The target audience is composed of toxicologists, risk assessors, model builders, and computer scientists.

Regarding the strategic context and goals, one driver is REACH: a possible reduction in the number of test animals by using existing experimental data in conjunction with QSAR. There are also practical needs, such as reporting and form filling. The OECD principles give rise to a number of requirements for a framework like OpenTox. One of its features is representational state transfer (REST), an architectural style for distributed information systems on the web, characterized by simple interfaces, data transfer via the hypertext transfer protocol (HTTP), and statelessness.
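As a rough illustration of the REST style described above, the sketch below builds (but does not send) a stateless HTTP request for a resource identified by a URL. The host, the path layout, and the media type are placeholders chosen for the example; they are not taken from the OpenTox interface definition.

```python
from urllib.request import Request


def build_dataset_request(base_url, dataset_id):
    """Build a GET request for a dataset resource (hypothetical URL layout).

    Everything the server needs is in the request itself: the resource URL,
    the HTTP method, and the desired representation. No session state is kept.
    """
    url = f"{base_url}/dataset/{dataset_id}"
    return Request(url, headers={"Accept": "application/json"}, method="GET")


# example.org is a placeholder host, not a real service endpoint.
request = build_dataset_request("https://example.org", 1)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual transfer against a real service.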

The ontologies provide a formal, shared conceptualization of a domain. They are needed because distributed services must be able to “talk to each other”, i.e., have a common understanding of endpoints, any type of property, methods, etc. Their API is built around a dataset with features and compounds. The interface definition can be found on the OpenTox website and can be used for distributed applications, for integrating a wide range of data, models, and prediction methods, and for integration into workflow systems for computational biology.
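One way to picture "a dataset with features and compounds" is a table whose rows are compounds and whose columns are features, with each entry addressed by a URL. The sketch below is a hypothetical in-memory model of that idea; the class, its fields, and the example URLs are assumptions for illustration and do not reproduce the actual OpenTox API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset: compounds x features, all identified by URL."""
    uri: str
    compounds: list = field(default_factory=list)
    features: list = field(default_factory=list)
    values: dict = field(default_factory=dict)  # (compound, feature) -> value

    def add_value(self, compound, feature, value):
        """Record one measurement, registering compound and feature if new."""
        if compound not in self.compounds:
            self.compounds.append(compound)
        if feature not in self.features:
            self.features.append(feature)
        self.values[(compound, feature)] = value

    def row(self, compound):
        """All feature values for one compound (None where nothing is recorded)."""
        return {f: self.values.get((compound, f)) for f in self.features}


# Placeholder URLs on example.org; the feature names are made up.
ds = Dataset("https://example.org/dataset/1")
ds.add_value("https://example.org/compound/1", "https://example.org/feature/logP", 2.1)
ds.add_value("https://example.org/compound/2", "https://example.org/feature/toxic", True)
print(ds.row("https://example.org/compound/1"))
```

Because every item is a URL, separate services can refer to the same compound or feature without sharing a database.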

Dr. Kramer concluded by reviewing the data mining methods presented, the method-oriented work motivated by problems from application areas, the suitable representation of (2D and 3D) models, and the many interesting problems that remain unsolved. OpenTox, by being available to anyone, is increasing global access to knowledge. Although Dr. Kramer’s lecture was hard to follow for someone whose main field is computer science, lectures like this one can be viewed on YouTube and other video-sharing sites.