ChemTree Drug Discovery Software Technology Review

High throughput screening (HTS) is the process of screening large numbers of compounds against disease targets in order to identify biologically active compounds.

Today’s "brute-force" approach to HTS enables scientists to meet their screening goals, but it is not without problems. Chief among them is the high cost of the procedure. A "smart" approach to HTS can secure the same number of hits while testing far fewer compounds. The savings are substantial, because far less money has to be spent on compound acquisition, reagents, labor, and disposal.

Sequential High Throughput Screening is the "smart" approach to HTS. It is the iterative process of screening a sample of compounds for activity, analyzing the results, and selecting a new set of compounds for screening, based on what has been learned from the previous screens. Selection of compounds is driven by finding Structure Activity Relationships (SARs) within the screened compounds and using those relationships to drive further selection.
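The screen, analyze, select loop described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ChemTree's actual procedure: the four feature names, the hidden activity rule inside `assay`, the batch size, and the crude scoring step that stands in for a real SAR analysis are all hypothetical.

```python
from itertools import product

FEATURES = ("ring", "amine", "halogen", "carbonyl")  # hypothetical descriptors

def build_library(copies=25):
    """Toy library: every on/off combination of the features, repeated."""
    combos = list(product((0, 1), repeat=len(FEATURES)))
    return [dict(zip(FEATURES, bits)) for _ in range(copies) for bits in combos]

def assay(compound):
    """Stand-in for the wet-lab screen; the selector never reads this rule."""
    return compound["ring"] == 1 and compound["amine"] == 1

def learn_rates(results):
    """Observed hit rate for each (feature, value) pair among screened compounds."""
    rates = {}
    for name in FEATURES:
        for value in (0, 1):
            outcomes = [hit for cpd, hit in results if cpd[name] == value]
            rates[(name, value)] = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return rates

def sequential_screen(library, batch_size=32, rounds=3):
    """Screen a batch, analyze the results, then pick the next batch accordingly."""
    untested = list(library)  # copy, so the caller's library is left untouched
    results = []
    for round_no in range(rounds):
        if round_no > 0:
            rates = learn_rates(results)
            # Rank untested compounds by how well their feature values did so far.
            untested.sort(key=lambda c: -sum(rates[(n, c[n])] for n in FEATURES))
        batch, untested = untested[:batch_size], untested[batch_size:]
        results.extend((cpd, assay(cpd)) for cpd in batch)
    return results

library = build_library()
smart_hits = sum(hit for _, hit in sequential_screen(library))
brute_hits = sum(assay(cpd) for cpd in library[:96])  # same budget, no learning
print("sequential hits:", smart_hits, "| unguided hits:", brute_hits)
```

Both strategies test 96 compounds. The only difference is that the sequential version uses the first round's results to choose what to test in later rounds.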

Recursive partitioning (RP) is the advanced statistical methodology that makes smart HTS possible, by identifying relationships between specific chemical structural features of the molecules and biological activity. The premise is that the biological activity of a compound is a consequence of its molecular structure. Accordingly, it is very useful to identify those aspects of molecular structure that are relevant to a particular biological activity. By gaining a better understanding of the mechanism by which the compound acts, we can better select additional compounds for screening.

We have posted an excellent article about RP by Professor Douglas Hawkins on our web site. This article was first published in BIOINFORMATICS WORLD in June of 2003. Among other things, it explains how RP can sharpen the drug discovery process.

Quantitative Structure Activity Relationship (QSAR) models are built from a training set: a set of compounds whose molecular structure and biological activity are both known. QSAR approaches that depend on the linear model assumption may generate unreliable results. The linear model makes two assumptions: that activity varies linearly with the level of whatever features affect it, and that there are no interactions among the different features.

Both of these assumptions are highly suspect when relating chemical structural features to biological activity. Activity can easily result from threshold effects: a feature must be present at or above some threshold level for activity to occur. Furthermore, interactions between features are observed in many QSAR settings, with the utility of one feature depending upon the presence of another. Activity may require the simultaneous presence of two features. In particular, a molecule may be active only if two features are separated by some optimal distance; if the features are too close together or too far apart, the compound is inactive.
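The optimal-distance case can be made concrete with a few lines of arithmetic. The sketch below assumes a hypothetical rule in which a compound is active only when two features are 4 to 6 units apart; the distance grid and the cut points are illustrative, not taken from any real assay. It fits a straight line by ordinary least squares and compares the result with a rule that uses two cut points on the same predictor.

```python
def active(distance):
    """Hypothetical rule: active only when the features are 4 to 6 units apart."""
    return 1 if 4 <= distance <= 6 else 0

distances = list(range(11))  # separations 0, 1, ..., 10 (arbitrary units)
activity = [active(d) for d in distances]

# Ordinary least-squares fit of: activity = intercept + slope * distance
n = len(distances)
mean_d = sum(distances) / n
mean_y = sum(activity) / n
sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(distances, activity))
sxx = sum((d - mean_d) ** 2 for d in distances)
slope = sxy / sxx
intercept = mean_y - slope * mean_d
linear_sse = sum((y - (intercept + slope * d)) ** 2
                 for d, y in zip(distances, activity))

def window_rule(distance, low=4, high=6):
    """Two cut points on one predictor; here they are given, not learned."""
    return 1 if low <= distance <= high else 0

window_sse = sum((y - window_rule(d)) ** 2 for d, y in zip(distances, activity))

print(f"slope = {slope:.3f}")
print(f"linear SSE = {linear_sse:.3f}, two-cut SSE = {window_sse}")
```

Because the active window sits in the middle of the range, the fitted slope is essentially zero and the line predicts the same activity for every compound. A rule with two thresholds on the same predictor reproduces the data without error.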

Recursive Partitioning methods (Hawkins and Kass, 1982; Breiman et al., 1984) overcome these difficulties. RP methods are able to model nonlinear relationships of almost arbitrary form, even in the presence of strong interactions between the predictors. The output of a recursive partitioning analysis is a dendrogram (tree) in which predictors are used to progressively split the data set into smaller and more homogeneous subsets. If some node in the dendrogram contains mainly active compounds, then the detailed path by which its molecules are split out provides a clue to the molecular structures that are associated with activity. The path to a node whose cases are predominantly inactive is a clue to the molecular structures that have no bearing on activity, or that actively inhibit it. Hawkins, Young, and Rusinko (1997) provide an illustration of the analysis of a screening data set using FIRM (Formal Inference-based Recursive Modeling).
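The splitting idea can be sketched in a few dozen lines. This is a minimal illustration and not the FIRM or CART algorithm: it assumes binary present/absent features, chooses each split by the largest reduction in the within-node sum of squares, and stops when no split improves homogeneity. The feature names and the toy activity rule (active only when both `ring` and `amine` are present) are hypothetical.

```python
from itertools import product

def sse(values):
    """Sum of squared deviations from the mean; 0 for a homogeneous node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, features):
    """Feature whose present/absent split most reduces the SSE, or None."""
    parent = sse([y for _, y in rows])
    best, best_gain = None, 0.0
    for f in features:
        present = [y for x, y in rows if x[f]]
        absent = [y for x, y in rows if not x[f]]
        if not present or not absent:
            continue  # this feature does not divide the node
        gain = parent - sse(present) - sse(absent)
        if gain > best_gain + 1e-12:
            best, best_gain = f, gain
    return best

def grow(rows, features, path=()):
    """Split recursively; return terminal nodes as (path, size, mean activity)."""
    f = best_split(rows, features)
    if f is None:
        ys = [y for _, y in rows]
        return [(path, len(rows), sum(ys) / len(ys))]
    present = [(x, y) for x, y in rows if x[f]]
    absent = [(x, y) for x, y in rows if not x[f]]
    return (grow(present, features, path + (f + "=1",))
            + grow(absent, features, path + (f + "=0",)))

features = ("ring", "amine", "halogen", "carbonyl")  # hypothetical descriptors
rows = []
for bits in product((0, 1), repeat=len(features)):
    x = dict(zip(features, bits))
    rows.append((x, 1 if x["ring"] and x["amine"] else 0))  # interaction effect

leaves = grow(rows, features)
for path, size, mean in leaves:
    print(" -> ".join(path), f"n={size}", f"mean activity={mean:.2f}")
```

In this toy data the path to the all-active node passes through both `ring=1` and `amine=1`, which is the kind of clue to structure the dendrogram is meant to provide. The two irrelevant features are never used for a split.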

There are two standard uses for the dendrogram. First, its structure indicates which predictors are important for explaining the dependent variable. The second is prediction: a future case whose dependent variable is unknown is followed down the tree, according to its independent variables, to the terminal node into which it falls, and the mean of the data in that node serves as the prediction for the new observation. Rusinko, Farmen, Lambert, Brown, and Young (1999) demonstrate the predictive power of RP, reporting a 1500% increase in hit rate over random selection for monoamine oxidase (MAO) inhibitors.
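The prediction step can be sketched as a short tree walk. Everything in the example is hypothetical: the tree is written by hand in the shape an RP analysis might produce, and its node sizes and mean activities are invented for illustration, not taken from the cited studies.

```python
# Hand-written tree; feature names, node sizes, and means are all hypothetical.
tree = {
    "feature": "ring",
    "present": {
        "feature": "amine",
        "present": {"mean": 0.80, "n": 50},
        "absent": {"mean": 0.05, "n": 150},
    },
    "absent": {"mean": 0.01, "n": 800},
}

def predict(node, compound):
    """Drop a compound down the tree; the terminal node's mean is the prediction."""
    while "feature" in node:
        branch = "present" if compound.get(node["feature"]) else "absent"
        node = node[branch]
    return node["mean"]

def totals(node):
    """Return (expected hits, compound count) summed over the terminal nodes."""
    if "feature" not in node:
        return node["mean"] * node["n"], node["n"]
    hits_p, count_p = totals(node["present"])
    hits_a, count_a = totals(node["absent"])
    return hits_p + hits_a, count_p + count_a

hits, count = totals(tree)
baseline = hits / count  # hit rate expected from random selection

new_compound = {"ring": 1, "amine": 1, "halogen": 0}  # activity not yet measured
predicted = predict(tree, new_compound)
print(f"baseline hit rate = {baseline:.4f}")
print(f"predicted activity = {predicted:.2f}")
print(f"ratio to baseline = {predicted / baseline:.1f}")
```

Ranking untested compounds by their predicted node mean and screening the highest-scoring ones first is one way the tree can raise the hit rate above that of random selection.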

 

1. Hawkins, D.M. and Kass, G.V. (1982) Automatic Interaction Detection. In Topics in Applied Multivariate Analysis; Hawkins, D.M., Ed.; Cambridge University Press, pp. 269-302.

2. Hawkins, D.M., Young, S.S., and Rusinko, A. III (1997) “Analysis of a large structure-activity data set using recursive partitioning,” Quant. Struct.-Act. Relat. 16, 296-302.

3. Rusinko, A. III, Farmen, M.W., Lambert, C.G., Brown, P.L., and Young, S.S. (1999) “Analysis of a large structure/biological activity data set using recursive partitioning,” J. Chem. Inf. Comput. Sci. 39(6), 1017-1026.

4. Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984) Classification and Regression Trees. Chapman and Hall.