Identify Lead Series
Problem
With large numbers of compounds, classifying them into families that act according to a similar mechanism is a time-consuming and error-prone process.
Solution
Using ChemTree's multiple tree clustering algorithm, you can automatically cluster together compounds that act according to similar QSAR mechanisms. During model building, we discover the molecular features that drive activity. We can therefore calculate a similarity metric between compounds within our models and cluster them according to the features that drive activity, rather than according to molecular subgraph similarity. In the following, we show an example of automatically differentiating between two classes of MAO inhibitors, where ChemTree's multiple tree clustering feature allows you to:
- Simultaneously cluster compounds according to structure, activity and binding mechanism.
- Visualize the distillation of hundreds of models in a single cluster plot.
- Easily locate clusters of highly active compounds that are functioning according to a similar SAR.
- More easily identify lead series.
Step 1: Analyze your screening data with ChemTree. Then, use ChemTree to create a multiple tree model. A multiple tree model is a “forest” of many recursive partitioning (RP) trees where the descriptor to be split at each level of each tree is chosen at random from the best possibilities. By doing this, effects that may have been hidden from view when using only one tree are able to surface in at least some of the trees of the “forest”.
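
The randomized tree growing described in Step 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not ChemTree's implementation: the binary descriptors, the variance-reduction score, the `top_k` cutoff, and all function names are assumptions made for the example.

```python
import random

def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_gain(rows, activity, members, d):
    """Drop in summed squared error when `members` is split on binary descriptor d."""
    left = [i for i in members if rows[i][d] == 0]
    right = [i for i in members if rows[i][d] == 1]
    if not left or not right:
        return 0.0
    before = variance([activity[i] for i in members]) * len(members)
    after = (variance([activity[i] for i in left]) * len(left)
             + variance([activity[i] for i in right]) * len(right))
    return before - after

def grow_tree(rows, activity, members, rng, top_k=3, min_size=2, path=("root",)):
    """Return {compound index: root-to-leaf path} for one randomized RP tree."""
    gains = [(split_gain(rows, activity, members, d), d) for d in range(len(rows[0]))]
    # Keep only the top_k useful splits (top_k is an assumed tuning parameter).
    candidates = sorted((g for g in gains if g[0] > 1e-12), reverse=True)[:top_k]
    if len(members) < min_size or not candidates:
        return {i: path for i in members}
    _, d = rng.choice(candidates)  # random pick among the best possibilities
    leaves = {}
    for value in (0, 1):
        child = [i for i in members if rows[i][d] == value]
        leaves.update(grow_tree(rows, activity, child, rng, top_k, min_size,
                                path + ((d, value),)))
    return leaves

def grow_forest(rows, activity, n_trees=10, seed=0, top_k=3):
    """Grow several randomized trees over the same compounds."""
    rng = random.Random(seed)
    members = list(range(len(rows)))
    return [grow_tree(rows, activity, members, rng, top_k) for _ in range(n_trees)]

# Toy data (made up): 6 compounds, 3 binary descriptors, activity on a 0-3 scale.
rows = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0]]
activity = [3, 3, 2, 0, 0, 1]
forest = grow_forest(rows, activity, n_trees=5, seed=1)
```

Because each split is drawn at random from several good candidates rather than always taking the single best one, different trees can follow different descriptors, which is what lets otherwise hidden effects surface somewhere in the forest.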

Step 2: Once a multiple tree model is created, use ChemTree's multiple tree clustering feature. ChemTree measures a “distance” (lack of similarity) between each pair of compounds. For a given tree, the “distance” between compounds i and j is the number of observations in the deepest node in which i and j reside together, divided by the total number of compounds. We average this metric over all the trees, and cluster the all-pairs distance matrix by sorting it on its first principal component. The idea behind this metric is that, in general, the smaller the subgroup a set of compounds falls into, the more rules they share. Therefore, the features driving activity also drive the clustering.
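
As a rough illustration of the distance metric and the reordering in Step 2, the following Python sketch computes the averaged distance matrix and an ordering along its first principal component. The data layout (each compound represented by its root-to-leaf path in each tree), the function names, and the use of power iteration to find the principal component are assumptions made for the example, not a description of ChemTree's internals.

```python
def tree_distance(paths):
    """Pairwise distances for one tree.

    paths[i] is compound i's root-to-leaf path as a tuple of node labels;
    every path is assumed to start at the same root label.
    """
    n = len(paths)
    sizes = {}  # node (identified by its path prefix) -> number of compounds in it
    for p in paths:
        for depth in range(1, len(p) + 1):
            sizes[p[:depth]] = sizes.get(p[:depth], 0) + 1
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            depth = 0  # length of the shared prefix = deepest common node
            for a, b in zip(paths[i], paths[j]):
                if a != b:
                    break
                depth += 1
            dist[i][j] = sizes[paths[i][:depth]] / n
    return dist

def forest_distance(forest):
    """Average the per-tree distance over all trees in the forest."""
    n = len(forest[0])
    total = [[0.0] * n for _ in range(n)]
    for paths in forest:
        d = tree_distance(paths)
        for i in range(n):
            for j in range(n):
                total[i][j] += d[i][j] / len(forest)
    return total

def first_pc_order(dist, iterations=200):
    """Order compounds by their score on the first principal component."""
    n = len(dist)
    col_mean = [sum(dist[i][j] for i in range(n)) / n for j in range(n)]
    centered = [[dist[i][j] - col_mean[j] for j in range(n)] for i in range(n)]
    # Power iteration on centered^T * centered; the non-constant start vector
    # is an arbitrary choice for this sketch.
    v = [float(j + 1) for j in range(n)]
    for _ in range(iterations):
        scores = [sum(row[j] * v[j] for j in range(n)) for row in centered]
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in v]
    scores = [sum(row[j] * v[j] for j in range(n)) for row in centered]
    return sorted(range(n), key=lambda i: scores[i])

# Toy forest (made up): two trees over six compounds, same top-level split.
tree_1 = [("r", "a", "c"), ("r", "a", "c"), ("r", "a", "d"),
          ("r", "b", "e"), ("r", "b", "e"), ("r", "b", "f")]
tree_2 = [("r", "a", "c"), ("r", "a", "d"), ("r", "a", "d"),
          ("r", "b", "e"), ("r", "b", "f"), ("r", "b", "f")]
D = forest_distance([tree_1, tree_2])
order = first_pc_order(D)
```

In this toy forest, compounds 0 to 2 and compounds 3 to 5 only ever meet at the root, so their mutual distance is 1.0, and the ordering places each group in its own contiguous block. Which block comes first depends on the arbitrary sign of the principal component. Note that, applied literally, the definition gives each compound a nonzero distance to itself (the relative size of its leaf).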
The symmetric plot above shows the all-pairs compound distance matrix with the compounds reordered by the first principal component of the distance matrix. Blue blocks along the diagonal represent groups of compounds that are similar in multiple tree space. Red regions indicate pairs of compounds that are different from each other.
The plot below shows the activity for the compounds on a scale of 0-3, with 3 being most active.

We see below that a number of high-activity compounds fall into a couple of clusters. Next, we zoom in on the high-activity clusters.

Step 3: Once we have identified clusters of similar compounds with high activity, we can view the compounds in the cluster. The two clusters found by ChemTree correspond to two different classes of compounds with very different binding mechanisms. Further, ChemTree automatically highlights the portion of the molecule that was used in distinguishing the compounds from less active ones using multiple tree atom highlighting:

The upper right cluster corresponds to pargyline-like compounds, which share a triple bond, a tertiary nitrogen, and an aromatic ring. These suicide inhibitors covalently attach to the MAO flavin cofactor.

The lower left cluster corresponds to a second class of compounds: the N-N-C(=O) group hydrolyzes to a hydrazine, and the resulting compound covalently binds to the protein as a suicide inhibitor.