7.1 Tree Analysis Overview

ChemTree’s tree analysis engine is based on Formal Inference Recursive Modeling (FIRM) technology. FIRM has its roots in work done in the 1970s and 1980s by Dr. Douglas Hawkins (http://www.douglashawkins.com).

Early recursive partitioning approaches, such as AID, suffered from a lack of statistical rigor. Dr. Hawkins introduced statistical hypothesis testing as a way to better characterize the statistical validity of the models generated. FIRM was released in the early 1980s as a non-GUI package and is still in use today.

ChemTree has taken the statistical foundations of FIRM and augmented them with faster and more exact segmenting algorithms. It has also extended the FIRM methods to handle multivariate responses. We are grateful for the continued assistance of Dr. Hawkins in devising and improving many of the statistical and algorithmic methods underlying ChemTree.

The best way to gain a working understanding of tree analysis is to go through an example.

NOTE: In the rest of this chapter, we generally assume that only one variable (column) in the spreadsheet has been designated as dependent and that, if the dependent variable is categorical, it contains only two categories. See chapter 10 for a discussion of multiple dependent variables and multi-category dependent variables.

The parent population of 2018 binary observations in the example file, external.ghd, has a mean of 0.5; that is, half are active and half are inactive. The descriptor PLHI:C(CC)-O(N) is an integer distance range within a molecule, as described in section 4.3.2, Data Conversion Using HTS Convert. A given compound has a single value for this descriptor, anywhere from 1 to 18 in this data set, or a missing value. The software determines which distance ranges are optimal. In this example, statistical hypothesis testing shows that the data do not support more than a two-way split. However, the two-way split is highly significant according to a modified chi-squared statistic. The split’s p-value of 8.06e-58 is Bonferroni-adjusted to a larger value of 1.30e-54 because 1714 descriptors were searched before this one was found to be the most significant. We now have two populations, one with 43% active and one with 90% active. The effect of the first structural feature has now been factored out of the two subgroups.
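The Bonferroni step can be illustrated with a minimal sketch. It assumes the textbook form of the correction (raw p-value multiplied by the number of tests, capped at 1); the function name bonferroni_adjust is hypothetical and is not part of ChemTree. Plain multiplication by 1714 gives roughly 1.38e-54, the same order of magnitude as the 1.30e-54 quoted above. The exact correction ChemTree applies is not spelled out in this section, so the small difference should not be read as an error in either figure.

```python
def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Textbook Bonferroni correction: multiply by the number of tests, cap at 1."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, p_value * n_tests)


raw_p = 8.06e-58       # raw p-value of the split, taken from the text
n_descriptors = 1714   # number of descriptors searched, taken from the text

adjusted_p = bonferroni_adjust(raw_p, n_descriptors)
print(f"{adjusted_p:.2e}")  # prints 1.38e-54
```

The cap at 1 matters only for weak splits: a raw p-value of 0.5 tested against 10 descriptors adjusts to 1, not 5.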


[Picture]
Figure 7.1: This is the sample tree described in the text.

From here, we analyze each of the two subgroups, conditioned on the variable that drove the first split. The 1723 compounds are now split into three subsets.

The split reveals a highly significant predictor of activity, PLHI:C(CCO)-C(CO), for which three distance ranges are found. The low and high distance ranges have 90% and 83% activity, respectively, while the middle range (which includes missing values) has a low activity of 40%.

The recursive process of partitioning the split groups into subgroups can be continued until no statistically significant splits are found. The predictor for each split can be chosen automatically (the most significant one is used) or manually by a subject-matter expert. Manual splitting allows the user to pop up a list of significant predictors and choose the ones that make biological or chemical sense.
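The automatic mode can be sketched in a few lines. This is a simplified stand-in and not ChemTree's or FIRM's algorithm: it assumes a binary response, integer descriptors with no missing values, two-way threshold splits only, an ordinary Pearson chi-squared test (the text mentions a modified statistic), and a Bonferroni adjustment over every candidate split examined. All function names, the toy descriptors d1 and d2, and the 0.05 cutoff are hypothetical.

```python
import math


def chi2_p_value(a_act, a_n, b_act, b_n):
    """Pearson chi-squared p-value (1 d.f.) for active/inactive counts in two groups."""
    n = a_n + b_n
    act = a_act + b_act
    inact = n - act
    if a_n == 0 or b_n == 0 or act == 0 or inact == 0:
        return 1.0  # degenerate table: nothing to test
    stat = 0.0
    for group_act, group_n in ((a_act, a_n), (b_act, b_n)):
        for observed, col_total in ((group_act, act), (group_n - group_act, inact)):
            expected = group_n * col_total / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of chi-squared with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def best_split(rows, descriptors):
    """Return (adjusted_p, descriptor, threshold) for the best two-way split, or None."""
    candidates = []
    for name in descriptors:
        values = sorted({features[name] for features, _ in rows})
        for threshold in values[:-1]:  # every cut that leaves both sides non-empty
            left = [a for features, a in rows if features[name] <= threshold]
            right = [a for features, a in rows if features[name] > threshold]
            p = chi2_p_value(sum(left), len(left), sum(right), len(right))
            candidates.append((p, name, threshold))
    if not candidates:
        return None
    p, name, threshold = min(candidates)
    # Bonferroni adjustment over all candidate splits examined at this node.
    return min(1.0, p * len(candidates)), name, threshold


def grow(rows, descriptors, alpha=0.05):
    """Recursively split until no candidate split is significant at level alpha."""
    node = {"n": len(rows), "mean": sum(a for _, a in rows) / len(rows)}
    split = best_split(rows, descriptors)
    if split is None or split[0] > alpha:
        return node  # leaf
    adjusted_p, name, threshold = split
    node.update(descriptor=name, threshold=threshold, p=adjusted_p)
    node["left"] = grow([r for r in rows if r[0][name] <= threshold], descriptors, alpha)
    node["right"] = grow([r for r in rows if r[0][name] > threshold], descriptors, alpha)
    return node


# Toy data: d1 drives activity, d2 is unrelated to it.
rows = []
for i in range(200):
    d1 = i % 10 + 1
    d2 = (i // 10) % 5
    rows.append(({"d1": d1, "d2": d2}, 1 if d1 <= 3 else 0))

tree = grow(rows, ["d1", "d2"])
print(tree["descriptor"], tree["threshold"], tree["left"]["n"], tree["right"]["n"])
```

On this toy data the sketch makes a single split on d1 at threshold 3 and then stops, because both child nodes are pure and no further split can be significant. Manual mode would correspond to showing the user the sorted candidate list instead of taking its minimum.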


[Picture]
Figure 7.2: This is an example of the tree saved as a bitmap (.bmp) file

Once a recursive partitioning model relating structural features to compound activity has been built from the compound potency data, it can be used to predict the potency of new compounds. With the “Cherry Pick Compounds” feature (section 7.7.4), untested compounds are effectively “dropped” down the tree into their appropriate place according to their structural features. If the model is good, compounds that fall in high-potency nodes are more likely to be potent than compounds chosen at random. Averaging the predictions of multiple randomly generated trees can improve on the prediction of a single tree. Hit rates among cherry-picked compounds tend to be anywhere from 10- to 100-fold higher than among randomly selected compounds.
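The idea of dropping compounds down a tree and averaging over several trees can be sketched as follows. This is an illustration under stated assumptions, not the “Cherry Pick Compounds” implementation: the dictionary layout, the two hand-written trees, the descriptor names d1 and d2, the "missing_to" rule for missing values, and all function names are hypothetical.

```python
def predict(tree, features):
    """Drop one compound down a tree; return the mean activity of its leaf."""
    node = tree
    while "descriptor" in node:
        value = features.get(node["descriptor"])
        if value is None:
            # Assumption: each split records which branch takes missing values.
            branch = node["missing_to"]
        else:
            branch = "left" if value <= node["threshold"] else "right"
        node = node[branch]
    return node["mean"]


def predict_ensemble(trees, features):
    """Average the single-tree predictions across several trees."""
    return sum(predict(tree, features) for tree in trees) / len(trees)


def cherry_pick(trees, compounds, top_n):
    """Return the names of the top_n compounds ranked by averaged prediction."""
    ranked = sorted(
        compounds.items(),
        key=lambda item: predict_ensemble(trees, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


# Two made-up trees with made-up leaf means.
tree_a = {"descriptor": "d1", "threshold": 3, "missing_to": "right",
          "left": {"mean": 0.9}, "right": {"mean": 0.4}}
tree_b = {"descriptor": "d2", "threshold": 5, "missing_to": "left",
          "left": {"mean": 0.2}, "right": {"mean": 0.8}}

# Untested compounds described only by their (hypothetical) descriptor values.
compounds = {
    "cpd-1": {"d1": 2, "d2": 7},
    "cpd-2": {"d1": 6, "d2": 1},
    "cpd-3": {"d1": 2},  # d2 is missing for this compound
}

picked = cherry_pick([tree_a, tree_b], compounds, top_n=2)
print(picked)  # prints ['cpd-1', 'cpd-3']
```

Ranking by the averaged leaf mean means a compound must land in high-activity nodes of several trees to score well, which is one way to read the claim that averaging improves on a single tree.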

In figure 7.3 we visualize the compounds that reside in the node containing 295 compounds with an average potency of 0.9. The focal atoms of the structural descriptor that caused the highly significant split are circled, and the shortest paths between them are highlighted in orange. (See section 7.3, Working with Nodes, for more on visualizing compounds.) Path highlighting can be turned on and off from the edit menu. The oxygens in one or more NO2 groups are circled, as is an aromatic carbon. It may be that the highly reactive NO2 group induces a reaction at the aromatic carbon.


[Picture]
Figure 7.3: A 2-D rendering of the compounds

The next split in the tree finds a different mechanism for mutagenicity. All 29 compounds at left are mutagens, and the 1694 compounds at right are further split three ways by distance range of PLHI:C(CCO)-C(CO). The middle distance range has low potency, and the lower and upper ranges have high potency. Splitting can continue until no further significant splits are possible.