13.2 Viewing Correlation Interactions
13.2.1 About Correlation Interaction
If two descriptor variables are highly correlated, a split on the first tends to prevent a subsequent split on the second, because that effect has already been factored out by the first split. An example of such a correlation effect might be a pair of descriptors, one containing the splitter path length x between C(CC) and C(CCC) and the other containing path length x between C(CC) and C(CCN), where the two descriptors cover substantially the same regions of the molecules described.
On the other hand, there are instances in which a response is likely only if two of the descriptors are present together in a set of compounds. In this case there is an increase in the probability that the second group will split the compounds that were first split by the primary group. These are interaction effects.
It is difficult, by looking at a single tree, to distinguish between independent effects and higher order interactions. If we use path length as a descriptor variable, for example, and the path length between a C=S and a C=C-C group and the path length between a C-F and a Cu-N group each contribute a constant increase in activity independent of the other, then we could create a tree that first splits on C=S/C=C-C and then on C-F/Cu-N, or vice versa. The fact that C-F/Cu-N appears below C=S/C=C-C in a given tree does not necessarily indicate a higher order effect. In a single tree, the presence or absence of a variable may be due to higher order effects, or it may simply reflect too little data: the small samples toward the leaves of the tree may lack the significance needed to pull out all of the effects.
13.2.2 Determining Higher Order Effects
With a forest of trees, it is possible to detect both correlation and interaction effects, even among descriptor variables of mixed type (binary, integer, double or categorical). Consider how this might be done. We can speak of the number of compounds that a descriptor influences in a tree. A predictor variable that splits the root node influences the entire tree. A split on a daughter node influences all observations beneath that node. Suppose descriptor variable v1 influences 50% of the compounds across all trees and descriptor variable v2 influences 20% of the compounds across all trees. If they are independent, then on average we would expect 10% of the compounds to be jointly influenced by both of them (50% × 20% = 10%). If the observed joint proportion is significantly less than this, it is evidence of a correlation effect among the descriptors. If it is significantly more, it is evidence of an interaction effect among the descriptors.
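The independence reasoning above can be illustrated with a short Python sketch. It is not the SVS implementation: the function names and the fixed `tolerance` are hypothetical, and the tolerance only stands in for a proper significance test.

```python
def expected_joint(p1: float, p2: float) -> float:
    """Joint proportion expected if the two descriptors are independent."""
    return p1 * p2


def classify(observed_joint: float, p1: float, p2: float,
             tolerance: float = 0.02) -> str:
    """Label a descriptor pair by comparing observed and expected overlap.

    `tolerance` is a hypothetical placeholder for a significance test.
    """
    expected = expected_joint(p1, p2)
    if observed_joint < expected - tolerance:
        return "correlation"    # the pair co-occurs less often than expected
    if observed_joint > expected + tolerance:
        return "interaction"    # the pair co-occurs more often than expected
    return "independent"


# v1 influences 50% of compounds, v2 influences 20%
print(expected_joint(0.5, 0.2))
print(classify(0.03, 0.5, 0.2))
print(classify(0.18, 0.5, 0.2))
```

With v1 at 50% and v2 at 20%, an observed joint proportion of 3% falls well below the expected 10% and is labeled a correlation, while 18% is labeled an interaction.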
By using the above dialog and selecting a set of descriptor variables, it is possible to produce a correlation/interaction view of the descriptor variables across a set of random trees.
13.2.3 The Correlation Interaction View
Figure 13.3 shows a matrix of the selected variables in the order they were sorted in the Multitree Model window.
The numbers appearing in the black diagonal blocks represent the average proportion of cases described by the variable across all trees. They show the proportion of all subjects, over all enumerated random trees, for which the indicated variable was a subject splitter.
13.2.4 The Upper Triangle
The upper triangle of colored blocks represents the average proportion of cases described by the variable pair across all trees. That is, the average proportion of cases that the two variables jointly describe. Each cell in the upper triangle displays the proportion of all subjects over all enumerated random trees for whom the indicated variables were both subject splitters.
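As an illustration of how the diagonal and upper-triangle proportions could be tallied, the following Python sketch uses a hypothetical representation of a forest: each tree is a dictionary mapping a variable name to the set of compound IDs that variable influences in that tree. It assumes every tree is grown on the same set of compounds. The data layout and the function name are assumptions for illustration and are not taken from SVS.

```python
from itertools import combinations


def influence_proportions(trees, n_compounds, variables):
    """Return (single, joint) proportions over all trees.

    trees: list of dicts {variable name: set of influenced compound IDs}
    n_compounds: number of compounds each tree was grown on (assumed equal)
    """
    total = len(trees) * n_compounds
    # Diagonal: proportion of cases influenced by each variable
    single = {
        v: sum(len(t.get(v, set())) for t in trees) / total
        for v in variables
    }
    # Upper triangle: proportion of cases influenced by both variables
    joint = {
        (a, b): sum(len(t.get(a, set()) & t.get(b, set())) for t in trees) / total
        for a, b in combinations(variables, 2)
    }
    return single, joint


# Toy forest: two trees over four compounds (IDs 0..3)
toy_trees = [
    {"v1": {0, 1, 2, 3}, "v2": {0, 1}},  # v1 splits the root, v2 a daughter node
    {"v2": {0, 1, 2, 3}},                # v2 splits the root, v1 is unused
]
single, joint = influence_proportions(toy_trees, 4, ["v1", "v2"])
print(single)
print(joint)
```

In this toy forest, v1 influences 4 of the 8 cases, v2 influences 6 of 8, and the two jointly influence 2 of 8.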
13.2.5 The Lower Triangle
The lower triangle of colored blocks represents the statistical difference in standard deviations between the actual number of cases described by the variable pair, and the expected number of cases. A positive number represents evidence of an interaction effect, and a negative number represents evidence of a correlation between the variables. One unit represents one standard deviation.
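A difference expressed in standard deviations can be sketched as a z-score. The exact statistic SVS computes is not reproduced in this text, so the form below is an assumption: it treats the joint count as binomial with the success probability implied by independence. The function name and the `n_total` parameter (the total number of cases over all trees) are hypothetical.

```python
import math


def interaction_z(n_i: int, n_j: int, n_ij: int, n_total: int) -> float:
    """Observed minus expected joint count, in standard deviations.

    Assumes a binomial model under independence (an assumption, not
    necessarily the statistic SVS uses).
    """
    p_expected = (n_i / n_total) * (n_j / n_total)
    expected = n_total * p_expected
    sd = math.sqrt(n_total * p_expected * (1.0 - p_expected))
    if sd == 0.0:
        # Undefined when a variable influences no cases or all cases;
        # this sketch returns 0.0 by convention.
        return 0.0
    return (n_ij - expected) / sd


# 1000 cases: variable i influences 500, variable j influences 200
print(interaction_z(500, 200, 130, 1000))  # positive: more overlap than expected
print(interaction_z(500, 200, 70, 1000))   # negative: less overlap than expected
```

Under this assumed model, a positive result points toward an interaction and a negative result toward a correlation, matching the sign convention described above.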
Coloring is as follows:
Blue represents evidence of correlation. Darker blue represents more evidence of correlation.
Red represents evidence of interaction. Darker red represents more evidence of an interaction effect.
The color spectra range from zero to four standard deviations.
Closer to white indicates the variables are independent.
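The coloring rules above can be approximated with a small Python sketch. The specific RGB values and the linear fade are assumptions for illustration; only the blue/red/white scheme and the four standard deviation range come from the description.

```python
def z_to_rgb(z: float, max_sd: float = 4.0) -> tuple:
    """Map a standard-deviation score to an (R, G, B) color.

    Negative scores shade toward blue (correlation), positive scores
    toward red (interaction), and scores near zero stay close to white.
    Scores beyond max_sd are clipped to the most saturated color.
    """
    strength = min(abs(z), max_sd) / max_sd   # 0.0 (white) .. 1.0 (saturated)
    fade = round(255 * (1.0 - strength))      # value of the faded channels
    if z < 0:
        return (fade, fade, 255)
    if z > 0:
        return (255, fade, fade)
    return (255, 255, 255)


print(z_to_rgb(0.0))    # white: independent
print(z_to_rgb(-8.5))   # saturated blue: clipped at four standard deviations
print(z_to_rgb(6.0))    # saturated red: clipped at four standard deviations
```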
Suppose n_i describes the number of observations influenced by variable i, and n_j describes the number of observations influenced by variable j, and n_ij describes the number jointly influenced. Then the statistic computed is:

Figure 13.4 shows a sample correlation/interaction matrix.
The strongest correlation is between PLHI:C(CCC)-C(CCC) and PLHI:C(CC)-C(CCC), at 8.5 standard deviations below what independence would predict. These variables are almost synonymous. The strongest interaction is between PLLO:C(CCl)-C(CCl) and PLLO:C(CCO)-C(CO), at 6.0 standard deviations above what independence would predict.