Correlations and Interactions
Correlation and Interaction effects with forests of random trees
A great strength of Recursive Partitioning is its ability to handle correlation and interaction effects among variables. Once a variable has been selected as a predictor, its effect has been factored out of the analysis, and all further splits are conditioned on the prior ones. However, it is difficult to distinguish between main effects and higher order interactions just by looking at a single tree. If, for example, age and sex each contributed a constant increase in activity independently of the other, a tree might be computed that splits first on age and then on sex, or vice versa. In this case, the fact that sex appears below age in a given tree does not necessarily indicate an interaction effect.

Figure 1. Random tree browsing window
Correlation effects are another important issue. If two variables are highly correlated, a split on one variable tends to preclude a subsequent split on the second, because that effect has already been factored out by the first split. From a single tree, however, it is not clear whether a variable is absent because of such a correlation or because there is too little data to separate all the effects: the small samples towards the leaves of a tree often cannot support a statistically significant split.
By creating a forest of random trees with the multiple tree function, it is possible to detect both correlation and interaction effects - even among variables of mixed types (nominal, continuous, binary, ordinal, genetic).
For example, consider the number of data points that a variable influences in a tree. A predictor variable that splits the root node influences the entire tree. A split on a daughter node influences all observations beneath that node. Suppose variable v1 influences 50% of the data points and variable v2 influences 20% of the data points. If they are independent, then on average we would expect 10% of the data points to be jointly influenced by both of them (50% × 20% = 10%). If the observed joint proportion is significantly lower than this expected value, this is evidence of a correlation effect among the variables. If it is significantly higher, this is evidence of an interaction effect among the variables.
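The independence check described above can be sketched in a few lines of Python. This is only an illustration, not the software's implementation: the function name `influence_summary` and the representation of "influenced" data points as boolean lists are assumptions made for the example.

```python
def influence_summary(influenced_v1, influenced_v2):
    """Compare observed and expected joint influence of two variables.

    Each argument is a list of booleans, one per data point, that is True
    when the data point lies beneath a split on that variable in a tree.
    (This representation is an assumption made for the sketch.)
    """
    n = len(influenced_v1)
    if n == 0 or n != len(influenced_v2):
        raise ValueError("both lists must be non-empty and of equal length")

    p1 = sum(influenced_v1) / n  # proportion influenced by v1
    p2 = sum(influenced_v2) / n  # proportion influenced by v2
    observed = sum(a and b for a, b in zip(influenced_v1, influenced_v2)) / n
    expected = p1 * p2           # joint proportion expected under independence
    return p1, p2, observed, expected


# Toy data: v1 influences 5 of 10 points, v2 influences 2 of 10 points,
# and exactly 1 point is influenced by both.
v1 = [True] * 5 + [False] * 5
v2 = [True] + [False] * 4 + [True] + [False] * 4
p1, p2, observed, expected = influence_summary(v1, v2)
print(p1, p2, observed, expected)
```

In this toy data the observed joint proportion matches the 10% expected under independence, so it gives no evidence of either a correlation or an interaction effect.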
Figure 1 shows a set of variables, Smoke? through Age, selected in the dialog screen. Clicking the "Correlation/Interaction View" button then calculates a plot of the correlation and interaction of these variables across a set of random trees. In this example, six descriptors have been selected for inclusion in the correlation/interaction plot.

Figure 2. The Correlation Interaction View Window.
In Figure 2, each black block on the diagonal shows the average proportion of cases described by the corresponding variable across all trees.
The numbers in the blocks in the upper triangle represent the average proportion of cases described jointly by the variable pair across all trees.
The numbers in the blocks in the lower triangle represent the number of standard deviations between the actual number of cases described by the variable pair and the expected number of cases.
A positive number (in red) denotes evidence of an interactive effect. A negative number (in blue) points to a correlation between the variables. Weak correlations and interactions are shown in white.
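The text does not say how the standard deviation in the lower triangle is computed. One plausible reading, shown here purely as an assumption, treats the number of jointly described cases as a binomial count under independence and expresses the difference between observed and expected counts in units of the binomial standard deviation. The function name `joint_z_score` and its parameters are hypothetical.

```python
import math


def joint_z_score(n_cases, p1, p2, observed_joint_cases):
    """Standard deviations between observed and expected joint counts.

    Assumption for this sketch: under independence the joint count follows
    a binomial distribution with n_cases trials and probability p1 * p2.
    The actual software may use a different estimate.
    """
    p_joint = p1 * p2
    if not 0.0 < p_joint < 1.0:
        raise ValueError("p1 * p2 must lie strictly between 0 and 1")

    expected = n_cases * p_joint
    std_dev = math.sqrt(n_cases * p_joint * (1.0 - p_joint))
    return (observed_joint_cases - expected) / std_dev


# Positive values suggest an interaction, negative values a correlation.
z_more = joint_z_score(1000, 0.5, 0.2, 130)  # more joint cases than expected
z_less = joint_z_score(1000, 0.5, 0.2, 70)   # fewer joint cases than expected
print(round(z_more, 2), round(z_less, 2))
```

The sign convention matches the description above: more jointly described cases than expected gives a positive value, fewer gives a negative value.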
In this example, we see that Body Mass Index (BMI) is correlated with smoking. In this particular data set, smokers weigh less than non-smokers. Also, Gen_A strongly interacts with smoking. The variable Gen_A on its own influences nearly 8% of the data (0.083 in the black square where Gen_A and Gen_A intersect).
However, the variables Smoke? and Gen_A also jointly influence nearly 8% of the data (0.083 in the red square in the upper triangle where Gen_A and Smoke? intersect). Because this number (0.083) is the same as the one in the black square, we conclude that the genetic effect is only activated in smokers.
If the two variables (Gen_A and Smoke?) were statistically independent, one would expect them to jointly influence only 5.56% of the data. Instead, the observed value is significantly (3.6 standard deviations) higher. From the tree corresponding to this plot, we can see that only female smokers have their response mitigated by Gen_A. We also find that every time Gen_A showed up in a tree, it was in combination with Smoke?.
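The arithmetic behind this reading can be checked with the figures quoted above. The sketch assumes, as in the earlier v1/v2 example, that the expected joint proportion is the product of the two individual proportions. The Smoke? proportion is not quoted in the text, so the value computed here is only what the quoted numbers imply under that assumption.

```python
# Figures quoted in the text for the Gen_A / Smoke? example.
gen_a_proportion = 0.083   # black diagonal square for Gen_A
joint_proportion = 0.083   # upper-triangle square for Gen_A and Smoke?
expected_joint = 0.0556    # joint proportion expected under independence

# Share of the Gen_A-influenced cases that are also influenced by Smoke?.
# A value of 1.0 means Gen_A never influences a case without Smoke?.
share_with_smoke = joint_proportion / gen_a_proportion

# Smoke? proportion implied by expected = P(Gen_A) * P(Smoke?).
# This is an inference from the quoted numbers, not a value from the plot.
implied_smoke_proportion = expected_joint / gen_a_proportion

print(share_with_smoke, round(implied_smoke_proportion, 2))
```

A share of 1.0 is what supports the conclusion that the genetic effect appears only together with smoking.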





