Recursive Partitioning Technology Review
By Professor Douglas Hawkins, School of Statistics, University of Minnesota
*This article is reprinted here with the kind permission of BIOINFORMATICS World and Douglas Hawkins.
Partitioning: Divide and Model
One of the best ways to sharpen the drug discovery process, and many other scientific modeling problems, is to partition the data. Now there is software available to do it for you, writes Douglas Hawkins.
Modeling and prediction problems are common across the sciences. Examples occur in such diverse areas as archaeology (finding characteristics that distinguish pot shards of one era from those of another), chemistry (quantitative structure-activity relationship modeling), environmetrics (identifying sources of pollution), geology (understanding the factors driving sedimentation), medicine (developing prognostic indicators for patient recovery), sociology (finding indicators of anomie), and zoology (improving taxonomic definitions and classification).
A modeling or prediction problem arises when we have some 'dependent' variable of interest, and a number of 'predictor' variables thought to be related to the dependent. In this setting, two interrelated objectives are:
- To establish which predictors are related to the dependent
As a non-science example, the response to a marketing mailing is affected by many characteristics of the mailing itself (product, incentives, description) and of the respondents (gender, age, SES). Effective mailing matches the piece and the circulation list to optimize the response, and to do this you need to know which characteristics affect response, and in what way. The relationship between predictors and response may be complex. Intuition rather than data analysis would tell you to send a piece advertising power tools to men in their 30s and 40s, while information on knitting patterns might go mainly to new grandmothers. Common data analysis techniques do not uncover this kind of 'interaction' between the predictors at all effectively.
- To generate prediction rules by which the dependent can be predicted
For example, a bank may have a 'learning' data set of individuals whose use of credit has been measured. This credit-worthiness is the dependent variable. They may also have information for each individual on a large set of predictors - type and length of home residence, education, occupation, length of time in current job and so on. The problem then is to derive a credit-scoring rule by which to calculate the credit worthiness of future applicants.
These objectives are related in that you can't predict anything unless there is a relationship; and once you have modeled the relationship, making predictions is an immediate application.
Recursive partitioning (RP) - giving a sturdier analysis

Figure 1: Dendrogram of MAO activity
The traditional tools for handling both these objectives are 'global fit' methods like multiple regression analysis and neural nets. Recursive partitioning is a completely different way of approaching the problems. An example helps to illustrate both what recursive partitioning is and the ways in which it may be more effective than these global fit methods. The data set is a collection of 586 chemicals, of interest as possible monoamine oxidase inhibitors (MAOIs). Each compound has a measured biological activity on the scale 0, 1, 2 or 3, with 3 being the (desirable) high level of activity. Also recorded on each compound are 156 predictors - the number of occurrences within the compound of each of 156 molecular fragments thought to be possibly related to biological activity.
The analyses below were made using the FIRMPlus recursive partitioning system, produced and distributed by Golden Helix. FIRM (Formal Inference-based Recursive Modeling) is an RP system developed at the University of Minnesota. As the name suggests, FIRMPlus uses the conceptual framework of FIRM, but with major enhancements in both the analysis capability and user interface. [FIRMPlus is sold as Optimus RP Software by Golden Helix, Inc.]
Recursive partitioning (RP) tries to find a single predictor whose different values split the compounds up into subgroups of more homogeneous activity. In this data set, the most successful such split turns out to separate the compounds according to the number of occurrences of molecular feature number 43. The summary statistics are:
- All 586 compounds: average activity = 0.29. Split into three 'daughter' groups:
- 0 occurrences of feature 43: 480 compounds, average activity 0.23;
- 1 occurrence of feature 43: 75 compounds, average activity 0.7; and
- 2 or more occurrences: 31 compounds, average activity 0.2.
This immediately shows up a big difference between RP and linear regression. In a linear regression on the number of occurrences, if one occurrence is better than none, then two occurrences must be better still. But this is not the case in this data set. The more active compounds are those with one occurrence; having two or more is no better than having no occurrences.
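The contrast with a straight-line fit can be checked directly from the three group summaries quoted above. The following is a minimal sketch, not part of the original analysis: it assumes the '2 or more' group can be coded as exactly 2 occurrences, and it fits a weighted least-squares line through the group means rather than through the individual compounds, which are not available here.

```python
# Daughter groups from the feature-43 split, as reported in the text:
# (label, count used for the line fit, number of compounds, mean activity)
groups = [
    ("0 occurrences", 0, 480, 0.23),
    ("1 occurrence", 1, 75, 0.7),
    ("2 or more", 2, 31, 0.2),  # assumption: coded as exactly 2 for illustration
]

total_n = sum(n for _, _, n, _ in groups)
overall_mean = sum(n * m for _, _, n, m in groups) / total_n

# Weighted least-squares line through the three group means.
x_bar = sum(n * x for _, x, n, _ in groups) / total_n
sxy = sum(n * (x - x_bar) * (m - overall_mean) for _, x, n, m in groups)
sxx = sum(n * (x - x_bar) ** 2 for _, x, n, _ in groups)
slope = sxy / sxx
intercept = overall_mean - slope * x_bar

print(f"n = {total_n}, overall mean = {overall_mean:.2f}")
for label, x, n, m in groups:
    line = intercept + slope * x
    print(f"{label:>14}: group mean {m:.2f}, line predicts {line:.2f}")
```

Under these assumptions the fitted slope is positive, so the line predicts its highest activity for the '2 or more' group, which is the group the partition shows to be no better than the compounds with no occurrences.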
The 'recursive' part of RP comes in that we now repeat this same partitioning on each of the three 'daughter' groups created by splitting up the total group.
- The first group, those with no occurrences of feature 43, turn out to be split most effectively on the number of occurrences of feature 12. The summary statistics are:
- No occurrences of feature 12: 403 compounds, average activity 0.14; and
- One or more occurrences: 77 compounds, average activity 0.7.
- The third group, those with two or more occurrences of feature 43, are split most effectively using a completely different feature - feature 94. This use of different predictors in different 'parts' of the model is another vital difference between RP and 'global fit' methods like multiple regression. Here the summary statistics are:
- No occurrences of feature 94: 28 compounds, mean activity 0.07; and
- One or more occurrences: three compounds, mean activity 1.7.
These last three nuggets of high activity suggest the chemistry of two or more 43s and one or more 94s might lead to new, effective MAOI drugs. These splitting results are most easily seen in the 'dendrogram', which is shown as Figure 1 (above).
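The split-then-repeat procedure can be sketched in a few lines. This is an illustrative toy, not FIRMPlus's algorithm: the splitting criterion (lowest within-group sum of squares, one daughter per distinct predictor value), the stopping rules, and the eight-row data set are all assumptions made for the example. FIRM's own formal-inference criterion is not reproduced here.

```python
from collections import defaultdict

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, ys, n_features):
    """Return (score, feature, groups) with the lowest within-group SSE, or None."""
    best = None
    for j in range(n_features):
        groups = defaultdict(list)
        for i, row in enumerate(rows):
            groups[row[j]].append(i)
        if len(groups) < 2:          # predictor is constant in this node
            continue
        score = sum(sse([ys[i] for i in idx]) for idx in groups.values())
        if best is None or score < best[0]:
            best = (score, j, dict(groups))
    return best

def grow(rows, ys, n_features, min_size=2, depth=0, max_depth=3):
    """Recursively partition the data; each node records its size and mean."""
    node = {"n": len(ys), "mean": sum(ys) / len(ys)}
    if depth >= max_depth or len(ys) < 2 * min_size:
        return node
    found = best_split(rows, ys, n_features)
    if found is None or found[0] >= sse(ys) - 1e-12:   # no improvement
        return node
    _, j, groups = found
    node["feature"] = j
    node["children"] = {
        value: grow([rows[i] for i in idx], [ys[i] for i in idx],
                    n_features, min_size, depth + 1, max_depth)
        for value, idx in groups.items()
    }
    return node

# Hypothetical data: two count predictors and an activity score.
rows = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
ys = [0, 0, 1, 1, 2, 2, 0, 0]
tree = grow(rows, ys, n_features=2)
```

On this toy data the root splits on the first predictor, and only the daughter with value 0 is split again, on the second predictor. That mirrors the pattern in the MAOI example, where different daughters use different predictors.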
Each box or 'node' represents one group of chemicals. The line 'n=' gives the number of chemicals in that node, and the 'u=' line gives their average activity. Not only does the dendrogram clearly show the paths that lead to high activity and low activity compounds, it can also be used to predict. To predict the MAOI activity of some future compound, simply note the number of occurrences of feature 43 and use this to figure out which way to make the first branch. Then, depending on whether this takes you down the left, the center, or the right branch, note the number of occurrences of feature 12 (left branch) or feature 94 (right branch). Use the average activity of the compounds in the 'node' you end up in as the prediction for the new compound.
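Written as code, the prediction rule for the first two levels of the tree is a short chain of branches. The function name and argument names below are hypothetical. The node averages are the ones quoted in the text, and the further splits of the full model are not included.

```python
def predict_activity(f43, f12=0, f94=0):
    """Return the mean activity of the node a compound falls into.

    f43, f12 and f94 are the numbers of occurrences of features 43, 12 and 94.
    Only the splits described in the text are modeled.
    """
    if f43 == 0:
        # Left branch: split on feature 12.
        return 0.14 if f12 == 0 else 0.7
    if f43 == 1:
        # Center branch: no further split is described in the text.
        return 0.7
    # Right branch (two or more occurrences): split on feature 94.
    return 0.07 if f94 == 0 else 1.7

# A compound with two occurrences of feature 43 and one of feature 94
# lands in the small high-activity node.
print(predict_activity(f43=2, f94=1))
```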
The full RP model goes on for a number of further splits. In the process, it shows up yet another feature that cannot easily be captured using traditional modeling methods - that there are several high-activity nodes in completely different parts of the tree. What this tells us is that there is no single 'right' chemistry that leads to high MAOI activity; but rather that there are different mechanisms.
This tree illustrates the most striking strengths of analysis using RP, and why it has been so successful in sharpening the drug discovery process. RP is much more flexible than methods like multiple regression. It allows any sort of relationship between a predictor and the response - not just a straight line. It is happy to use different predictors in different parts of the tree, reflecting that one predictor may be relevant in one subgroup, and a different predictor in another subgroup.
There are some other strengths too. Missing information on predictors bedevils many methods of analysis; in FIRMPlus though, 'missing' is regarded as just another possible value of a predictor. This allows it to correctly handle 'predictive missingness' - the situation in which the fact that a predictor is missing is, in and of itself, helpful in predicting the dependent variable. For example 'current balance in your check account' might be used as a predictor in a credit score. What do you do with an applicant who does not have a check account? You could say that having no check account is the same as a check account with a balance of zero, but this is probably not true. You could use a 'missing at random' method that implied that not having a check account was the same as having one but accidentally forgetting to record it; this is probably not true either. FIRMPlus's approach automatically finds what check account balance most closely corresponds to the actual credit behavior of the people without such accounts, letting the data rather than some prior assumption guide the handling of the missing information.
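The idea of treating 'missing' as one more predictor value can be illustrated with a small sketch. The records, the balance bands, and the nearest-mean matching rule below are all hypothetical, chosen only to show the principle. They are not FIRMPlus's actual procedure.

```python
from collections import defaultdict

# Hypothetical (balance band, defaulted: 1 or 0) records.
# None means the applicant has no check account.
records = [
    ("low", 1), ("low", 1), ("low", 0),
    ("medium", 0), ("medium", 1), ("medium", 0), ("medium", 0),
    ("high", 0), ("high", 0), ("high", 0),
    (None, 1), (None, 0), (None, 0), (None, 0),
]

def category_means(data):
    """Mean response for every predictor value, with None kept as its own value."""
    totals = defaultdict(lambda: [0, 0])
    for category, y in data:
        totals[category][0] += y
        totals[category][1] += 1
    return {category: s / n for category, (s, n) in totals.items()}

means = category_means(records)
missing_mean = means[None]

# Which observed band behaves most like the applicants with no account?
closest = min(
    (c for c in means if c is not None),
    key=lambda c: abs(means[c] - missing_mean),
)
print(f"'missing' default rate {missing_mean:.2f} is closest to band '{closest}'")
```

In this made-up data the applicants without an account behave like the 'medium' band, not like a zero balance. The data, rather than a prior assumption, decides where they are placed.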
Multivariate RP
Some problems involve more than one dependent variable. If you have several dependent variables, you can obviously make separate analyses of each, but this can miss important effects. Suppose, for example, you measure people's height and weight. Separate analyses of height and of weight would not be able to identify obesity, which is defined, not by weight alone, but by a departure from the normal height-weight relationship.
FIRMPlus has a multivariate dependent variable capability that looks at the dependent variables simultaneously. This allows you to find both the patterns you would find in separate analyses of the individual dependent variables, and also those that separate analyses would miss because they can be seen only in a multivariate setting. While an alert user would supplement the weight and height dependents with a third - body mass index - and be able to uncover obesity in a set of univariate dependent analyses, multivariate FIRMPlus analysis makes this unnecessary, as it can uncover unanticipated as well as expected departures from the normal patterns of relationship.
Other FIRMPlus capabilities
Sometimes a collaboration between regression and RP is better than either method alone and, in line with this, FIRMPlus can combine the methods by regressing out a covariate, and then using RP to capture the finer structure. Another important, though more specialized, capability is generating not a single dendrogram but a whole forest of trees, which together give a richer picture of the data than you get from a single tree.
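The 'regress out, then partition' combination can be sketched as a two-step calculation. Everything in this example is an assumption made for illustration: the data, the single covariate, the simple least-squares line, and the one-step comparison of residual means that stands in for a full RP run.

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one covariate."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: the response rises with the covariate,
# plus a bump of 1 whenever the flag predictor equals 1.
covariate = [1, 2, 3, 4, 5, 6]
flag = [0, 1, 0, 1, 0, 1]
response = [2, 5, 6, 9, 10, 13]

# Step 1: regress out the covariate and keep the residuals.
intercept, slope = fit_line(covariate, response)
residuals = [y - (intercept + slope * x) for x, y in zip(covariate, response)]

# Step 2: use the residuals as the dependent variable for partitioning.
# Here a single split on the flag stands in for the full RP analysis.
by_flag = {}
for f, r in zip(flag, residuals):
    by_flag.setdefault(f, []).append(r)
residual_means = {f: sum(rs) / len(rs) for f, rs in by_flag.items()}
print(residual_means)
```

Once the smooth trend has been removed, the remaining structure (the flag effect in this toy case) shows up as a difference between the residual means of the two groups.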
FIRMPlus application areas
There are numerous examples of successful applications of RP in widely differing subject matter areas. Credit scoring is a setting where 'predictive missingness' and complex mechanisms make the FIRMPlus modeling flexibility invaluable. In geology, it can help isolate structures and conditions associated with mineral enrichment. It can be helpful in understanding failure modes of complex electronic assemblies.
In the light of current concern about drug-resistant infections that patients pick up as a result of stays in hospitals, it is worth mentioning that as early as 1981, an article in the American Journal of Medicine used RP to mine a large (by the standards of the day) data set of the occurrence of these 'nosocomial' infections and possible predictors describing the hospitals in the database.
About the author
Professor Douglas Hawkins has been at the School of Statistics, University of Minnesota since 1986. From 1994 to 2000 he was Chair of Applied Statistics. He has received many awards and honors in the field of applied statistics. In 1998 he received the annual 'Statistics in Chemistry' award from the American Statistical Association, for his paper 'Analysis of a large structure-activity data set using recursive partitioning' (D. M. Hawkins, S. S. Young and A. Rusinko III). He also serves on the Board of Directors of Golden Helix, Inc.