2.2 Recursive Partitioning Primer

RP takes a set of data and, based on some criterion, partitions (splits) it into smaller sets. These smaller sets are, in turn, split into still smaller sets. This process continues recursively until further splitting no longer yields statistically meaningful information.

A visual metaphor used frequently when discussing RP, and used heavily in ChemTree, is that of a tree. The original data set represents the root node of the tree. The sets produced by the first partition represent the child nodes. The sets produced by the next partition represent the grandchild nodes, and so on. The leaf nodes represent the smallest sets, the results of the last partitioning.
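The recursive process and the tree it produces can be sketched in a few lines of Python. This is only an illustration, not ChemTree's algorithm: the median split and the min_size stopping rule are assumptions standing in for the statistical criterion, and the names partition and leaves are hypothetical.

```python
from statistics import median

def partition(values, min_size=2):
    """Recursively split a list of numbers at its median.

    Stopping rule (assumed for illustration only): stop when the set
    has fewer than 2 * min_size members, or when a split would leave
    one side empty.
    """
    node = {"values": values, "children": []}
    if len(values) < 2 * min_size:
        return node  # too small to split further: this is a leaf
    cut = median(values)
    low = [v for v in values if v <= cut]
    high = [v for v in values if v > cut]
    if not low or not high:
        return node  # the split separates nothing: this is a leaf
    node["children"] = [partition(low, min_size), partition(high, min_size)]
    return node

def leaves(node):
    """Collect the sets held by the leaf nodes, left to right."""
    if not node["children"]:
        return [node["values"]]
    found = []
    for child in node["children"]:
        found.extend(leaves(child))
    return found

# The root node holds the original data set.
tree = partition([1, 2, 3, 4, 5, 6, 7, 8])
print(leaves(tree))
```

Here the root holds all eight values, its two child nodes hold four values each, and the four grandchild nodes are the leaves.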

Compound data used for RP analysis by ChemTree must first go through the import process, which produces a table of compound properties. (Compounds that may be selected by the Cherry Picking module need not be imported.)

Each column of data in the table represents a compound property, which could be a descriptor, a potency, or some other property, and has a name and a type associated with it. Because ‘name’ and ‘type’ are attributes of variables, we will use the words column and variable interchangeably throughout the manual. A row represents data for one compound, which acts as one “sample” of data.
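As a rough picture of such a table, the sketch below models columns as name/type pairs and rows as one record per compound. It does not reflect ChemTree's internal format: the Column class, the type labels, the column names, and all values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str  # the variable's name
    type: str  # the variable's type (labels here are hypothetical)

# Each column (variable) describes one compound property.
columns = [
    Column("mol_weight", "continuous"),  # a descriptor
    Column("has_ring", "binary"),        # a descriptor
    Column("potency", "continuous"),     # a potency
]

# Each row is one compound, that is, one sample. Values are invented.
rows = [
    {"mol_weight": 180.2, "has_ring": 1, "potency": 6.1},
    {"mol_weight": 310.4, "has_ring": 0, "potency": 4.3},
]

def column_values(rows, name):
    """Return the values of one variable across all compounds."""
    return [row[name] for row in rows]

print(column_values(rows, "potency"))
```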

To perform RP on a data set, dependent and independent variables have to be identified. Why? Recall that we are using RP to build a model (visually represented by a tree) which will ultimately be used to make predictions. We make predictions about something that we’d like to know based on what we do know. What we do know, the predictors, are identified in ChemTree as the independent variables. The dependent variable (or variables) is what we are trying to predict. An example of a dependent variable would be a compound potency.

When we first build the model, we must provide actual values for the dependent variable(s). In general, the more samples (rows) in our data set that have values for the dependent variable(s), the more accurate the model tends to be.

The independent variables are used to partition the data set. Each generation of the tree (child nodes, grandchild nodes, etc.) is produced by splitting the data set associated with the parent node using a specific independent variable. (The criteria for identifying which independent variable to use are discussed later in the manual.) An independent variable can be used to perform a split only one time in a particular branch of the tree. Each split identifies a rule or decision. An entire branch from the root to a leaf represents a sequence of rules or decisions involving the independent variables that can be used to predict the dependent variable(s). This is illustrated in the following tutorials.
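To make the idea of a branch as a sequence of rules concrete, the sketch below walks a small hand-built tree from the root to a leaf. It is a minimal illustration under stated assumptions, not ChemTree output: the tree layout, the variable names (mol_weight, has_ring), the thresholds, the leaf predictions, and the function follow_branch are all hypothetical.

```python
# A hand-built tree. Every variable name, threshold, and prediction
# below is invented for illustration.
tree = {
    "variable": "mol_weight", "threshold": 250.0,
    "low": {
        "variable": "has_ring", "threshold": 0.5,
        "low": {"prediction": 4.0},
        "high": {"prediction": 6.0},
    },
    "high": {"prediction": 3.0},
}

def follow_branch(node, compound):
    """Walk from the root to a leaf, recording the rule applied at each split.

    Returns the list of rules and the leaf's predicted value for the
    dependent variable.
    """
    rules = []
    while "prediction" not in node:
        name, cut = node["variable"], node["threshold"]
        if compound[name] <= cut:
            rules.append(f"{name} <= {cut}")
            node = node["low"]
        else:
            rules.append(f"{name} > {cut}")
            node = node["high"]
    return rules, node["prediction"]

# The compound supplies values for the independent variables only.
rules, predicted = follow_branch(tree, {"mol_weight": 180.2, "has_ring": 1})
print(rules, predicted)
```

Note that each independent variable appears at most once along any branch of this example tree, in keeping with the restriction described above.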