Recursive Partitioning Primer

Recursive partitioning (RP) takes a data set and, based on some criterion, partitions (splits) it into smaller sets. These smaller sets are, in turn, split into still smaller sets. The process continues recursively until further splitting yields no statistically meaningful information.

A visual metaphor frequently used when discussing RP, and used heavily in HelixTree, is that of a tree. The original data set is the root node of the tree. The sets produced by the first partition are the child nodes, the sets produced by the next partition are the grandchild nodes, and so on. The leaf nodes are the smallest sets, the results of the last partitioning.
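The recursion and the tree it produces can be sketched in a few lines of Python. This is only an illustration, not how HelixTree works internally: the splitting criterion (cut a sorted list at its midpoint) and the stopping rule (a minimum set size, `min_size`) are assumptions standing in for the statistical criteria, and the names `partition` and `leaves` are hypothetical.

```python
def partition(values, min_size=2):
    """Recursively split a list of numbers into a tree of smaller sets.

    Assumed stand-ins for illustration only: the split is made at the
    midpoint of the sorted list, and splitting stops once a set is too
    small to yield two children of at least `min_size` items.
    """
    values = sorted(values)
    if len(values) < 2 * min_size:
        return {"rows": values, "children": []}  # leaf node
    mid = len(values) // 2
    return {
        "rows": values,
        "children": [
            partition(values[:mid], min_size),
            partition(values[mid:], min_size),
        ],
    }


def leaves(node):
    """Collect the data sets held by the leaf nodes, left to right."""
    if not node["children"]:
        return [node["rows"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


# The root holds the whole data set; its children hold the first partition.
tree = partition([5, 1, 9, 3, 7, 2, 8, 4])
first_generation = [child["rows"] for child in tree["children"]]
smallest_sets = leaves(tree)
```

Here `tree["rows"]` is the root's data set, `first_generation` holds the two child sets, and `smallest_sets` holds the four leaf sets produced by the last partitioning.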

To perform RP on a data set, the dependent and independent variables must be identified. Why? Recall that we are using RP to build a model (visually represented by a tree) that will ultimately be used to make predictions. We predict something we would like to know from what we already know. What we already know, the predictors, are identified in HelixTree as the independent variables. The dependent variable or variables are what we are trying to predict.

The data imported into HelixTree for analysis must be in a tabular format. Each column of data has a name and a type associated with it. As ‘name’ and ‘type’ are attributes of variables, we will use the words column and variable interchangeably throughout the manual. A row is one sample in the data set.

When we first build the model, we must provide actual values for the dependent variable(s). In general, the more samples (rows) in our data set that have values for the dependent variable(s), the more accurate the model will be.
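As a rough illustration of the table layout and variable roles described above, the following Python sketch models a small table. It does not reflect HelixTree's file format or internals; the `Column` class, the helper functions, and the column names (`weight`, `has_ring`, `active`) are all hypothetical, and `None` is assumed to mark a missing value.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str     # column (variable) name
    dtype: type   # column (variable) type
    values: list  # one entry per row; None marks a missing value (assumption)


# Hypothetical columns, invented for this illustration.
table = [
    Column("weight", float, [180.2, 250.7, 310.4, 199.9]),
    Column("has_ring", bool, [True, False, True, True]),
    Column("active", bool, [True, False, None, True]),
]

# Roles: what we want to predict, and what we predict it from.
dependent = ["active"]
independent = [c.name for c in table if c.name not in dependent]


def row(table, i):
    """Return sample i as a mapping from column name to value."""
    return {c.name: c.values[i] for c in table}


def usable_rows(table, dependent):
    """Indices of rows that have a value for every dependent variable."""
    by_name = {c.name: c for c in table}
    n_rows = len(table[0].values)
    return [
        i for i in range(n_rows)
        if all(by_name[d].values[i] is not None for d in dependent)
    ]
```

In this sketch only three of the four rows carry a value for the dependent variable, so only those three could contribute to building the model.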

The independent variables are used to partition the data set. Each generation of the tree (child nodes, grandchild nodes, etc.) is produced by splitting the data set associated with the parent node on a specific independent variable. (The criteria for identifying which independent variable to use are discussed later in the manual.) An independent variable can be used to perform a split only once in any given branch of the tree. Each split identifies a rule or decision. An entire branch from the root to a leaf represents a sequence of rules or decisions involving the independent variables that can be used to predict the dependent variable(s). This is illustrated in the following tutorials.
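The idea that each branch reads as a sequence of rules can be sketched as follows. The selection criterion used here (pick the variable whose split leaves the fewest misclassified rows) is a simple stand-in chosen for illustration and is not claimed to be the criterion HelixTree uses; the function names and the sample columns (`ring`, `charge`, `active`) are likewise hypothetical.

```python
from collections import Counter


def majority(rows, target):
    """Most common value of the dependent variable among the rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def errors_after_split(rows, var, target):
    """Rows misclassified if each subset predicts its own majority value."""
    groups = {}
    for r in rows:
        groups.setdefault(r[var], []).append(r[target])
    return sum(len(g) - Counter(g).most_common(1)[0][1] for g in groups.values())


def build(rows, available, target):
    """Split recursively; a variable is dropped from a branch once used in it."""
    if len({r[target] for r in rows}) == 1 or not available:
        return {"predict": majority(rows, target)}  # leaf node
    # Stand-in criterion (assumption): fewest misclassified rows after the split.
    var = min(available, key=lambda v: errors_after_split(rows, v, target))
    remaining = [v for v in available if v != var]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[var], []).append(r)
    return {
        "split_on": var,
        "children": {val: build(sub, remaining, target) for val, sub in subsets.items()},
    }


def rules(node, path=()):
    """List every root-to-leaf branch as (sequence of rules, prediction)."""
    if "predict" in node:
        return [(list(path), node["predict"])]
    out = []
    for val, child in node["children"].items():
        out.extend(rules(child, path + ((node["split_on"], val),)))
    return out


# Hypothetical samples: two independent variables and one dependent variable.
data = [
    {"ring": "yes", "charge": "pos", "active": "yes"},
    {"ring": "yes", "charge": "neg", "active": "no"},
    {"ring": "no", "charge": "pos", "active": "no"},
    {"ring": "no", "charge": "neg", "active": "no"},
]
tree = build(data, ["ring", "charge"], "active")
branches = rules(tree)
```

Each entry in `branches` pairs the rules along one branch with the value predicted at its leaf; for example, the rules `ring = yes` then `charge = pos` lead to the prediction `active = yes`. Because a used variable is removed from `remaining`, no variable appears twice in any single branch.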