Tree Analysis Overview

HelixTree’s tree analysis engine is based on Formal Inference-based Recursive Modeling (FIRM) technology. FIRM has its roots in work done in the 1970s and 1980s by Dr. Douglas Hawkins (http://www.douglashawkins.com).

Early recursive partitioning approaches, such as AID, lacked statistical rigor. Dr. Hawkins introduced statistical hypothesis testing as a means of better characterizing the statistical validity of the generated models. FIRM was released in the early 1980s as a non-GUI package and is still in use today.

HelixTree has taken the statistical foundations of FIRM and augmented them with faster and more exact segmenting algorithms. It has also extended the FIRM methods to include multivariate responses. We are grateful for the continued assistance of Dr. Hawkins in devising and improving many of the statistical and algorithmic methods underlying HelixTree.

The best way to gain a working understanding of tree analysis is to go through an example.

NOTE: In the rest of this chapter, we generally assume that only one variable (column) in the spreadsheet has been designated as dependent and that, if the dependent variable is categorical, it contains only two categories. See 10 for a discussion of multiple dependent variables and multi-category dependent variables.


[Picture]
Figure 7.1: This is the sample tree described in the text.

A Sample of a Tree Analysis

The tree in Fig. 7.1 is from the GSIM data set included in the HelixTree release.

This data set simulates the effect of a blood pressure treatment on 1000 individuals. The response is diastolic blood pressure (BP), measured after a hypothetical drug treatment regimen. Several environmental variables are introduced that mitigate the activity of the simulated treatment and therefore affect the response variable.

To begin creating this tree, open a new project, import the GSIM.ghd legacy file, and then open the spreadsheet view. In the spreadsheet view, inactivate the BP_I column (gray it out) and set BP as the dependent variable. Then choose Analysis->Interactive Tree Analysis from the spreadsheet menu to bring up the tree view.

Right-click on the root node of the tree and do a Manual Split on Sex. This split produces two subgroups: a female subgroup and a male subgroup, with the females having the lower mean blood pressure (93.4 vs. 97.8).
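As a rough illustration of what a manual split reports, the following Python sketch partitions a handful of records by one categorical column and computes the mean response in each resulting subgroup. The rows are invented for illustration and are not taken from GSIM, and the helper names manual_split and node_mean are hypothetical; this is not HelixTree's implementation, which may also merge categories and attach significance tests to each split.

```python
from statistics import mean

# Hypothetical rows (NOT from the GSIM data set): one predictor, one response.
rows = [
    {"Sex": "F", "BP": 92.0},
    {"Sex": "F", "BP": 95.0},
    {"Sex": "F", "BP": 93.0},
    {"Sex": "M", "BP": 99.0},
    {"Sex": "M", "BP": 97.0},
    {"Sex": "M", "BP": 98.0},
]


def manual_split(rows, predictor):
    """Partition rows into child nodes, one per level of the predictor."""
    children = {}
    for row in rows:
        children.setdefault(row[predictor], []).append(row)
    return children


def node_mean(rows, response="BP"):
    """Mean of the response variable within one node."""
    return mean(row[response] for row in rows)


children = manual_split(rows, "Sex")
for level in sorted(children):
    node = children[level]
    print(f"Sex {level}: n={len(node)}, mean BP={node_mean(node):.1f}")
```

Each child node keeps its own rows, so a further split (for example on Smoke?) can be applied to one child without touching the other.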

Next, do a Manual Split on Smoke? from the Sex F node. This splits the females into a smoking subgroup and a non-smoking subgroup. The smokers have the higher blood pressure.

Finally, do a Manual Split on Gen_A from the YES Smoke? node. Among female smokers, Gene A is a significant variable: the homozygous 1_1 patients have a higher blood pressure than the homozygous 2_2 or heterozygous patients. Interestingly, Gene A does not show up at the root node as a significant main effect. It is significant only within the female smokers subgroup, a true interaction effect.
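The way an interaction can be hard to see at the root node can be sketched in Python with a small, fully synthetic example. Everything here is an assumption made for illustration: the effect sizes (+4, +3, +8), the balanced design, and the helper names simulated_bp and genotype_gap are invented and do not describe how GSIM was actually generated. The sketch compares only mean differences; in noisy data, a diluted difference like the one at the root is what a significance test can fail to detect.

```python
from itertools import product
from statistics import mean


def simulated_bp(sex, smoke, genotype):
    """Invented response model: genotype 1_1 matters only for female smokers."""
    bp = 95.0
    if sex == "M":
        bp += 4.0
    if smoke == "YES":
        bp += 3.0
    if sex == "F" and smoke == "YES" and genotype == "1_1":
        bp += 8.0  # the interaction effect
    return bp


# One row per combination of levels (balanced, noise-free, purely illustrative).
rows = [
    {"Sex": s, "Smoke?": k, "Gen_A": g, "BP": simulated_bp(s, k, g)}
    for s, k, g in product(["F", "M"], ["YES", "NO"], ["1_1", "1_2", "2_2"])
]


def genotype_gap(rows):
    """Mean BP of 1_1 rows minus mean BP of all other genotypes."""
    hom = [r["BP"] for r in rows if r["Gen_A"] == "1_1"]
    rest = [r["BP"] for r in rows if r["Gen_A"] != "1_1"]
    return mean(hom) - mean(rest)


root_gap = genotype_gap(rows)
female_smokers = [r for r in rows if r["Sex"] == "F" and r["Smoke?"] == "YES"]
subgroup_gap = genotype_gap(female_smokers)

print(f"Gen_A gap at root node:       {root_gap:.1f}")
print(f"Gen_A gap in female smokers:  {subgroup_gap:.1f}")
```

In this toy design the genotype difference is four times larger inside the female smokers subgroup than at the root, because at the root it is averaged over three other subgroups in which the genotype has no effect.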

Not all effects are necessarily found in a single tree. For example, a given node may have more than one significant candidate splitter, but only one of them can be used to split that node. This is why multiple trees are usually explored interactively or sampled using the random tree creation menu described later.
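The situation of several competing splitters at one node can be sketched in Python as follows. This is a simplified stand-in, not HelixTree's procedure: each predictor is scored by the fraction of response variance it explains at the node, and an arbitrary cutoff (THRESHOLD) takes the place of a real significance test. The data, the effect sizes, the Other column, and the function name explained_fraction are all hypothetical.

```python
from itertools import product
from statistics import mean

# Invented, balanced, noise-free rows: Sex and Smoke? affect BP, Other does not.
rows = [
    {
        "Sex": s,
        "Smoke?": k,
        "Other": o,
        "BP": 95.0 + (4.0 if s == "M" else 0.0) + (3.0 if k == "YES" else 0.0),
    }
    for s, k, o in product(["F", "M"], ["YES", "NO"], ["A", "B"])
]


def explained_fraction(rows, predictor, response="BP"):
    """Between-group sum of squares divided by total sum of squares."""
    grand = mean(r[response] for r in rows)
    total = sum((r[response] - grand) ** 2 for r in rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[predictor], []).append(r[response])
    between = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())
    return between / total


THRESHOLD = 0.2  # arbitrary cutoff standing in for a significance test

scores = {p: explained_fraction(rows, p) for p in ["Sex", "Smoke?", "Other"]}
candidates = sorted(
    (p for p in scores if scores[p] >= THRESHOLD), key=scores.get, reverse=True
)
chosen, alternatives = candidates[0], candidates[1:]

print("scores:", {p: round(v, 2) for p, v in scores.items()})
print("split used at this node:", chosen)
print("other qualifying splitters:", alternatives)
```

Here two predictors pass the cutoff but only one is used at the node; starting a second tree from one of the alternatives is one way the remaining effects can be explored.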