Recursive Partitioning Technology Review for Genetic Association Analysis
There is considerable evidence that many diseases have a genetic component, but classical statistical approaches for finding the responsible genes have not worked well for diseases with complex etiologies such as diabetes, schizophrenia, Alzheimer's disease, and depression. Various explanations for this lack of success have been put forward: small sample sizes, multiple genes with small effects, complex genetic interactions (epistasis), interactions with environmental effects, and multiple disease mechanisms.
There is hope that genome-wide association studies (GWAS), which scan hundreds of thousands of single nucleotide polymorphisms (SNPs), will provide the data necessary to unravel the problem. However, the literature describes relatively few established methods for analyzing these data sets, and there are few examples of multiple-gene findings. Novel statistical methods that handle large, complex data sets and incorporate interaction analysis are needed.
Recursive Partitioning (RP)
RP is a powerful method for genetic association studies (Young and Ge, 2005; Zaykin and Young, 2005). RP can be viewed as conditional gene finding: once a split is made based upon one gene, the subsequent analysis is conditional on the presence or absence of that gene. Examination of just about any biochemical pathway suggests that the effect of a gene will depend on the presence or absence of other genes. Quite often there are alternative paths, and two or more paths must be upset for the biochemistry to become unbalanced. So it is likely that the search for genes should be conditional. Some researchers are beginning to investigate conditional gene finding methods using RP (Warren et al., 2006; Wu et al., 2006; Spraggs et al., 2005).
Recursive partitioning is most easily described using an example. (See Young, 2005 for a more comprehensive example.) The basic statistical motive is to divide a data set into parts where the objects in each part are more alike. Consider the figure below, a hypothetical example. The RP diagram is read in the following way. In the parent node, N, there are 2000 individuals, and 212 of them have the disease.
A gene scan is done over a number of candidate genes (bi-allelic in this example), and it is determined that 400 individuals are homozygous 11 for Gene i and that 100 of them have the disease. These individuals become Node N1. Of the remaining 1600 individuals, who form Node N0, 112 have the disease. At this point in the analysis the original data set has been divided into two groups. A gene scan is then done over the 1600 individuals in Node N0, and it is determined that 400 are 11 or 10 for Gene j and that 100 of those people have the disease. These individuals form Node N01. The remaining 1200 individuals are in Node N00, and 12 of them have the disease. Nodes N1 and N01 can be viewed as two distinct forms of the disease based upon Genes i and j.
The analysis is conditional in nature. If we attempt to split Node N1, the search is within the 400 individuals who are homozygous 11 for Gene i, so the search for a new gene is conditional on genotype 11 for Gene i.
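The node counts in this hypothetical tree can be tabulated with a short Python sketch. The dictionary layout and the function names `disease_rate` and `assign_node` are illustrative assumptions, not part of HelixTree or of any published implementation; only the counts and the two split rules come from the example above.

```python
# Node counts from the hypothetical example: (individuals, individuals with disease).
nodes = {
    "N":   (2000, 212),  # parent node
    "N1":  (400, 100),   # homozygous 11 for Gene i
    "N0":  (1600, 112),  # everyone else
    "N01": (400, 100),   # within N0: 11 or 10 for Gene j
    "N00": (1200, 12),   # within N0: 00 for Gene j
}


def disease_rate(node):
    """Fraction of individuals in the node who have the disease."""
    n, cases = nodes[node]
    return cases / n


def assign_node(gene_i, gene_j):
    """Walk the two-split tree; genotypes are the strings '11', '10' or '00'."""
    if gene_i == "11":            # first split, on Gene i
        return "N1"
    if gene_j in ("11", "10"):    # second split, made only inside N0
        return "N01"
    return "N00"


for name, (n, cases) in nodes.items():
    print(f"{name}: {cases}/{n} = {disease_rate(name):.1%}")
```

Read this way, the two splits separate an overall disease rate of 10.6% into two terminal nodes at 25% (N1 and N01) and one at 1% (N00), which is what makes the tree easy to interpret.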
One of the most useful aspects of this analysis is that the results are very easy to understand. Each piece of the analysis is visually presented in a local context. Subject matter experts find the analysis presentation clear and almost always suggestive of new hypotheses.
In this case, a clinical trials expert might suggest that studies of this disease should determine the efficacy of a drug separately for each disease genotype. Individuals with neither genotype 11 for Gene i nor genotype 11 or 10 for Gene j might be excluded from the trial. One reason experts can think clearly about the data is that it has been divided into smaller groups with defining characteristics.
Behind the visual display of the analysis are computer science algorithms and statistical theory. How do you group classes of individuals? In the first split we group 10 and 00, and in the second split we group 11 and 10. Over all the genes, how do you select Gene i for the first split? How do you know when to stop the splitting process?
In this hypothetical example we make two splits. In a real analysis, the splitting would continue as long as it was sensible. The statistical model of the data is the resulting tree. Predictions are made by moving the individual down the tree, right or left at each node, depending upon the characteristics of the individual. Predictions are easy for subject matter experts and individuals to understand, and the analysis and prediction processes are very transparent. In addition, it should be straightforward to include environmental variables, so that genotype-by-environment interactions should be detectable by the same general methods.
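One way to make the grouping and gene-selection questions concrete is a brute-force scan. The sketch below is a minimal illustration under stated assumptions: it uses a Pearson chi-square statistic on a 2x2 table as the split criterion, which is a stand-in and not necessarily the criterion used by HelixTree or by the cited papers, and it applies no multiplicity adjustment or stopping rule. The names `GROUPINGS`, `best_split`, `block`, and the marker `gene_k` are hypothetical.

```python
# Candidate groupings of the three genotypes of a bi-allelic gene. Each set is
# the "left" group; all other genotypes go right. (Illustrative assumption.)
GROUPINGS = [
    frozenset({"11"}),        # 11 vs. {10, 00}
    frozenset({"11", "10"}),  # {11, 10} vs. 00
]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column carries no information
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def best_split(genotypes, disease):
    """Scan every gene and grouping; return (statistic, gene, left_group).

    genotypes: dict mapping gene name -> list of "11"/"10"/"00" calls
    disease:   list of 0/1 outcomes, aligned with the genotype lists
    """
    best = None
    for gene, calls in genotypes.items():
        for group in GROUPINGS:
            a = b = c = d = 0  # left cases, left non-cases, right cases, right non-cases
            for call, y in zip(calls, disease):
                if call in group:
                    a, b = a + y, b + (1 - y)
                else:
                    c, d = c + y, d + (1 - y)
            stat = chi_square_2x2(a, b, c, d)
            if best is None or stat > best[0]:
                best = (stat, gene, group)
    return best


def block(genotype, n, cases):
    """n individuals sharing one genotype, the first `cases` of them diseased."""
    return [genotype] * n, [1] * cases + [0] * (n - cases)


# Toy data matching the first split of the example: 400 individuals are 11 for
# gene_i (100 cases) and 1600 are 10 or 00 (112 cases, spread evenly here).
calls_i, disease = [], []
for genotype, n, cases in [("11", 400, 100), ("10", 800, 56), ("00", 800, 56)]:
    g, y = block(genotype, n, cases)
    calls_i += g
    disease += y

genotypes = {
    "gene_i": calls_i,
    # gene_k is a made-up marker constructed to be unrelated to disease.
    "gene_k": ["11", "10", "00", "10"] * 500,
}

stat, gene, left_group = best_split(genotypes, disease)
print(gene, sorted(left_group), round(stat, 1))
```

On this toy data the scan selects `gene_i` with the grouping 11 versus {10, 00}, mirroring the first split in the example. The conditional search described earlier amounts to calling `best_split` again on only the rows that fall into a child node, and a real analysis would also need a rule for when to stop splitting.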
Recursive partitioning has been highly successful in analyzing single dependent variables that are determined by complex, interacting models and/or mixtures of models. When there are several dependent measures, one can apply recursive partitioning to each of them individually, or to univariate composite score functions such as those used in some multi-criterion decision-making methods. Doing so, however, removes the possibility of discovering inherently multivariate patterns.
This leads to the concept of multivariate recursive partitioning, which is implemented in HelixTree. With multivariate RP, a trial outcome for a drug can then be represented in terms of several efficacy and side effect measures. The conceptual underpinnings of multivariate inference-based recursive partitioning are sketched in Hawkins and Kass (1982).
REFERENCES
1. Hawkins, D.M. and Kass, G.V. (1982) Automatic Interaction Detection. In Topics in Applied Multivariate Analysis; Hawkins, D.M., Ed.; Cambridge University Press, pp. 269-302. [Describes hypothesis-based splitting for recursive partitioning. This approach gives great speed advantages over CART in that complex tree pruning is not necessary.]
2. Hawkins, D.M., Young, S.S., and Rusinko, A. III (1997) "Analysis of a large structure-activity data set using recursive partitioning," Quant. Struct.-Act. Relat. 16, 296-302. [This paper won the best statistics paper for chemistry in 1998 from the American Statistical Association. It shows that recursive partitioning is capable of analyzing mixture data sets and finding multiple mechanisms.]
3. Rusinko, A. III, Farmen, M.W., Lambert, C.G., Brown, P.L., and Young, S.S. (1999) "Analysis of a large structure/biological activity data set using recursive partitioning," J. Chem. Inf. Comput. Sci. 39(6), 1017-1026. [This paper gives algorithms for fast recursive partitioning when the descriptor variables are 0/1.]
4. Spraggs C., Pillai S., Dow D., Douglas C., McCarthy L., Manasco P., Stubbins, M. and Roses A. "Pharmacogenetics and obesity: common gene variants influence weight loss response of the norepinephrine/dopamine transporter inhibitor GW320659 in obese subjects." Pharmacogenetics and Genomics, December 2005, Vol. 15, 883-889.
5. Warren L., Hughes A., Lai E., Zaykin D., Haneline S., Bansal A., Wooster A., Spreen W., Hernandez J., Scott T., Roses A., Mosteller M. "Use of pairwise marker combination and recursive partitioning in a pharmacogenetic genome-wide scan." The Pharmacogenomics Journal (2006), 1-10. September 12, 2006.
6. Wu X., Gu J., Grossman H., Amos C., Etzel C., Huang M., Zhang Q., Millikan R., Lerner S., Dinney C., and Spitz M. "Bladder Cancer Predisposition: A Multigenic Approach to DNA-Repair and Cell-Cycle-Control Genes." American Journal of Human Genetics, March 2006, 78:464-479.
7. Young S., Ge N. "Recursive partitioning analysis of complex disease pharmacogenetics studies. I. Motivation and overview." Pharmacogenomics January 2005, Vol. 6, No. 1, 65-75.
8. Zaykin D., Young S. "Large recursive partitioning analysis of complex disease pharmacogenetic studies. II. Statistical considerations." Pharmacogenomics, January 2005, Vol. 6, No. 1, 77-89.