Principal Component Analysis ("Eigenstrat" method)
Correct input data by principal components analysis using an enhanced "EIGENSTRAT" method
This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements it. The PCA correction technique is described in [Price, 2006]. A more thorough discussion of stratification, principal components analysis, and the eigenvalues involved may be found in [Patterson, 2006].
NOTE: HelixTree implements the technique in the same manner as the “EIGENSTRAT” program, with a few enhancements.
Working Premise
If there is population stratification (or even variation in test equipment), the resulting pattern of variation over the test subjects should show up in, or influence to a greater or lesser degree, the data of many different markers. By contrast, any association with the phenotype in question should show up in only one or a very few markers.
If the pattern or patterns resulting from stratification can be subtracted out, and the tests are then run on the resulting data, the influence of stratification on the association testing should be eliminated or minimized.
The PCA Technique
The object is to first extract the different patterns of variation over the subjects from within the markers, then to subtract out these patterns.
First, a numeric equivalent to each genotype is established, depending upon whether you are using the additive, dominant, or recessive model. (The PCA technique is not available for the other two genetic models, since establishing a numeric equivalent is much more questionable or out of the question for those models.)
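As a rough illustration of what a "numeric equivalent" can look like, the sketch below maps a minor-allele count to a number under each of the three supported models. The function name `encode_genotype` and the specific codes (0/1/2 for additive, 0/1 for dominant and recessive) are assumptions made for this example; they are common conventions, not a statement of HelixTree's internal coding.

```python
def encode_genotype(minor_allele_count, model="additive"):
    """Map a minor-allele count (0, 1 or 2) to a numeric value.

    Hypothetical helper; the codes below are conventional assumptions.
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        # Value grows with each copy of the minor allele.
        return float(minor_allele_count)
    if model == "dominant":
        # One copy of the minor allele is enough.
        return 1.0 if minor_allele_count >= 1 else 0.0
    if model == "recessive":
        # Two copies of the minor allele are required.
        return 1.0 if minor_allele_count == 2 else 0.0
    raise ValueError("model must be 'additive', 'dominant' or 'recessive'")


codes = [encode_genotype(c, "additive") for c in (0, 1, 2)]
```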
Then the following procedure is followed for each marker:
- The average value over this one marker of this numeric equivalent is determined.
- This average value is subtracted from the value for each genotype. The result is the total pattern of variation for this marker.
- This pattern may optionally be “normalized” by (divided by) a value determined from this marker’s data. The possibilities for this value in HelixTree are:
  - The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This is the standard deviation that this marker’s data would have over the population if the marker were in Hardy-Weinberg equilibrium in the population and had the same major and minor allele frequencies as are actually measured for it. This is the standard method used by the “EIGENSTRAT” program.
  - The actual standard deviation of this marker’s data.
  - Don’t normalize.
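The per-marker steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function name `center_and_normalize` and the option names are hypothetical, the data are assumed to use additive 0/1/2 coding with no missing genotypes, and the HWE standard deviation is taken as sqrt(2p(1-p)), where p is the measured minor allele frequency. A program may scale this by a constant factor, which does not change the relative weighting of markers.

```python
import numpy as np


def center_and_normalize(marker_values, method="hwe"):
    """Center one marker's numeric genotypes and optionally normalize them.

    Hypothetical helper; assumes additive 0/1/2 coding and no missing data.
    """
    values = np.asarray(marker_values, dtype=float)
    centered = values - values.mean()      # subtract the marker average
    if method == "none":
        return centered
    if method == "actual":
        scale = values.std()               # observed standard deviation
    elif method == "hwe":
        p = values.mean() / 2.0            # measured minor allele frequency
        scale = np.sqrt(2.0 * p * (1.0 - p))   # SD expected under HWE
    else:
        raise ValueError("method must be 'hwe', 'actual' or 'none'")
    # A monomorphic marker has zero spread; leave it centered only.
    return centered / scale if scale > 0 else centered


pattern = center_and_normalize([0, 1, 2, 1], method="hwe")
```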
These (possibly normalized) individual marker patterns over the n subjects are then “summed up” into an n×n matrix, referred to as a “Wishart” or “Wishart-like” matrix, or just as “XᵀX”.
(Note that the effects of normalizing or not normalizing the pattern data are simply to weight the contributions of the various markers to the Wishart matrix differently.)
The “eigenvectors” and “eigenvalues” of this matrix are called its “components”, and the “eigenvectors” corresponding with its largest “eigenvalues” are called its “principal components”.
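A minimal NumPy sketch of forming the n×n matrix and extracting its components is shown below. It assumes the marker patterns are stacked as rows of an array `X` with one column per subject, so that the product of the transpose of `X` with `X` is the sum over markers of each pattern's outer product with itself. The function name `principal_components` is hypothetical.

```python
import numpy as np


def principal_components(X):
    """Return eigenvalues (largest first) and matching eigenvectors.

    X has shape (markers, subjects); each row is one centered marker pattern.
    """
    X = np.asarray(X, dtype=float)
    W = X.T @ X                            # n x n "Wishart-like" matrix
    # eigh is intended for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(W)
    order = np.argsort(eigenvalues)[::-1]  # reorder so the largest come first
    return eigenvalues[order], eigenvectors[:, order]


# Three centered marker patterns over four subjects (illustrative values).
X_demo = np.array([[1.0, -1.0, 0.0, 0.0],
                   [2.0, -2.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, -1.0]])
values, vectors = principal_components(X_demo)
```

In this toy input, two markers share the same pattern over the first two subjects, so that pattern dominates the first principal component.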
According to the working premise of the “EIGENSTRAT” PCA technique, the first one or first few of these principal components correspond directly to the stratification pattern or patterns, because such patterns are present across so many markers.
Therefore, the final two steps of stratification correction through the “EIGENSTRAT” PCA technique are to find the top k principal components (where you select k) and “subtract out” these components/patterns from both the marker data and the dependent variable using a vector-analysis-related technique.
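One way to read “subtract out” is as an orthogonal projection: each vector's loading on a principal component is computed, and that multiple of the component is removed. The sketch below takes this reading as an assumption; the function name `remove_components` is hypothetical, and the columns of `components` are assumed to be orthonormal, as eigenvectors of a symmetric matrix are when returned by a routine such as `numpy.linalg.eigh`.

```python
import numpy as np


def remove_components(values, components):
    """Remove the part of `values` that lies along each component.

    values: shape (n,) for one marker or phenotype, or (markers, n).
    components: shape (n, k), orthonormal columns (top k eigenvectors).
    """
    values = np.asarray(values, dtype=float)
    components = np.asarray(components, dtype=float)
    loadings = values @ components         # dot product with each component
    return values - loadings @ components.T


# One illustrative unit-length component over four subjects.
component = np.array([[1.0], [-1.0], [0.0], [0.0]]) / np.sqrt(2.0)
adjusted = remove_components([3.0, 1.0, 5.0, 5.0], component)
```

The same call would be applied both to the marker data and to the dependent variable, so that both are free of the selected patterns before testing.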
For a more precise and “mathematical” explanation of this process, please see the Formulas and Theories chapter of the HelixTree Manual.
How Many Components to Use?
Determining this is to some degree an open question. If you choose too many, you may wind up subtracting out all effects, thus getting nothing from your tests.
We recommend a heuristic approach by Mu Zhu et al. (2006), which we have augmented to use HelixTree to compute and plot the log of the eigenvalues and then use HelixTree's segmenting algorithm to find the position of the "elbow" on a scree plot. The position of the elbow is essentially the number of principal components to use. A separate tutorial leads you through this approach.
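Independently of that tutorial, the idea of locating an elbow can be sketched in a few lines. The example below is a simplified sketch in the spirit of the profile-likelihood idea from Zhu et al. (2006), applied to the log eigenvalues: every split point is tried, the values on each side are modeled with their own mean and a shared variance, and the split with the highest likelihood is returned. It is not HelixTree's segmenting algorithm, and the function name `elbow_position` is hypothetical.

```python
import math


def _sum_sq_dev(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def elbow_position(eigenvalues):
    """Return the number of leading eigenvalues that sit before the elbow.

    Assumes eigenvalues are sorted largest first; non-positive ones are ignored.
    """
    logs = [math.log(v) for v in eigenvalues if v > 0]
    p = len(logs)
    if p < 3:
        raise ValueError("need at least three positive eigenvalues")
    best_q, best_loglik = None, -math.inf
    for q in range(1, p):                  # first q values form one group
        ss = _sum_sq_dev(logs[:q]) + _sum_sq_dev(logs[q:])
        sigma2 = max(ss / (p - 2), 1e-12)  # shared variance, guarded from zero
        loglik = -0.5 * p * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)
        if loglik > best_loglik:
            best_q, best_loglik = q, loglik
    return best_q


k = elbow_position([100.0, 80.0, 1.2, 1.1, 1.0, 0.9, 0.8])  # -> 2
```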
A Second Answer
A second answer to “How many principal components to use?” is to try various numbers of components, and also to apply Genomic Control to the outputs of both the un-PCA-corrected tests and the PCA-input-corrected tests. If the inflation factor λ is much lower for, say, two components than it is for un-PCA-corrected testing or for one, three, four, or five components, you can conclude that two components probably work best for your test data.
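For reference, the Genomic Control inflation factor λ is commonly estimated as the median of the observed one-degree-of-freedom chi-square statistics divided by the median of the chi-square distribution with one degree of freedom (about 0.4549). The sketch below assumes that definition; the function name `inflation_factor` and the sample statistics are made up for illustration.

```python
import statistics

# Median of the chi-square distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_statistics):
    """Estimate lambda from a collection of 1-df chi-square test statistics."""
    return statistics.median(chi2_statistics) / CHI2_1DF_MEDIAN


# Illustrative (invented) test statistics for two settings.
lambdas = {
    0: inflation_factor([0.9, 1.4, 0.7, 2.2, 0.8]),   # no PCA correction
    2: inflation_factor([0.3, 0.5, 0.4, 1.9, 0.6]),   # two components removed
}
```

Comparing the entries of `lambdas` across several component counts gives the kind of side-by-side view described above.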
Removing Outlier Subjects
If you are interested in what the principal components themselves are, or wish to be more sure you have corrected for stratification through PCA, there is the option of repeating the determination of the principal components after removing from the analysis patients or subjects who are found to have extreme values in one or more of the previously-determined principal components.
HelixTree allows you to optionally do this. Select how many principal components should be involved in this process, how many standard deviations in all of these components a patient or subject should be within, and how many times to repeat this process. The final testing and principal components spreadsheet will avoid using any of the outlier subjects, who will be logged in a separate spreadsheet.
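The iterative removal described above might look roughly like the following NumPy sketch. It is a simplification under stated assumptions: the function name `drop_outlier_subjects` and its default values are placeholders, markers are only re-centered (not re-normalized) on each pass, and a subject counts as an outlier when its value in any of the selected components lies more than `max_sd` standard deviations from that component's mean.

```python
import numpy as np


def drop_outlier_subjects(X, n_components=1, max_sd=6.0, n_iterations=5):
    """Repeatedly recompute components and drop subjects with extreme values.

    X has shape (markers, subjects). Returns (kept indices, removed indices).
    """
    X = np.asarray(X, dtype=float)
    keep = np.arange(X.shape[1])
    removed = []
    for _ in range(n_iterations):
        sub = X[:, keep]
        sub = sub - sub.mean(axis=1, keepdims=True)   # re-center each marker
        eigenvalues, eigenvectors = np.linalg.eigh(sub.T @ sub)
        top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:n_components]]
        sd = top.std(axis=0)
        sd[sd == 0] = 1.0                  # avoid dividing by zero
        z = (top - top.mean(axis=0)) / sd  # per-component z-scores
        is_outlier = np.any(np.abs(z) > max_sd, axis=1)
        if not is_outlier.any():
            break                          # nothing left to remove
        removed.extend(keep[is_outlier].tolist())
        keep = keep[~is_outlier]
    return keep.tolist(), removed


# Illustrative data: subject 0 is extreme in every marker.
X_demo = np.zeros((3, 20))
X_demo[:, 0] = 10.0
X_demo[0, 1:] = 0.1 * (-1.0) ** np.arange(19)
kept, logged = drop_outlier_subjects(X_demo, n_components=1, max_sd=3.0)
```

Note that with n subjects a z-score can never exceed the square root of n - 1, so a strict threshold only has an effect on reasonably large samples.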
PCA-Corrected Association Testing
The only test available, once you have corrected the input data for stratification using Principal Components Analysis, is the Correlation/Trend Test.
This is because this test is based on numeric values for both the predictor and the response. Thus, even if these numeric values are corrected from, for instance, the “simple” values derived from the genotypes of zero, one, and maybe two, or the “simple” values derived from a case/control variable of zero or one, this test may still be run.
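To illustrate why numeric inputs are all this test needs, the sketch below computes a correlation-based trend statistic from two numeric vectors. It assumes the statistic is the number of subjects times the squared Pearson correlation, referred to a chi-square distribution with one degree of freedom; the exact scaling a given program uses after PCA correction may differ. The function name `correlation_trend_test` is hypothetical.

```python
import math


def correlation_trend_test(predictor, response):
    """Return (chi-square statistic, p-value) for a correlation/trend test.

    Assumes statistic = n * r^2 with one degree of freedom.
    """
    n = len(predictor)
    if n != len(response) or n < 3:
        raise ValueError("need two equal-length vectors of at least 3 values")
    mean_x = sum(predictor) / n
    mean_y = sum(response) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, response))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in response)
    if sxx == 0 or syy == 0:
        raise ValueError("predictor and response must both vary")
    r_squared = sxy * sxy / (sxx * syy)
    statistic = n * r_squared
    # Upper tail of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


stat, p = correlation_trend_test([0, 1, 2, 0, 1, 2], [0, 0, 1, 0, 1, 1])
```

The inputs here are uncorrected codes for simplicity; PCA-corrected values, which are no longer whole numbers, can be passed in the same way.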
References
1. Zhu M, et al. (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis 51, 918-930.
2. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006). Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies. Nature Genetics 38, 904-909.
3. Patterson N, Price AL, Reich D (2006). Population Structure and Eigenanalysis. PLoS Genet 2(12): e190. doi:10.1371/journal.pgen.0020190.