Correction for Stratification

Sometimes, an apparent association is confounded by population stratification. A condition may be more prevalent in one group of people than in another, which produces a spurious association between the condition or trait being tested for and any genetic characteristics that differ between the two groups of people (or between two “races”).
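The effect can be seen in a small worked example. The counts below are invented purely for illustration: within each of two hypothetical groups, carrying an allele is independent of disease status, yet pooling the groups yields a large chi-square because the groups differ in both carrier rate and disease prevalence.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Illustrative counts (not real data), ordered as
# (carrier cases, carrier controls, non-carrier cases, non-carrier controls).
group_1 = (20, 180, 80, 720)     # 20% carriers, 10% disease prevalence
group_2 = (400, 400, 100, 100)   # 80% carriers, 50% disease prevalence
pooled = tuple(x + y for x, y in zip(group_1, group_2))

print(chi_square_2x2(*group_1))            # 0.0
print(chi_square_2x2(*group_2))            # 0.0
print(round(chi_square_2x2(*pooled), 1))   # 137.1
```

Neither group shows any association on its own; the pooled statistic reflects only the group difference.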

While it is good practice to base a study on as homogeneous a group of test subjects as possible, [Price 2006] notes that even the mild variation in genetic characteristics among those who classify themselves as “Caucasian” can be enough to confound a study done over thousands of genetic markers.

Additionally, it is noted in [Price 2006] that there is evidence that variations in test equipment can also easily confound studies.

Two methods are available in HelixTree for correction for population stratification:

  • Correction of Input Data by Principal Components Analysis
  • Correction of Output Chi-Square Values and P-Values by Genomic Control

18.6.1 Correction of Input Data by Principal Components Analysis

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements it. The PCA correction technique is described in [Price 2006]. A more thorough discussion of stratification, principal components analysis, and the eigenvalues involved may be found in [Patterson 2006].

NOTE: HelixTree implements the technique in the same fashion as the “EIGENSTRAT” program, with a few enhancements.

18.6.2 Working Premise

If there is population stratification (or even variation in test equipment), the resulting pattern of variation over the test subjects should show up, to a greater or lesser degree, in the data of many different markers. By contrast, any association with the phenotype in question should show up in only one or a very few markers.

If the pattern or patterns resulting from stratification can be subtracted out, and testing is done on the resulting data, the influence of stratification on the association testing should be eliminated or minimized.

18.6.3 The PCA Technique

The object is to first extract the different patterns of variation over the subjects from within the markers, then to subtract out these patterns.

First, a numeric equivalent to each genotype is established, depending upon whether you are using the additive, dominant, or recessive model. (The PCA technique is not available for the other two genetic models, since establishing a numeric equivalent is much more questionable or out of the question for those models.)
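As a minimal sketch of this step, the function below maps a genotype to a number under each of the three supported models. It assumes the genotype has already been reduced to a count of minor alleles (0, 1, or 2); the function name and the exact 0/1/2 coding are illustrative assumptions, not HelixTree's documented internals.

```python
def encode_genotype(minor_allele_count, model):
    """Return a numeric equivalent for a genotype given as a minor-allele count.

    Assumed coding (hypothetical, for illustration):
      additive  -> 0, 1, 2  (number of minor alleles)
      dominant  -> 1 if at least one minor allele is present, else 0
      recessive -> 1 only if both alleles are the minor allele, else 0
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1, or 2")
    if model == "additive":
        return float(minor_allele_count)
    if model == "dominant":
        return 1.0 if minor_allele_count >= 1 else 0.0
    if model == "recessive":
        return 1.0 if minor_allele_count == 2 else 0.0
    raise ValueError("model must be 'additive', 'dominant', or 'recessive'")


# One heterozygous subject under each model.
print([encode_genotype(1, m) for m in ("additive", "dominant", "recessive")])
# [1.0, 1.0, 0.0]
```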

Then the following procedure is followed for each marker:

  • The average value over this one marker of this numeric equivalent is determined.
  • This average value is subtracted from the value for each genotype. The result is the total pattern of variation for this marker.
  • This pattern might be “normalized” by (divided by) a value determined from this marker’s data. The possibilities for this value in HelixTree are:
    • The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). This is the standard deviation this marker’s data would have over the population if the marker were in Hardy-Weinberg equilibrium there and had the same major and minor allele frequencies as are actually measured for this marker. (See 26.23.4.) This is the standard method used by the “EIGENSTRAT” program.
    • The actual standard deviation of this marker’s data.
    • Don’t normalize.
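The per-marker procedure above can be sketched as follows. This is an illustration under stated assumptions, not the formula from 26.23.4: it assumes additive coding (values are minor-allele counts with no missing data), uses the binomial result that a count of two alleles with frequency p has variance 2p(1-p) at Hardy-Weinberg equilibrium, and uses the population (divide-by-n) form for the actual standard deviation. The function and option names are hypothetical.

```python
import math


def center_and_normalize(values, method="hwe"):
    """Center one marker's numeric values and optionally normalize them.

    method: "hwe"    -> divide by the theoretical HWE standard deviation
                        (additive coding assumed: sqrt(2 p (1 - p)))
            "actual" -> divide by the observed standard deviation
            "none"   -> center only
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    if method == "none":
        return centered
    if method == "hwe":
        p = mean / 2.0  # allele frequency implied by additive coding
        sd = math.sqrt(2.0 * p * (1.0 - p))
    elif method == "actual":
        sd = math.sqrt(sum(c * c for c in centered) / n)
    else:
        raise ValueError("method must be 'hwe', 'actual', or 'none'")
    if sd == 0.0:
        return centered  # monomorphic marker: nothing to scale
    return [c / sd for c in centered]


marker = [0, 1, 2, 1, 0, 0]
pattern = center_and_normalize(marker, method="hwe")
```

Whatever the method, the resulting pattern sums to zero over the subjects; the choice only changes its scale.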

These (possibly normalized) individual marker patterns over the n subjects are then “summed up” into an n × n matrix, referred to as a “Wishart” or “Wishart-like” matrix, or just as “XᵀX”.

(Note that the effects of normalizing or not normalizing the pattern data are simply to weight the contributions of the various markers to the Wishart matrix differently.)

The eigenvectors of this matrix are called its “components”, each with a corresponding eigenvalue, and the eigenvectors corresponding to its largest eigenvalues are called its “principal components”.

It so happens that (according to the working premise of the “EIGENSTRAT” PCA technique) the first one or first few of these principal components correspond directly to the stratification pattern(s), because these patterns are present across so many markers.
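A minimal NumPy sketch of building this matrix and extracting its components is given below. It assumes the marker patterns are stored as a matrix X with one row per marker and one column per subject, so that the n × n matrix is the product of the transpose of X with X; the function name is hypothetical, and the precise definition used by HelixTree is the one in 26.23.

```python
import numpy as np


def components_of_patterns(X):
    """Return eigenvalues (largest first) and matching eigenvectors.

    X: array of shape (n_markers, n_subjects) holding the centered,
       possibly normalized pattern of each marker as one row.
    """
    X = np.asarray(X, dtype=float)
    W = X.T @ X  # n_subjects x n_subjects; sums each marker's outer product
    eigenvalues, eigenvectors = np.linalg.eigh(W)  # symmetric input, ascending order
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


# Example with random stand-in data: 200 markers, 12 subjects.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
X -= X.mean(axis=1, keepdims=True)  # center each marker over the subjects
eigenvalues, eigenvectors = components_of_patterns(X)
principal_components = eigenvectors[:, :2]  # columns for the two largest eigenvalues
```

np.linalg.eigh is used because the matrix is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering.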

Therefore, the final two steps of stratification correction through the “EIGENSTRAT” PCA technique are to find the top k principal components (where you select k) and “subtract out” these components/patterns from both the marker data and the dependent variable using a vector-analysis-related technique.
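One natural reading of “subtracting out” a set of components is orthogonal projection: remove from each vector its part lying along each chosen component. The sketch below assumes exactly that, and assumes the components are supplied as orthonormal columns (as eigenvectors of a symmetric matrix are). The precise formula HelixTree uses is given in 26.23 and is not reproduced here; the function name is hypothetical.

```python
import numpy as np


def subtract_components(rows, components):
    """Remove the part of each row vector that lies along the given components.

    rows:       array of shape (n_vectors, n_subjects), e.g. marker patterns,
                or a single vector of length n_subjects (the dependent variable).
    components: array of shape (n_subjects, k) with orthonormal columns.
    Returns a 2-D array of the same row layout.
    """
    rows = np.atleast_2d(np.asarray(rows, dtype=float))
    components = np.asarray(components, dtype=float)
    return rows - (rows @ components) @ components.T


# Example: two orthonormal components over six subjects (random stand-ins).
rng = np.random.default_rng(1)
components, _ = np.linalg.qr(rng.normal(size=(6, 2)))
markers = rng.normal(size=(4, 6))
phenotype = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])

corrected_markers = subtract_components(markers, components)
corrected_phenotype = subtract_components(phenotype, components)[0]
```

After the subtraction, every corrected row has zero projection on each removed component, which is what makes the corrected values non-integer even for 0/1/2 inputs.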

For a more precise and “mathematical” explanation of this process, please see 26.23 in the Formulas and Theories chapter.

18.6.4 How Many Components to Use?

This is somewhat of an open question.

First off, if you choose as many components as there are markers (where that is possible), you will wind up subtracting out ALL effects, thus getting nothing from your tests!

The best answer consists of first simply obtaining the components themselves and their corresponding eigenvalues. (Do this either while running uncorrected tests or from the separate PCA window; see 18.9.)

Then look at the pattern of the eigenvalues. If the first few are very large compared with the remaining eigenvalues, then use that many components in a second analysis in which you DO apply the PCA technique.
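Inspecting the eigenvalues is normally done by eye. As a rough aid only, the sketch below reports where the largest relative drop between consecutive eigenvalues occurs. This gap heuristic, the function name, and the max_k limit are illustrative assumptions and are not a rule taken from HelixTree or from the cited papers.

```python
def largest_eigenvalue_gap(eigenvalues, max_k=10):
    """Return (k, ratio): the number of leading eigenvalues before the
    largest relative drop, and the size of that drop.

    A ratio close to 1 means no clear break, in which case k should not
    be trusted. Returns (0, 1.0) if no drop is found.
    """
    values = sorted(eigenvalues, reverse=True)
    limit = min(max_k, len(values) - 1)
    best_k, best_ratio = 0, 1.0
    for k in range(1, limit + 1):
        if values[k] <= 0:
            break  # remaining eigenvalues are numerically zero
        ratio = values[k - 1] / values[k]
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k, best_ratio


# Two eigenvalues stand well clear of the rest (made-up numbers).
k, ratio = largest_eigenvalue_gap([50.0, 30.0, 2.1, 2.0, 1.9, 1.8])
print(k)  # 2
```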

18.6.5 A Second Answer

A second answer to “How many principal components to use?” is to try various numbers of components, applying Genomic Control (18.6.8) to the outputs of both the un-PCA-corrected tests and the PCA-input-corrected tests. If the inflation factor λ is much lower for, say, two components than it is for un-PCA-corrected testing or for one, three, four, or five components, then two components probably work best for your test data.

18.6.6 Removing Outlier Subjects

If you are interested in what the principal components themselves are, or wish to be more sure you have corrected for stratification through PCA, there is the option of repeating the determination of the principal components after removing from the analysis patients or subjects who are found to have extreme values in one or more of the previously-determined principal components.

HelixTree allows you to optionally do this. Select how many principal components should be involved in this process, how many standard deviations in all of these components a patient or subject should be within, and how many times to repeat this process. The final testing and principal components spreadsheet will avoid using any of the outlier subjects, who will be logged in a separate spreadsheet.
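The iteration described above might be sketched as follows. This is a simplified illustration, not HelixTree's implementation: it assumes the marker patterns are held in a matrix with one row per marker and one column per subject, re-centers each marker on the remaining subjects in every round, and measures "standard deviations" on each component's subject scores. All names and parameters are hypothetical.

```python
import numpy as np


def remove_outlier_subjects(X, n_components, max_sd, n_iterations):
    """Repeatedly recompute principal components and drop extreme subjects.

    X: array (n_markers, n_subjects). Returns (kept, removed) subject indices.
    """
    X = np.asarray(X, dtype=float)
    kept = np.arange(X.shape[1])
    removed = []
    for _ in range(n_iterations):
        sub = X[:, kept]
        sub = sub - sub.mean(axis=1, keepdims=True)  # re-center on remaining subjects
        eigenvalues, eigenvectors = np.linalg.eigh(sub.T @ sub)
        top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:n_components]]
        spread = top.std(axis=0)
        spread[spread == 0] = 1.0  # guard against a constant component
        z = (top - top.mean(axis=0)) / spread
        is_outlier = (np.abs(z) > max_sd).any(axis=1)
        if not is_outlier.any():
            break  # nothing left to remove
        removed.extend(kept[is_outlier].tolist())
        kept = kept[~is_outlier]
    return kept, removed


# Stand-in data: 40 markers, 30 subjects, with subject 5 made extreme.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
X[:, 5] += 10.0
kept, removed = remove_outlier_subjects(X, n_components=1, max_sd=4.0, n_iterations=3)
```

The removed indices play the role of the separate outlier log, and the kept indices are the subjects used for the final components and tests.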

18.6.7 PCA-Corrected Association Testing

The only test available, once you have corrected the input data for stratification using Principal Components Analysis, is the Correlation/Trend Test (18.3.1).

This is because this test is based on numeric values for both the predictor and the response. Thus, the test may still be run even after these numeric values have been corrected away from, for instance, the “simple” genotype-derived values of zero, one, and maybe two, or the “simple” case/control values of zero or one.
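To illustrate why a correlation-based test carries over to corrected values, the sketch below computes the Pearson correlation r between two numeric lists and a statistic of the form n·r². Using n as the multiplier is an assumption made here for illustration; a degrees-of-freedom-adjusted count may be used instead, and the statistic HelixTree actually computes is the one defined in 18.3.1. The function name is hypothetical.

```python
import math


def correlation_trend_statistic(predictor, response):
    """Return (r, n * r**2) for two equal-length numeric sequences.

    Works for any real values, so it applies equally to raw codes
    (0/1/2 and 0/1) and to PCA-corrected, non-integer values.
    """
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(response) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, response))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in response)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # no variation, so no measurable correlation
    r = sxy / math.sqrt(sxx * syy)
    return r, n * r * r


genotypes = [0, 1, 2, 0, 1, 2]   # additive codes
status = [0, 0, 1, 0, 1, 1]      # case/control codes
r, statistic = correlation_trend_statistic(genotypes, status)
```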

18.6.8 Correction of Output Chi-Square Values and P-Values by Genomic Control

This somewhat older method, pioneered by Devlin and Roeder [Devlin and Roeder 1999], notes that when association tests are confounded by stratification, the distribution of their chi-square statistics will be more “spread out” than it should be. As a result, its median will be higher than the median of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the bottom line is that the distribution will usually be spread out uniformly by a certain “inflation factor” λ.

The technique of Genomic Control measures this “inflation factor” λ by taking the median of the distribution of the chi-square statistic from results of an actual test done over a set of markers from the study in question, and dividing this median by the median of the corresponding (ideal) chi-square distribution. If the result is less than one, the distribution is considered close enough to ideal and λ is taken to be one.

Then, Genomic Control applies its correction by dividing the chi-square statistics from the actual association tests by this λ, which makes the results appropriately more conservative whenever λ is greater than one.
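The measurement and the correction can be sketched in a few lines. The example assumes the tests have one degree of freedom, for which the ideal median is the square of the 75th-percentile standard normal quantile (about 0.455); the function names are hypothetical, and the optional lam argument stands for a λ obtained elsewhere.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(chi_squares):
    """Median of the observed statistics over the ideal median, floored at one."""
    lam = statistics.median(chi_squares) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)


def genomic_control(chi_squares, lam=None):
    """Divide each chi-square statistic by lambda.

    If lam is None, lambda is measured from chi_squares themselves;
    otherwise the supplied value is used.
    """
    if lam is None:
        lam = inflation_factor(chi_squares)
    return [c / lam for c in chi_squares]


observed = [0.1, 2 * CHI2_1DF_MEDIAN, 5.0]   # made-up statistics, median doubled
lam = inflation_factor(observed)             # approximately 2.0
corrected = genomic_control(observed)
```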

Two approaches exist for this:

  • (Appropriate for studies over a small number of markers:) Measure the “inflation factor” λ over a set of markers designed to indicate population stratification. Then use this λ on the actual association test (presumably done for just a few candidate markers).
  • (Appropriate for whole-genome scans or a large number of markers:) Measure the “inflation factor” λ over the actual association tests being done. Then afterward, use this λ on all chi-square results so obtained.

HelixTree facilitates both approaches.

Select Show Inflation Factor (Lambda), Chi-Squares, and Corrected Values to find inflation factors (λ) and the results of applying the Genomic Control technique on chi-squares, p-values, Bonferroni-adjusted p-values, and False Discovery Rates.

Select Correct Using This Inflation Factor (Lambda) Instead: and enter a λ value to use an “inflation factor” that was determined from a previous association test run or other previous data.

NOTE: The inflation factor relates to the chi-square statistic. After a chi-square statistic has been corrected through Genomic Control, the normal procedure for finding the approximate p-value is still followed. If there was “inflation” (λ greater than one), the GC-corrected p-value will be larger than the uncorrected one, that is, pushed closer to one.
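The following sketch shows the effect on a single p-value. It assumes a chi-square test with one degree of freedom, for which the upper-tail probability equals erfc(sqrt(x / 2)); the statistic and λ values are made up for illustration, and the function name is hypothetical.

```python
import math


def chi_square_1df_p_value(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


chi_square = 10.0   # illustrative uncorrected statistic
lam = 1.25          # illustrative inflation factor

p_uncorrected = chi_square_1df_p_value(chi_square)
p_corrected = chi_square_1df_p_value(chi_square / lam)

# The corrected statistic is smaller, so its p-value is larger.
print(p_corrected > p_uncorrected)  # True
```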