7.7 Principal Component Analysis Overview

Correcting for Stratification

Sometimes finding an association can be confounded by population stratification. This is because a condition may be more prevalent in one group of people than in a different group, resulting in a spurious association between the condition or trait being tested for and any genetic characteristics which vary between the two different groups of people.

While it is good practice for studies to be based on as homogeneous a group of test subjects as possible, it has been noted in [Price 2006] that even the mild variation in genetic characteristics among those who classify themselves as belonging to one ethnic group or another can be problematic enough to confound a study done over thousands of genetic markers.

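Two methods are available in SVS for correction for population stratification: Genomic Control (see Correcting for Stratification by Genomic Control) and correction of the input data by Principal Component Analysis, which is described below.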

Correcting for Batch Effects and Other Measurement Errors

[Price 2006] notes that there is evidence that variations in test equipment can easily introduce bias into studies.

Additionally, with Copy Number Variation (CNV) studies, variation between batches (batch effects) may easily affect the association test results. 

Principal Component Analysis may be used to correct for measurement variations in the input data, as well as for population stratification.

Correction of Input Data by Principal Component Analysis

This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” that implements the PCA correction technique described in [Price 2006]. A more thorough discussion of stratification, principal component analysis, and the eigenvalues involved may be found in [Patterson 2006].

NOTE:

  • The technique implemented in SVS uses a methodology similar to that of the “EIGENSTRAT” program, with several enhancements.

Working Premise

If there is population stratification or there are batch effects, this pattern of variation over the samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. By contrast, any association with the phenotype in question should manifest only over one or very few markers.

If one can find a way to remove any pattern(s) resulting from stratification or batch effects and test only the data without these patterns, the influence from the stratification or batch effects should be eliminated or minimized.

The PCA Technique

The object is to first determine the different patterns of variation over the subjects from within the markers, then to remove these patterns.

The following procedure is followed for each marker:

  1. The average value for the marker across all samples is determined.
  2. The average value is subtracted from all values for the marker to recenter the data about zero. The result is the total pattern of variation for the marker.
  3. This pattern might be “normalized” or divided by a value determined from the marker’s data. The possibilities for normalization in SVS are:
    • The theoretical standard deviation of the marker’s data at Hardy-Weinberg Equilibrium (HWE). This is what the standard deviation of the data for the marker would be over the entire population if the marker in the population were in HWE and had the same major and minor allele frequencies as observed for the marker.
    • The actual standard deviation of the data for the marker.
    • No normalization.
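
The per-marker steps above can be sketched in a few lines of Python. This is only an illustration, not the SVS implementation: the function name and the option names "hwe", "actual" and "none" are hypothetical, the genotypes are assumed to be coded additively as 0, 1 or 2 copies of the minor allele, and the HWE standard deviation is assumed to be sqrt(2p(1-p)) with p estimated as half the marker mean.

```python
import math

def center_and_normalize(values, method="hwe"):
    """Center one marker's values about zero and optionally normalize them.

    values: additive genotype codes (0, 1 or 2), one per sample.
    method: "hwe", "actual" or "none" (hypothetical option names).
    """
    n = len(values)
    mean = sum(values) / n                      # step 1: average over all samples
    centered = [v - mean for v in values]       # step 2: recenter about zero

    if method == "none":
        return centered
    if method == "hwe":
        # Assumption: allele frequency p = mean / 2, and the variance of a
        # 0/1/2 coded genotype under HWE is 2p(1 - p).
        p = mean / 2.0
        sd = math.sqrt(2.0 * p * (1.0 - p))
    elif method == "actual":
        # Population form of the standard deviation (divides by n).
        sd = math.sqrt(sum(c * c for c in centered) / n)
    else:
        raise ValueError("unknown normalization method: " + method)

    if sd == 0.0:
        return centered                         # monomorphic marker: nothing to scale
    return [c / sd for c in centered]           # step 3: normalize

marker = [0, 1, 2, 1, 0, 1, 1, 2]
pattern = center_and_normalize(marker, "hwe")
```

Whichever option is chosen, the resulting pattern always sums to zero over the samples; only its scale changes.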

These (possibly normalized) individual marker patterns over the n subjects are then “summed up” into an n × n matrix, referred to as a “Wishart” or “Wishart-like” matrix, or just as “XᵀX”.

NOTE:

  • The effects of normalizing or not normalizing the pattern data are simply to weight the contributions of the various markers to the Wishart matrix differently.
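
One way to picture the “summing up” is as a sum of outer products, one per marker. The sketch below assumes each marker pattern is a list of n centered (and possibly normalized) values. The function name is hypothetical, and any overall scaling of the matrix (for example, dividing by the number of markers) is left out because the text above does not specify it.

```python
def wishart_like_matrix(patterns):
    """Sum the outer product of every marker pattern into an n x n matrix.

    patterns: list of per-marker lists, each holding n centered values.
    """
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                # Entry (i, j) accumulates how samples i and j co-vary.
                w[i][j] += pattern[i] * pattern[j]
    return w

# Two illustrative (made-up) marker patterns over three samples.
patterns = [
    [-1.0, 0.0, 1.0],
    [0.5, -1.0, 0.5],
]
w = wishart_like_matrix(patterns)
```

A marker whose pattern has been divided by a larger value contributes smaller products to every entry, which is the weighting effect described in the note above.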

The “eigenvectors” and “eigenvalues” of this matrix are called its “components”, and the “eigenvectors” corresponding with its largest “eigenvalues” are called its “principal components”.

It so happens that (according to the working premise of the “EIGENSTRAT” PCA technique) the first principal component or the first few principal components correspond directly to the stratification pattern(s), based on the prevalence of this pattern or these patterns of stratification over so many markers.

Therefore, the final two steps of stratification correction through the “EIGENSTRAT” PCA technique are to find the top k principal components (where you select k) and then to remove these components (patterns) from both the marker data and the dependent variable using a technique from vector analysis.
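
The two final steps can be illustrated with a small pure-Python sketch. It is not the SVS or EIGENSTRAT code: it assumes power iteration with deflation as a simple way to find the largest eigenvalues of a symmetric matrix, and it assumes that “removing” a component means subtracting the projection of a vector onto that component. All function names are hypothetical.

```python
import math

def top_components(w, k, iterations=500):
    """Approximate the k largest eigenvalues and unit eigenvectors of symmetric w."""
    n = len(w)
    work = [row[:] for row in w]
    values, vectors = [], []
    for c in range(k):
        # Deterministic, non-constant starting vector.
        v = [1.0 + ((i + c) % n) for i in range(n)]
        for _ in range(iterations):
            nxt = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            v = [x / norm for x in nxt]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * work[i][j] * v[j] for i in range(n) for j in range(n))
        values.append(lam)
        vectors.append(v)
        # Deflate so the next pass finds the next-largest component.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return values, vectors

def remove_components(data, vectors):
    """Subtract from data its projection onto each unit eigenvector."""
    result = list(data)
    for e in vectors:
        dot = sum(r * x for r, x in zip(result, e))
        result = [r - dot * x for r, x in zip(result, e)]
    return result

# Made-up 4 x 4 matrix dominated by the pattern [1, 1, -1, -1].
w = [[4.0, 2.0, -3.0, -3.0],
     [2.0, 4.0, -3.0, -3.0],
     [-3.0, -3.0, 3.0, 3.0],
     [-3.0, -3.0, 3.0, 3.0]]
eigenvalues, eigenvectors = top_components(w, 2)

marker = [1.5, 0.5, -1.0, -1.0]      # one centered marker
phenotype = [0.5, 0.5, -0.5, -0.5]   # centered dependent variable
corrected_marker = remove_components(marker, eigenvectors[:1])
corrected_phenotype = remove_components(phenotype, eigenvectors[:1])
```

In this toy case the phenotype lies entirely along the first component, so correcting with k = 1 leaves nothing of it, while the marker keeps the part of its variation that is unrelated to that component.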

For a more precise and “mathematical” explanation of this process, please see Formulas for Principal Component Analysis.

How Many Components to Use?

This is, to some degree, an open question as it is a matter of personal preference. Note that if you choose as many components as there are markers, you will wind up subtracting out ALL effects, thus getting nothing from your tests. The maximum number of components that can be selected is the number of samples less one. If a larger number is selected, then the number of principal components found will still be the number of samples less one.

The best answer consists of first simply obtaining the components themselves and their corresponding eigenvalues for N − 1 principal components, where N is the total number of samples in the dataset. This can be done either by running uncorrected tests or from the separate Principal Component Analysis windows in the Quality Assurance menu of a spreadsheet. See Genotypic Principal Component Analysis or Numeric Principal Component Analysis.

Then evaluate the pattern of the eigenvalues in the spreadsheet or plot the eigenvalue variable column. (Right-click on the column header and select Plot Variable.) If the first few are very large compared with the remaining eigenvalues, then use that many components in a second analysis. Another way to determine the number of components to use is to examine cluster plots of the first few principal components plotted against each other, i.e. one versus two, two versus three, and so on. (See Multi-Color Scatter Plots for PCA or Gender Analysis.) When there are no longer discernible clusters of groups based on phenotype categories, then that is the number of components to use for PCA correction.
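
As a rough numerical companion to inspecting the eigenvalue plot by eye, the sketch below looks for the largest relative drop between consecutive eigenvalues. This heuristic is an assumption made for illustration only and is not a rule used by SVS; the function name and the eigenvalues in the example are made up.

```python
def components_before_largest_drop(eigenvalues):
    """Return (k, ratio), where k counts the eigenvalues above the largest relative drop."""
    ordered = sorted(eigenvalues, reverse=True)
    best_k, best_ratio = 0, 1.0
    for k in range(1, len(ordered)):
        if ordered[k] <= 0.0:
            break                      # stop at zero or negative eigenvalues
        ratio = ordered[k - 1] / ordered[k]
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return best_k, best_ratio

# Made-up eigenvalues: two large ones followed by a flat tail.
eigenvalues = [48.2, 31.5, 4.1, 3.8, 3.6, 3.3]
k, ratio = components_before_largest_drop(eigenvalues)
```

Here the first two eigenvalues stand well clear of the rest, so the sketch suggests k = 2. A result like this is best treated as a starting point to confirm against the plots described above.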

A Second Answer (for Genotypic Data Only)

The inflation factor can also be used to adjust for stratification. An inflation factor close to one would indicate stratification is not present in the data. The number of principal components can be determined by increasing or decreasing the number of components until the inflation factor is as close to one as possible. Using the inflation factor to determine the number of principal components to use is a variant of Genomic Control. See Correcting for Stratification by Genomic Control for more information.
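
A minimal sketch of this search follows. It assumes the usual Genomic Control definition of the inflation factor, the median of the observed one-degree-of-freedom chi-squared statistics divided by the median of the chi-squared distribution with one degree of freedom (about 0.4549). The exact computation in SVS is described in the section referenced above. The function names and the test statistics are hypothetical.

```python
import statistics

# Median of the chi-squared distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(chi2_stats):
    """Median of the observed statistics over the expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def pick_k_by_inflation(stats_by_k):
    """Return the component count whose inflation factor is closest to one.

    stats_by_k: dict mapping a number of components k to the list of
    chi-squared statistics obtained after correcting with k components.
    """
    return min(stats_by_k,
               key=lambda k: abs(inflation_factor(stats_by_k[k]) - 1.0))

# Made-up statistics for analyses run with 0, 1 and 2 components.
stats_by_k = {
    0: [0.9, 1.4, 0.7, 2.1, 0.8],
    1: [0.5, 0.6, 0.3, 1.2, 0.45],
    2: [0.4, 0.5, 0.2, 1.0, 0.46],
}
best_k = pick_k_by_inflation(stats_by_k)
```

With these illustrative numbers the inflation factor falls from roughly 1.98 with no correction to roughly 1.01 with two components, so two components would be chosen.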

Removing Outlier Subjects

If you are interested in what the principal components are, or wish to be more sure you have corrected for stratification through PCA, there is the option of repeating the determination of the principal components after removing patients or subjects from analysis who are found to have extreme values in one or more of the previously-determined principal components.

SVS automates this process. Select the number of principal components that should be involved in this process, how many standard deviations in all of these components a patient or subject should be within, and how many times to repeat this process. The final testing and principal components spreadsheet will avoid using any of the outlier subjects, which will be logged in a separate spreadsheet.
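
The process can be sketched as follows, under stated assumptions: a subject is treated as an outlier if its value in any of the chosen components lies more than the selected number of standard deviations from that component's mean, and the components are recomputed on the remaining subjects before each repeat. The function names are hypothetical, and `compute_components` stands in for whatever routine produces the principal components for a given set of subjects.

```python
import statistics

def find_outliers(components, max_sd):
    """Return sorted indices of subjects beyond max_sd in any component.

    components: list of components, each a list of per-subject values.
    """
    outliers = set()
    for comp in components:
        mean = statistics.mean(comp)
        sd = statistics.pstdev(comp)
        if sd == 0.0:
            continue
        for idx, value in enumerate(comp):
            if abs(value - mean) > max_sd * sd:
                outliers.add(idx)
    return sorted(outliers)

def iterative_outlier_removal(subject_ids, compute_components, max_sd, repeats):
    """Repeat outlier detection, recomputing components on the kept subjects."""
    kept = list(subject_ids)
    removed = []
    for _ in range(repeats):
        components = compute_components(kept)
        flagged = find_outliers(components, max_sd)
        if not flagged:
            break
        flagged_set = set(flagged)
        removed.extend(kept[i] for i in flagged)
        kept = [s for i, s in enumerate(kept) if i not in flagged_set]
    return kept, removed

# Stand-in for a real PCA: fixed made-up scores on a single component.
scores = {"s0": 0.1, "s1": -0.1, "s2": 0.05, "s3": -0.05,
          "s4": 0.0, "s5": 0.1, "s6": -0.1, "s7": 3.0}

def lookup_components(kept):
    return [[scores[s] for s in kept]]

kept, removed = iterative_outlier_removal(list(scores), lookup_components,
                                          max_sd=2.0, repeats=3)
```

In this toy run subject s7 is removed on the first pass and no further subjects are flagged on the second, so the loop stops early. The `removed` list plays the role of the separate outlier log.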

PCA-Corrected Association Testing

The only tests available, once you have corrected the input data for stratification using Principal Component Analysis, are the Correlation/Trend Test (Test Statistics) and Logistic/Linear Regression (Test Statistics).

This is because these tests are based on numeric values for both the predictor and the response. Thus, even if these numeric values are corrected from, for instance, the “simple” values of zero, one and two derived from the genotypes, or the “simple” values of zero or one derived from a case/control variable, these tests may still be run.