Created:
March 27, 2008

Updated:
July 16, 2008

User Level:

Intermediate

Products:
HelixTree, CNAM, WGA Module


Step 3. Correct for Batch Effects and Stratification

Sometimes an association can be confounded by population stratification. This happens when a condition is more prevalent in one group of samples than in another, resulting in spurious associations between the condition or trait being tested and any genetic characteristics that vary between the two groups.

While it is good practice to base studies on the most homogeneous test subjects possible, it has been noted that even those who classify themselves as “Caucasian” have mild variation in genetic characteristics, enough to confound a study done over thousands of genetic markers.

Additionally, it has been our experience, especially with copy number analysis, that variations in test equipment can confound studies. These are referred to as batch effects, and they are often the result of improperly randomizing the genotyping of cases and controls, males and females, etc. For example, perhaps all cases (or all females) were processed at one site or on one day and all controls (or males) at another.

This tutorial will lead you through correcting LogRs for batch effects and stratification using an Eigenstrat-based principal component analysis (PCA) method. The result will be a new DSF file with PCA-corrected LogRs. This file can then be used to perform LogR association tests (Step 4) or copy number segmenting (Step 5).
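To make the idea of PCA correction concrete, here is a minimal sketch in Python with NumPy. It is not HelixTree's implementation; the function name `pca_correct`, the samples-by-markers layout, and the choice to add the per-marker means back after correction are all assumptions made for illustration. The sketch centers each marker, finds the top principal components with a singular value decomposition, and subtracts the part of the LogR matrix that those components explain.

```python
import numpy as np

def pca_correct(logr, n_components):
    """Remove the top principal components from a LogR matrix.

    logr: 2-D array, rows = samples, columns = markers (assumed layout).
    n_components: how many leading components to subtract.
    """
    logr = np.asarray(logr, dtype=float)
    # Center each marker so the components describe variation around its mean.
    col_means = logr.mean(axis=0)
    centered = logr - col_means
    # centered = U * diag(s) * Vt; the columns of U are sample-level components.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = n_components
    # Part of the data explained by the top k components.
    explained = (u[:, :k] * s[:k]) @ vt[:k, :]
    # Subtract it and restore the marker means (an illustrative choice).
    return centered - explained + col_means

# Toy data: 20 samples, 50 markers, with the first 10 samples shifted as a "batch".
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.05, size=(20, 50))
batch_sign = np.r_[np.ones(10), -np.ones(10)][:, None]
batch_effect = batch_sign * rng.normal(0.0, 1.0, size=(1, 50))
logr = noise + batch_effect
corrected = pca_correct(logr, n_components=1)
```

In this toy data the batch shift dominates the variation, so the first component captures it and the corrected values no longer separate by batch. Real data can need more components, which is why the number to use has to be chosen with care.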

Determining How Many Principal Components to Use
Before you apply PCA to your LogRs, you need to determine how many principal components to use in the analysis. Determining this is to some degree an open question. If you choose too many, you may wind up subtracting out all effects, thus getting nothing from your tests.

We recommend a heuristic approach by Mu Zhu et al. (2006), which we've augmented to use HelixTree to compute and plot the log of the eigenvalues and then use HelixTree's segmenting algorithm to find the position of the "elbow" on a scree plot. The position of the elbow is essentially the number of principal components to use. The following tutorial will lead you through this approach:

›› Determining the Correct Number of PCs to Use to Correct for Batch Effects and Population Stratification
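As a rough illustration of what finding the elbow means, the sketch below splits the log eigenvalues into two consecutive groups and picks the split that minimizes the summed within-group squared error. This is a simplified stand-in in the spirit of the heuristic described above, not HelixTree's segmenting algorithm; the function name `elbow_position` and the example eigenvalues are made up for illustration.

```python
import math

def elbow_position(eigenvalues):
    """Return the number of leading components before the elbow.

    eigenvalues: positive values sorted from largest to smallest.
    """
    if len(eigenvalues) < 2:
        raise ValueError("need at least two eigenvalues")
    logs = [math.log(v) for v in eigenvalues]

    def sse(values):
        # Sum of squared deviations from the group mean.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_q, best_cost = None, math.inf
    # Try every split: the first q values form one group, the rest the other.
    for q in range(1, len(logs)):
        cost = sse(logs[:q]) + sse(logs[q:])
        if cost < best_cost:
            best_q, best_cost = q, cost
    return best_q

# Hypothetical eigenvalues: three large ones followed by a flat tail.
example = [50.0, 30.0, 20.0, 1.2, 1.1, 1.0, 0.9, 0.8]
n_pcs = elbow_position(example)  # 3 for this example
```

Working on the log scale keeps a few very large eigenvalues from dominating the fit, which matches the tutorial's use of the log of the eigenvalues on the scree plot.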

Applying PCA Correction
Once you have determined how many principal components to use, you can go back into the LogR Association Tests & PCA window (under the CNAM menu) to correct for batch effects and stratification.

From this window, choose the DSF file you want to correct and uncheck all the association test boxes.

Next, make sure the Correct Batch Effects/Stratification with PCA box is checked.

You also need to check the Output the PCA adjusted logRs to a DSF box. This allows you to import and segment the PCA corrected data using the Copy Number Segmentation window (Step 5), or use it as the input DSF in this window to run LogR association tests (Step 4) without having to re-run PCA correction each time.

Browse to the path where you want to save the corrected DSF, name it and click Save.

NOTE: It is good to give the DSF file an intuitive name, such as one referring to the number of principal components used. The corrected LogRs will differ depending on how many principal components are used.

Click on the Principal Component Analysis Parameters tab.

Here you need to enter the number of principal components you determined previously. This time it is not necessary to output the Eigenvalue spreadsheet, so you can uncheck this box. Again, for this tutorial don’t worry about PCA outlier removal.

Next, click on the Exclude SNPs tab.

From this tab, you can now exclude the markers you identified in Step 2. Browse to the CSV file created in Step 2 and click Open. A list of markers from the CSV file should now be listed as in Figure 5.

Figure 5. Exclude Markers Tab

When you are finished, click Run.

The result of this analysis is a new DSF file containing the PCA-corrected LogRs, without the excluded markers, along with a secondary principal component spreadsheet.

From here you can proceed to perform LogR association tests (Step 4) or perform copy number segmentation (Step 5) on the corrected DSF.