DETERMINING THE CORRECT NUMBER OF PRINCIPAL COMPONENTS TO USE FOR PCA
Association findings can be confounded by population stratification or by variations in test equipment (collectively referred to as batch effects). Confounding arises because a condition may be more prevalent in one group than in another, which produces spurious associations between the condition or trait being tested and genetic characteristics that differ between the groups.
HelixTree and CNAM use an enhanced version of EIGENSTRAT-based principal component analysis (PCA) to subtract patterns in your data caused by stratification and batch effects. This method can minimize, or eliminate altogether, the influence of stratification on association results.
But an important question arises: How many principal components (PCs) should you use in the analysis? If you choose too many, you may wind up subtracting out all effects, thus getting nothing from your tests.
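To make the idea of subtracting principal components concrete, here is a minimal Python sketch using NumPy. It is not the HelixTree or CNAM implementation; the function name `remove_top_pcs`, the synthetic two-group data, and the `group_gap` helper are all assumptions made for illustration. It shows that removing one component largely removes a group-level shift, while removing as many components as there are samples minus one leaves essentially nothing to test.

```python
import numpy as np

def remove_top_pcs(X, k):
    """Subtract the top-k principal components from a samples-by-markers matrix."""
    Xc = X - X.mean(axis=0)                       # center each marker (column)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Uk = U[:, :k]                                 # top-k axes in sample space
    return Xc - Uk @ (Uk.T @ Xc)                  # residual after projecting them out

# Synthetic data (an assumption for illustration): two groups of samples
# whose markers differ by a fixed, group-specific shift.
rng = np.random.default_rng(0)
n_samples, n_markers = 40, 200
group = np.repeat([0, 1], n_samples // 2)
shift = rng.normal(0.0, 1.0, n_markers)
X = rng.normal(0.0, 1.0, (n_samples, n_markers)) + np.outer(group, shift)

def group_gap(M):
    """Mean absolute difference between the two groups' per-marker means."""
    return np.abs(M[group == 1].mean(axis=0) - M[group == 0].mean(axis=0)).mean()

gap_before = group_gap(X)
gap_after = group_gap(remove_top_pcs(X, 1))
# Removing n_samples - 1 components exhausts the rank of the centered data.
residual_all = np.abs(remove_top_pcs(X, n_samples - 1)).max()

print(f"group gap before correction: {gap_before:.3f}")
print(f"group gap after removing 1 PC: {gap_after:.3f}")
print(f"largest residual after removing {n_samples - 1} PCs: {residual_all:.2e}")
```

The last line illustrates the over-correction risk: centered data from n samples has rank of at most n - 1, so subtracting that many components removes every pattern, real or spurious.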
This tutorial covers a heuristic outlined by Mu Zhu et al. (2006) to help you determine the correct number of principal components to use for both SNP and CNV analysis based on a scree plot of eigenvalues.
Overview:
1. Run PCA with the number of PCs equal to the total number of samples minus one.
2. Compute and plot the log of eigenvalues.
3. Find the position of the "elbow" in the scree plot.
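The last two steps can be sketched in Python as follows. This is a simplified profile-likelihood elbow finder in the spirit of the cited heuristic, not the procedure as implemented in the software: the function name `profile_likelihood_elbow` and the eigenvalues are made up for illustration, and the sketch assumes the log eigenvalues on each side of a candidate split are Gaussian with separate means and a shared variance. It only considers splits that leave at least one value on each side.

```python
import numpy as np

def profile_likelihood_elbow(values):
    """Return (q, loglik): the split size q that maximizes the profile
    log-likelihood, and the log-likelihood for q = 1 .. p-1."""
    d = np.sort(np.asarray(values, dtype=float))[::-1]   # descending order
    p = d.size
    if p < 3:
        raise ValueError("need at least three values to locate an elbow")
    loglik = np.empty(p - 1)
    for q in range(1, p):
        g1, g2 = d[:q], d[q:]                            # "large" vs "small" group
        ss = ((g1 - g1.mean()) ** 2).sum() + ((g2 - g2.mean()) ** 2).sum()
        var = max(ss / (p - 2), np.finfo(float).tiny)    # pooled variance estimate
        loglik[q - 1] = -0.5 * p * np.log(2 * np.pi * var) - ss / (2 * var)
    return int(np.argmax(loglik)) + 1, loglik

# Hypothetical eigenvalues from a PCA run (made up for illustration).
eigenvalues = [50, 30, 20, 1.2, 1.1, 1.0, 0.95, 0.9, 0.85, 0.8]
log_eigenvalues = np.log(eigenvalues)                    # step 2: log of eigenvalues
n_pcs, loglik = profile_likelihood_elbow(log_eigenvalues)  # step 3: locate the elbow
print("suggested number of PCs:", n_pcs)
```

Plotting `log_eigenvalues` against component index gives the scree plot. The returned split size is only a suggestion and is worth checking against the plot by eye.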
REQUIREMENTS
To complete this tutorial, you will need the following software:
Prerequisite Knowledge:
Intermediate SVS functionality
References
1. M. Zhu et al. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis 51 (2006) 918–930.
TABLE OF CONTENTS
- Introduction
- Compute and Plot the Log of Eigenvalues
- Find the Position of the "Elbow" in the Scree Plot