Formulas for Principal Components Analysis
This technique was pioneered at the Broad Institute, which distributes a program called “EIGENSTRAT” which implements this technique. They describe the PCA correction technique in [Price 2006]. A more thorough discussion of stratification, principal components analysis, and the eigenvalues involved may be found in [Patterson 2006].
NOTE: HelixTree currently uses PCA with association tests performed against genotypes, using their numeric equivalent values (18), and also with association tests performed against LogR copy number data in the CNAM module (25).
Suppose X is the m (marker) by n (subject) matrix of the numeric equivalent values of the genotypes, with the exception that for each marker, what had been the average value over that marker has first been subtracted out from the marker’s values.
26.23.1 Motivation
The idea is that X can be thought of as decomposed into USV T, where U is an m by n matrix whose kth column contains coordinates of each marker along a kth principal component, S is an n by n singular-value matrix, and V is an n by n matrix of “ancestries” relating the subjects to the singular values.
It turns out that for this analysis, XTX (which is like a Wishart matrix) is a useful matrix for computational purposes. We have

,
but UTU = I, so

.
If XTX is decomposed into its eigenvectors and eigenvalues, we could write

, or

, where λk and vk are the eigenvalues and eigenvectors of XTX.
Thus, we conclude that the vk are “ancestry” vectors, with the most important ones being those corresponding to the highest λk. Our object will be to, in some way, subtract these most important vk out from the rest of the data.
26.23.2 Technique
In actual practice, when accumulating the elements of XTX, the marker data is optionally weighted in favor of those markers with less variance by normalizing them according to their variance, either theoretical or actual. See 26.23.4 regarding marker normalizing.
Otherwise, the technique is to:
- Scan the markers and obtain the Wishart-like XTX matrix.
- Find the desired number of eigenvectors and eigenvalues from XTX.
- If PCA-corrected testing is being performed, apply the PCA corrections, first from the dependent variable and then during the testing scan from each marker of the genetic data just before it is tested.
26.23.3 Applying the PCA Corrections
Suppose r is a vector containing the dependent variable (response) values or numeric-equivalent values. We subtract the average element value from each element to get r0; then,

, where ρ1 = v1ṙ0. Similarly,

, where ρ2 = v2ṙ1, etc., up to

, where ρk = vkṙk-1.
Note that the ρ values are obtained through dot products, where componentwise,

.
We finally get the corrected r by adding back the average element value of the original r to each element of rk.
The idea is that for each principal component, we find out what magnitude of it is present in r (or rk-1)–then we subtract that magnitude of that component from rk-1 to get rk.
Exactly the same technique is applied for every vector g of data from an individual marker. After subtracting the average element value to get g0, we compute

, where γ1 = v1ġ0. Similarly,

, where γ2 = v2ġ1, etc., up to

, where γk = vkġk-1.
Componentwise, we have

.
We finally get the corrected g by adding back the average element value of the original g to each element of gk.
26.23.4 Formulas for PCA Normalization
The possibilities for marker normalization in HelixTree are:
- The theoretical standard deviation of this marker’s data at Hardy-Weinberg equilibrium (HWE). By this is
meant what the standard deviation of this marker’s data would have to be over the population if, in the
population, it were in Hardy-Weinberg equilibrium and had the same major and minor allele frequencies as are
actually measured for this marker. This is the standard method used by the “EIGENSTRAT” program.
This theoretical standard deviation is as follows according to the genetic model being used:
- Additive (the only model supported by “EIGENSTRAT” itself): 2
p(1 - p)), where p = (1 +
∑
gj)∕(2 + 2n) is a bayesian estimate of the minor allele probability in terms of the numeric marker
values gj.
- Dominant:
p(1 - p)), where p = (.5 + ∑
gj)∕(1 + n) is a bayesian estimate of the probability of
being a DD or Dd genotype.
- Recessive:
p(1 - p)), where p = (.5 + ∑
gj)∕(1 + n) is a bayesian estimate of the probability of
being a DD genotype.
- Additive (the only model supported by “EIGENSTRAT” itself): 2
- The actual standard deviation of this marker’s data.
- Don’t normalize.