Almost every statistic that is used for significance testing in scientific research comes with a set of assumptions. If the test is applied to data that does not meet those assumptions, then the results of the test may not be valid. The statistics commonly used in genome-wide association studies (GWAS), such as the Fisher’s Exact Test or the Chi-Square Test, have the assumption of independent observations. In the GWAS paradigm, this means that the observed genotypes must come from unrelated subjects in a random mating population. If subjects are related to one another, then their genotypes will of course be correlated, and the distribution of genotypes used in the test will not appropriately reflect the genotype distribution in the population. Consanguinity, or inbreeding, can similarly influence GWAS tests by violating the assumption of random mating.
Admittedly, no method can determine with certainty whether pairs of subjects are related or whether the sampled population has been inbred. But with large numbers of SNPs, we can obtain reasonable estimates of both and use these estimates for quality assurance. If there are problems from relatedness or inbreeding that we were not aware of, these tests should flag them, and researchers can then exclude the problematic subjects' data.
There are also quality assurance reasons to check for apparent relatedness or inbreeding that have nothing at all to do with whether the test subjects are actually related or inbred. These include unexpected duplicate samples resulting from plating errors, duplicate samples from one of a pair of genotyping chips but not the other one (in the case of dual-array systems like the Affy 500k), and sample contamination. Samples in which these things have occurred will often pass through other quality assurance measures unnoticed.
SVS 7.4 features two new quality assurance tests you can use to assess relatedness and inbreeding, plus one additional feature that can be used in combination with these tests. They are based on industry-standard techniques.
The first test estimates genome-wide Identity by Descent (IBD) between pairs of samples. Identity by Descent is a measure of how many alleles at any marker in each of the two samples came from the same ancestral chromosomes. (This is in contrast to the Identity by State (IBS) measure, which is simply a measure of how many alleles at any marker in each of the two samples happen to be the same, for whatever reason.) We denote the probability that zero, one, or two alleles are identical by descent (“shared IBD”) by the notations P(Z=0), P(Z=1), and P(Z=2), respectively. These probabilities may either refer to given markers or be thought of as sample-wide. We also use a combined measure called PI, which is P(Z=2) plus one-half of P(Z=1). This is the expected proportion of alleles shared IBD at any given marker; the expected number of shared alleles is twice PI.
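To make the PI measure concrete, here is a small Python sketch that computes PI from the three IBD probabilities and prints it for a few textbook relationships. The function name and the table are mine, for illustration only; the probabilities are the theoretical values for an outbred population with no genotyping error, not output from SVS.

```python
def pi_hat(p_z0, p_z1, p_z2):
    """Return PI = P(Z=1)/2 + P(Z=2) for one pair of samples."""
    total = p_z0 + p_z1 + p_z2
    if abs(total - 1.0) > 1e-9:
        raise ValueError("P(Z=0), P(Z=1), and P(Z=2) must sum to 1")
    return 0.5 * p_z1 + p_z2


# Theoretical (P(Z=0), P(Z=1), P(Z=2)) for some standard relationships.
EXPECTED_IBD = {
    "unrelated": (1.0, 0.0, 0.0),
    "half siblings": (0.5, 0.5, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "parent-offspring": (0.0, 1.0, 0.0),
    "duplicate or identical twin": (0.0, 0.0, 1.0),
}

for relationship, probs in EXPECTED_IBD.items():
    print(f"{relationship}: PI = {pi_hat(*probs):.2f}")
```

Note that full siblings and parent-offspring pairs have the same PI of 0.5, which is why it helps to look at the three probabilities and not only at the combined measure.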
Using your genotypic data to estimate allele frequencies for markers, SVS 7.4 will “work backward” to compute the most reasonable genome-wide IBD probabilities from the data, assuming it came from an ethnically homogeneous, random-mating population.
The other new test estimates inbreeding coefficients for your data. This test assumes that homozygous genotypes for an individual occur either by chance or because both alleles come from the same ancestor. SVS 7.4 will use your genotypic data to estimate allele frequencies, then the number of homozygous genotypes that should be expected in each of your samples, and from that, the inbreeding coefficient ƒ for each of your samples.
The additional feature that can, and usually should, be used with these tests is LD Pruning. This feature will deactivate (or “prune”) markers that are in linkage disequilibrium with other markers that are left active so that the remaining markers will be approximately independent of each other. LD Pruning should normally be conducted before using either the IBD test or the inbreeding test to avoid biases from groups of correlated markers.
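The calculation described above can be sketched in a few lines of Python. This is a minimal version of the usual method-of-moments estimator, ƒ = (observed homozygotes − expected homozygotes) / (markers − expected homozygotes), assuming Hardy-Weinberg proportions; I am assuming SVS follows this general form, and any small-sample corrections it may apply are not shown. The function and argument names are hypothetical.

```python
def inbreeding_coefficient(genotypes, alt_freqs):
    """Estimate f for one sample.

    genotypes: alt-allele counts (0, 1, or 2) per marker, None if missing.
    alt_freqs: estimated alt-allele frequency per marker.
    Returns (f, n_markers, observed_hom, expected_hom).
    """
    n_markers = 0
    observed_hom = 0
    expected_hom = 0.0
    for g, p in zip(genotypes, alt_freqs):
        if g is None or p <= 0.0 or p >= 1.0:
            continue  # skip missing calls and monomorphic markers
        n_markers += 1
        if g != 1:
            observed_hom += 1
        # Under Hardy-Weinberg, P(homozygous) = p^2 + q^2 = 1 - 2pq.
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n_markers == 0:
        raise ValueError("no usable markers for this sample")
    f = (observed_hom - expected_hom) / (n_markers - expected_hom)
    return f, n_markers, observed_hom, expected_hom


# Four markers, each with allele frequency 0.5 (2 homozygotes expected).
freqs = [0.5, 0.5, 0.5, 0.5]
print(inbreeding_coefficient([0, 2, 0, 2], freqs)[0])  # all homozygous: 1.0
print(inbreeding_coefficient([0, 1, 2, 1], freqs)[0])  # as expected: 0.0
print(inbreeding_coefficient([1, 1, 1, 1], freqs)[0])  # all heterozygous: -1.0
```

An excess of homozygotes gives a positive ƒ and an excess of heterozygotes gives a negative ƒ.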
It is important to note the assumptions that must be met in order for both the IBD and the Inbreeding estimates to be valid. Like the GWAS statistics mentioned previously, these methods work best when most of the subjects being tested are unrelated individuals from a random mating population. This also means that the Inbreeding Coefficient estimate assumes that the majority of subjects being tested are not from an inbred population. Both of these algorithms use estimated allele frequencies to determine the expected rate of homozygous and heterozygous genotypes under Hardy-Weinberg Equilibrium and therefore require large enough sample sizes for the allele frequency estimates at each marker to be somewhat accurate. You will not get a good IBD estimate from a dataset consisting of only two subjects. If you only have data for two subjects and want to estimate the relatedness between them, consider joining your data with a public dataset from a similar population using the same genotyping platform.
Now, let us use these tests. First, we apply LD Pruning. To do this, go to the Quality Assurance menu, select Genotype, then select LD Pruning and select the pruning options you desire. Within a moving window (default size of 50, default move increment of 5), pairs of markers will be checked for being in linkage disequilibrium with each other. If such a pair is in LD with an r-squared greater than what you have specified (default 0.5, but stricter thresholds may be desired in practice), the first marker of that pair is deactivated, and will no longer be used for LD comparison with other markers. (In order to save time, we can use the (default) CHM method of checking r-squared between the markers, and we will get results very similar to the “more standard” EM method.) The markers that remain active after this process will be independent of each other at least to the degree specified. Note that if we apply LD Pruning to genome-wide data, there should still be many markers that remain active.
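For readers who want to see the moving-window logic spelled out, here is a rough Python sketch. It is not the SVS implementation: I am assuming that r-squared can be approximated by the squared Pearson correlation of alt-allele counts, which is neither the CHM nor the EM method, and the function names are made up for illustration. The defaults mirror the ones mentioned above (window of 50, increment of 5, threshold of 0.5).

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype-count vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(markers, window=50, step=5, r2_threshold=0.5):
    """Return the indices of markers left active after pruning.

    markers: list of per-marker lists of alt-allele counts (one per sample),
    in map order.
    """
    active = [True] * len(markers)
    start = 0
    while start < len(markers):
        end = min(start + window, len(markers))
        for i in range(start, end):
            if not active[i]:
                continue
            for j in range(i + 1, end):
                if not active[j]:
                    continue
                if r_squared(markers[i], markers[j]) > r2_threshold:
                    active[i] = False  # deactivate the first marker of the pair
                    break              # and stop comparing it with others
        if end == len(markers):
            break
        start += step
    return [k for k, keep in enumerate(active) if keep]


m0 = [0, 1, 2, 0, 1, 2]
m1 = [0, 1, 2, 0, 1, 2]   # perfectly correlated with m0
m2 = [2, 0, 1, 1, 2, 0]   # r-squared of 0.25 with m0 and m1
print(ld_prune([m0, m1, m2]))  # [1, 2]
```

This sketch recomputes r-squared for pairs that appear in overlapping windows, which is wasteful; a real implementation would cache or avoid those comparisons.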
Next, let us apply the Identity by Descent (IBD) test on the active data resulting from LD Pruning. To run this test on your genotypic data, go to the Quality Assurance menu, select Genotype, select Identity by Descent Estimation, and select the output options you desire. The untransformed estimates of P(Z=0), P(Z=1), and P(Z=2) are bounded between zero and one wherever the raw estimates fall outside that range. PI = P(Z=1)/2 + P(Z=2) is derived from these untransformed but bounded estimates of P(Z=0), P(Z=1), and P(Z=2). If the square of PI is less than P(Z=2), SVS will calculate more biologically plausible transformed values of the P’s. To output these, select Output transformed estimates P*(Z=0), P*(Z=1), and P*(Z=2).
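The exact transformation SVS applies is not spelled out here. As an illustration only, the sketch below uses a transformation of the kind described for PLINK's method-of-moments estimator: when PI squared is less than P(Z=2), the three probabilities are replaced by binomial values that keep PI unchanged. Treat it as an assumption about the general idea, not as the SVS algorithm; the function name is hypothetical.

```python
def transform_ibd(p_z0, p_z1, p_z2):
    """Return (P*(Z=0), P*(Z=1), P*(Z=2)) from bounded estimates.

    Assumed rule: if PI^2 < P(Z=2), use the binomial probabilities
    ((1 - PI)^2, 2 PI (1 - PI), PI^2); otherwise leave the values alone.
    """
    pi = 0.5 * p_z1 + p_z2
    if pi * pi < p_z2:
        return (1.0 - pi) ** 2, 2.0 * pi * (1.0 - pi), pi * pi
    return p_z0, p_z1, p_z2


# Too much "two alleles shared" for so little "one allele shared".
print(transform_ibd(0.6, 0.0, 0.4))
# A full-sibling pattern sits exactly on the boundary and is unchanged.
print(transform_ibd(0.25, 0.5, 0.25))
```

The transformed values still give the same PI, since (1/2) × 2·PI·(1 − PI) + PI² = PI.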
All of the above outputs are in the form of N x N spreadsheets, where N is the number of samples. Such spreadsheets may be visualized in a heatmap in order to see any underlying patterns. A contaminated DNA sample will often appear to be related to the majority of other samples and will be very obvious in a heatmap plot.
To output a pair-wise table of all of these values wherever the value of PI is greater than a certain threshold value, check the “Output all pairs where PI >” checkbox. Enter “0” as the threshold to view all pairs from the data.
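If you export the N x N PI spreadsheet and want to build a similar pair-wise table yourself, the filtering step is simple. The sketch below assumes the matrix has been loaded as a list of lists with a matching list of sample names; both the function and the sample names are hypothetical.

```python
def pairs_above(pi_matrix, sample_ids, threshold):
    """List (sample_a, sample_b, PI) for each pair with PI > threshold."""
    rows = []
    n = len(sample_ids)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each pair listed once
            # Strict ">" mirrors the checkbox label, so in this sketch
            # pairs with PI exactly 0 are dropped even at threshold 0.
            if pi_matrix[i][j] > threshold:
                rows.append((sample_ids[i], sample_ids[j], pi_matrix[i][j]))
    return rows


pi_matrix = [
    [1.00, 0.98, 0.00],
    [0.98, 1.00, 0.12],
    [0.00, 0.12, 1.00],
]
sample_ids = ["S1", "S2", "S3"]
for row in pairs_above(pi_matrix, sample_ids, 0.5):
    print(row)  # ('S1', 'S2', 0.98)
```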
If you are using a pedigree spreadsheet, you will be allowed to select or deselect the Use only founders for allele counting checkbox. To make estimates of marker allele frequencies be based only on founder data, select this option. This will result in better estimates of the population allele frequencies for pedigrees.
Suppose you have duplicate samples of genotypic data. For such a pair, P(Z=2) (the probability that both alleles are identical by descent) and PI will both be estimated to be very close to 1.
On the other hand, suppose you use two complementary genotyping chips for each subject and you have duplicate samples from one genotyping array but not from the other array. In this case, the IBD test will estimate a P(Z=0) (“completely non-identical”) near 0.5, a P(Z=1) (“one allele is identical”) near zero, and a P(Z=2) (“completely identical”) near 0.5, where the P(Z=2) will be driven to 0.5 by the duplication from the one array.
Suppose you have contamination of one sample with many others. A heat map from the PI result spreadsheet from the IBD test will show you a high level of “relatedness” between the one sample and the other samples. (NOTE: The estimated PI is sometimes called PI-HAT elsewhere.) This is partly because mixtures of homozygous genotypes, say, AA with GG, will appear to be heterozygous (AG). At a biallelic marker, a heterozygous genotype always shares at least one allele with any other genotype, so the contaminated sample appears to share more alleles with the other samples, and the estimates of IBD between them will be higher.
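The reasoning behind those numbers is just a weighted average over markers. The sketch below assumes that the genome-wide estimate behaves like a marker-weighted mix of "fully identical" markers (from the duplicated array) and "unrelated" markers (from the other array), and it uses round, made-up marker counts for the two arrays.

```python
def partial_duplicate_ibd(n_dup_markers, n_other_markers):
    """Approximate (P(Z=0), P(Z=1), P(Z=2), PI) for a partial duplicate.

    n_dup_markers: markers on the array that was duplicated (Z = 2 there).
    n_other_markers: markers on the array from unrelated samples (Z = 0).
    """
    total = n_dup_markers + n_other_markers
    if total <= 0:
        raise ValueError("need at least one marker")
    p_z2 = n_dup_markers / total
    p_z1 = 0.0
    p_z0 = n_other_markers / total
    pi = 0.5 * p_z1 + p_z2
    return p_z0, p_z1, p_z2, pi


z0, z1, z2, pi = partial_duplicate_ibd(250_000, 250_000)
print(z0, z1, z2, pi)   # 0.5 0.0 0.5 0.5
print(pi * pi < z2)     # True
```

Because PI squared (0.25) is less than P(Z=2) (0.5), this is one of the cases in which the transformed estimates will differ from the untransformed ones, so the untransformed output is the better place to spot this problem.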
Now, let us estimate inbreeding coefficients. In general, it is also recommended to run this test on the active data resulting from LD Pruning. To run this test, go to the Quality Assurance menu, select Genotype, and select Inbreeding Coefficients. Specify whether the Genome is Human or Non-Human as well as the Number of Autosomes. A spreadsheet will be output with one column for the inbreeding coefficient ƒ for each sample and three other columns for the number of markers for each sample, the number of observed homozygotes for each sample, and the number of expected homozygotes for each sample, respectively.
While truly inbred samples should have highly positive inbreeding coefficients, indicating many homozygous genotypes, contaminated samples will appear to have highly negative inbreeding coefficients. These result from the high number of “heterozygous” genotypes they appear to contain, resulting from the mixture of different alleles in such samples.
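A quick simulation shows the effect. This Python sketch is not the simulation used for the results in this article; it assumes that a mixed sample is called heterozygous whenever the two contributing genotypes are not the same homozygote, and it uses the true simulated allele frequencies in place of estimated ones. All names are hypothetical.

```python
import random


def simulate_genotype(p, rng):
    """Draw an alt-allele count (0, 1, 2) under Hardy-Weinberg proportions."""
    return (rng.random() < p) + (rng.random() < p)


def contaminate(g_own, g_other):
    """Genotype call for a mixture of two samples at one marker (assumed)."""
    if g_own == g_other and g_own != 1:
        return g_own  # both samples are the same homozygote
    return 1          # any other mixture shows both alleles


def inbreeding_f(genotypes, freqs):
    """Method-of-moments f: (O - E) / (N - E) for homozygote counts."""
    observed_hom = sum(1 for g in genotypes if g != 1)
    expected_hom = sum(1.0 - 2.0 * p * (1.0 - p) for p in freqs)
    return (observed_hom - expected_hom) / (len(genotypes) - expected_hom)


rng = random.Random(0)
freqs = [rng.uniform(0.1, 0.9) for _ in range(2000)]
sample_a = [simulate_genotype(p, rng) for p in freqs]
sample_b = [simulate_genotype(p, rng) for p in freqs]
mixed = [contaminate(a, b) for a, b in zip(sample_a, sample_b)]

f_clean = inbreeding_f(sample_a, freqs)
f_mixed = inbreeding_f(mixed, freqs)
print(f"clean sample f = {f_clean:.3f}")
print(f"contaminated sample f = {f_mixed:.3f}")
```

The clean sample should come out near zero, while the contaminated one should be strongly negative because so many of its calls look heterozygous.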
In order to show you what some of these problems might look like in the data, I have simulated a dataset containing two of these quality assurance problems. I have made samples 5 and 6 near-duplicates, and I have simulated contamination of samples 11 through 23 by sample 10. The following are the results from this simulation. NOTE: I have simulated a rather small dataset here so I could easily exaggerate the results. However, these tests are really meant to be used with LD-pruned whole-genome data, especially if the tests are meant for any purpose other than data quality assurance.
The heat map of PI shows a very high level of relatedness between samples 5 and 6. It also shows elevated relatedness between sample 10 and samples 11 through 23, as well as among samples 11 through 23 themselves. Heat maps of the other spreadsheet outputs corroborate these results.
We also see many quite negative inbreeding coefficients in samples 11 through 23 in the inbreeding output spreadsheet, as would be expected.
In summary, it is a good idea to run the IBD and inbreeding tests (both with LD pruning) over your genotypic data, and if the values for any samples look suspicious, whatever the reason might be, exclude those samples from further analysis. …And that’s my two SNPs.