Within a separate window or while performing genotype association tests, you have the option to calculate a number of basic genotype statistics for each marker, including call rate, minor allele frequency, Hardy-Weinberg Equilibrium (HWE) P-value, Fisher’s Exact Test for HWE P-value, Signed HWE Correlation R, and allele and genotype counts. Also included in the latest version is a global sample test to detect departures from Hardy-Weinberg Equilibrium within a single proband or case, as in a population-based association study.
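As a rough illustration of what these per-marker statistics measure, the following sketch computes call rate, minor allele frequency and a chi-square HWE P-value (1 degree of freedom) for one biallelic marker. The function name and the `"AA"`/`"AB"`/`"BB"`/`None` genotype encoding are assumptions made for this example, not the program's own interface, and the exact-test variant is not shown.

```python
import math

def marker_stats(genotypes):
    """Basic QC statistics for one biallelic marker.

    genotypes: list of "AA", "AB", "BB", or None for a missing call
    (a hypothetical encoding chosen for this sketch).
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes) if genotypes else 0.0
    n_aa, n_ab, n_bb = (called.count(g) for g in ("AA", "AB", "BB"))
    if n == 0:
        return {"call_rate": call_rate, "maf": None, "hwe_p": None,
                "counts": (0, 0, 0)}

    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p                       # frequency of allele B
    maf = min(p, q)

    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p,
            "counts": observed}
```

A marker would then be filtered by comparing these values with whatever thresholds your study uses; a very small `hwe_p`, for instance, flags a departure from Hardy-Weinberg proportions.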
In addition to just calculating genotype statistics, SVS 7 provides single-step data filtering to exclude markers that do not meet specified quality control thresholds for each supported genotype statistic. Additionally, you can activate columns or rows in one spreadsheet based on row labels in another spreadsheet. This makes it possible to filter markers based on genotype statistics of another sample set (e.g. HapMap minor allele frequencies) or custom statistics not automatically generated in the program. Filtering samples based on call rate is also available.
The latest version of Golden Helix PBAT incorporates a novel test that assesses the genotyping quality of individual probands in family-based association studies. Published in PLoS Genetics [Fardo, 2009], these tests are “ideally suited as the final layer of quality control filters in the cleaning process of genome-wide association studies.” You can also assess Mendelian errors, Hardy-Weinberg Equilibrium and call rates per marker.
SVS 7 uses an enhanced version of Eigenstrat-based principal component analysis, an approach originally pioneered by the Broad Institute, to subtract patterns in data caused by population stratification and batch effects. The working premise is that if there is population stratification or a batch effect, the resulting pattern of variation over the samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. The PCA approach in SVS 7 can be used not only to detect these patterns of variation but also to correct for them, thus minimizing or eliminating altogether the influence of population stratification or batch effects. This approach works for both genotypes and numeric values (e.g. copy number log ratios).
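The idea of subtracting a shared pattern of variation can be sketched in a few lines. The example below is a simplified illustration, not the Eigenstrat procedure or the SVS 7 implementation: it centers each marker, finds the leading principal component over samples by power iteration, and removes that component from every marker. The function names, the plain list-of-lists layout (rows are samples, columns are markers) and the iteration count are assumptions made for this example.

```python
import math

def top_component(X, iters=200):
    """Leading sample-space component of X (samples x markers), unit length.

    Uses power iteration on X * X^T; returns None if X is all zeros.
    """
    n, m = len(X), len(X[0])
    v = [math.sin(i + 1.0) for i in range(n)]  # deterministic starting vector
    for _ in range(iters):
        w = [sum(X[i][j] * v[i] for i in range(n)) for j in range(m)]  # X^T v
        u = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]  # X w
        norm = math.sqrt(sum(a * a for a in u))
        if norm == 0:
            return None
        v = [a / norm for a in u]
    return v

def remove_components(X, k=1):
    """Center each marker, then subtract the top k principal components."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    R = [[row[j] - means[j] for j in range(m)] for row in X]
    for _ in range(k):
        v = top_component(R)
        if v is None:
            break
        # Projection of each marker onto the component, then subtract it.
        loadings = [sum(R[i][j] * v[i] for i in range(n)) for j in range(m)]
        R = [[R[i][j] - v[i] * loadings[j] for j in range(m)] for i in range(n)]
    return R
```

Power iteration is used here only to keep the sketch dependency-free; for real data sets a numerical library's eigen or singular value decomposition would be the more practical choice.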
To obtain better results when running certain tests you can quickly filter (prune) correlated markers prior to analysis.
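One simple way to prune correlated markers is a greedy pass that drops any marker whose squared correlation with a recently kept marker exceeds a threshold. The sketch below assumes genotypes coded as 0/1/2 dosages with no missing values; the function names, the default threshold of 0.5 and the window (counted in kept markers rather than genomic distance) are illustrative assumptions, not the program's settings.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no correlation information
    return sxy * sxy / (sxx * syy)

def prune_markers(markers, threshold=0.5, window=50):
    """Return indices of markers to keep.

    markers: list of dosage vectors, one per marker, in genomic order.
    A marker is kept only if its r^2 with each of the last `window`
    kept markers is below `threshold`.
    """
    kept = []
    for idx, dosages in enumerate(markers):
        recent = kept[-window:]
        if all(r_squared(dosages, markers[k]) < threshold for k in recent):
            kept.append(idx)
    return kept
```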
Related individuals wreak havoc on association tests where independence is assumed. Identity by descent and inbreeding coefficient calculations help you control for unknown or cryptic relatedness in your samples.
Identifying outliers in autosome heterozygosity helps detect contaminated DNA samples (and population stratification in some cases).
Several new methods make it easy to verify that a sample’s reported gender is consistent with its inferred gender. These include X chromosome heterozygosity on genotypes, plotting X versus Y intensity values and averaging log ratio values of the X chromosome (especially helpful for identifying gender anomalies).
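The X chromosome heterozygosity check relies on males carrying a single X, so their non-pseudoautosomal X genotypes should almost never be called heterozygous. A minimal sketch of that check follows; the function name, the two-character genotype encoding and both cutoffs are hypothetical values chosen for illustration and would need tuning for a real platform.

```python
def infer_sex_from_x(x_genotypes, male_max_het=0.05, female_min_het=0.15):
    """Infer sex from X chromosome heterozygosity.

    x_genotypes: two-character calls such as "AA" or "AB", None if missing.
    Returns (label, het_rate); the thresholds are illustrative assumptions.
    """
    called = [g for g in x_genotypes if g is not None]
    if not called:
        return "unknown", None
    het_rate = sum(1 for g in called if g[0] != g[1]) / len(called)
    if het_rate <= male_max_het:
        return "male", het_rate
    if het_rate >= female_min_het:
        return "female", het_rate
    return "ambiguous", het_rate
```

Samples whose inferred label disagrees with the reported one, or that fall in the ambiguous band, are the ones worth a closer look with the intensity-based methods.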
Calculating the inter-quartile range (IQR) of a numeric distribution is useful for determining outliers for many quality assurance measurements.
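A common way to turn the IQR into an outlier rule is to flag values that fall more than some multiple of the IQR below the first quartile or above the third. The sketch below assumes the conventional multiplier of 1.5 and Python's "inclusive" quantile method; both are choices made for this example, and the function name is hypothetical.

```python
import statistics

def iqr_outliers(values, multiplier=1.5):
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]
```

Applied to per-sample autosomal heterozygosity rates or call rates, for example, the returned indices identify the samples to review.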
As an extension of quartile summary statistics, this feature lets you identify outliers on multiple dimensions, such as samples whose ethnicity does not match that of your study population when examining two or more principal components.
Derivative log ratio spread (DLRS) is a measurement of point-to-point consistency or noisiness in log ratio (LR) data. It correlates with low call rates and over/under abundance of identified copy number segments. Samples with higher values of DLRS tend to have poor signal-to-noise properties and are good candidates to exclude from analysis.
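One widely used robust formulation of DLRS takes the IQR of the differences between adjacent probes and rescales it to a standard deviation. The sketch below follows that formulation under the assumption of roughly normal, independent probe noise; it is not necessarily the exact formula used by the program, and the function name is hypothetical.

```python
import math
import statistics

def dlrs(log_ratios):
    """Derivative log ratio spread for one sample.

    log_ratios: probe log ratios in genomic order (at least three values).
    """
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    # For normal data the IQR is about 1.349 standard deviations, and the
    # difference of two independent probes has sqrt(2) times the
    # per-probe standard deviation.
    return (q3 - q1) / (1.349 * math.sqrt(2))
```

Because it is built on probe-to-probe differences, this measure is largely insensitive to genuine copy number segments, which shift the level of long runs of probes but affect only the few differences at segment boundaries.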
Detecting large chromosomal aberrations is both a quality assurance step and an analysis step. For example, by averaging log ratios across each autosomal chromosome you can quickly detect cell line artifacts. You may also be able to detect large aberrations that are instrumental in identifying disease-causing loci.
Genomic waves are ubiquitous in copy number data and can cause inaccuracies with any copy number detection algorithm. SVS employs the method of Diskin et al. (2008) to help you both detect and correct for genomic waves.
Percentile-based winsorizing can be used to prevent segmentation algorithms from being driven by outlier values, resulting in a more accurate determination of regions of copy number variation.
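Percentile-based winsorizing simply clamps extreme values to chosen percentile cut points before segmentation. A minimal sketch follows; the 5th and 95th percentile defaults and the function name are assumptions for illustration, not the program's defaults.

```python
import statistics

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values to the given lower and upper percentiles (1 to 99)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]
```

Unlike trimming, this keeps every probe in place, so the positions seen by the segmentation algorithm are unchanged; only the magnitude of the most extreme values is limited.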
To validate the results of an association test, it is beneficial to investigate the individual probe intensities of your top hits to ensure they are not artifacts of poorly called genotypes. Once you have obtained A and B allele probe intensities, you can create an XY scatter plot to assess the validity of genotype calls. If you have access to Affymetrix CEL files, SVS 7 provides an option to output an allele probe intensity spreadsheet during import.
» Download SNP Cluster Plots Script
SVS 7 makes it easy to check concordance between two data sets. This is useful when you want to check the validity of an assay run on the same set of samples multiple times. SNP concordance is also helpful when comparing called and imputed or inferred genotypes, genotype calls between different platforms, stated vs. imputed gender, and more.
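At its simplest, concordance is the fraction of matching calls among the positions where both data sets have a call. The sketch below assumes two aligned lists of calls for the same samples and markers, with `None` for a missing call; the function name and this layout are assumptions for the example.

```python
def concordance(calls_a, calls_b):
    """Fraction of matching calls where both data sets have a call.

    Returns NaN if no position is called in both data sets.
    """
    compared = 0
    matches = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # skip positions missing in either data set
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else float("nan")
```

The same comparison works for any pair of categorical calls, such as stated versus inferred gender per sample.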
Genomic Control, a somewhat older method pioneered by Devlin and Roeder [Devlin and Roeder 1999], rests on the observation that when association tests are confounded by stratification, the distribution of their chi-square statistics is more “spread out” than it should be, so its median is higher than the median of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the bottom line is that the distribution will usually be uniformly spread out by a certain “inflation factor” λ. Genomic Control estimates λ by taking the median of the chi-square statistics from an actual test run over a set of markers from the study in question, and dividing this median by the median of the corresponding (ideal) chi-square distribution. If the result is less than one, the distribution is considered close enough to ideal and λ is taken to be one.
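The calculation described above can be sketched as follows, assuming the test statistics follow a chi-square distribution with 1 degree of freedom under the null (other test types would need a different reference median). The function names are hypothetical. The second function shows the usual follow-up step of dividing each statistic by λ, which is the standard Genomic Control correction rather than something spelled out in the text above.

```python
import statistics

def genomic_control_lambda(chi2_stats):
    """Estimate the inflation factor from 1-df chi-square statistics."""
    # Median of a 1-df chi-square: the square of the 75th percentile
    # of a standard normal (approximately 0.455).
    expected_median = statistics.NormalDist().inv_cdf(0.75) ** 2
    lam = statistics.median(chi2_stats) / expected_median
    return max(lam, 1.0)  # values below one are taken to be one

def adjust_statistics(chi2_stats):
    """Divide each statistic by the estimated inflation factor."""
    lam = genomic_control_lambda(chi2_stats)
    return [x / lam for x in chi2_stats]
```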