In a separate window or while performing genotype association tests, you can calculate a number of basic genotype statistics for each marker, including call rate, minor allele frequency, Hardy-Weinberg Equilibrium (HWE) P-value, Fisher’s Exact Test for HWE P-value, Signed HWE Correlation R, and allele and genotype counts. The latest version also includes a global sample test to detect departures from Hardy-Weinberg Equilibrium within a single proband or case, as in a population-based association study.
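As a rough illustration of what a few of these statistics measure, the following Python sketch computes call rate, minor allele frequency, and a chi-square HWE P-value for one biallelic marker. It is not SVS code: the function name `marker_stats`, the "AA"/"AB"/"BB" coding, and the use of `None` for a missing call are assumptions made for this example, and the P-value uses the 1-degree-of-freedom chi-square test rather than Fisher's Exact Test.

```python
import math
from collections import Counter

def marker_stats(genotypes):
    """Basic statistics for one biallelic marker (illustrative only).

    `genotypes` holds one call per sample: "AA", "AB", "BB", or None if missing.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        raise ValueError("marker has no called genotypes")
    counts = Counter(called)

    call_rate = n / len(genotypes)
    p = (2 * counts["AA"] + counts["AB"]) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    maf = min(p, q)

    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = {"AA": p * p * n, "AB": 2 * p * q * n, "BB": q * q * n}
    chi2 = sum((counts[g] - e) ** 2 / e for g, e in expected.items() if e > 0)

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))

    return {"call_rate": call_rate, "maf": maf, "hwe_chi2": chi2, "hwe_p": hwe_p,
            "genotype_counts": dict(counts)}
```

A marker whose genotype counts match Hardy-Weinberg proportions exactly yields a chi-square of 0 and a P-value of 1; markers with an excess or deficit of heterozygotes yield smaller P-values.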
In addition to just calculating genotype statistics, SVS 7 provides single-step data filtering to exclude markers that do not meet specified quality control thresholds for each supported genotype statistic. Additionally, you can activate columns or rows in one spreadsheet based on row labels in another spreadsheet. This makes it possible to filter markers based on genotype statistics of another sample set (e.g. HapMap minor allele frequencies) or custom statistics not automatically generated in the program. Filtering samples based on call rate is also available.
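The idea behind threshold-based filtering can be sketched in a few lines of Python. This is a hypothetical example, not the SVS interface: the function name `filter_markers`, the dictionary layout, and the default cutoffs are all assumptions chosen for illustration, and appropriate thresholds depend on the study. The optional `allowed_labels` argument mimics activating markers based on labels taken from another data set.

```python
def filter_markers(stats_by_marker, min_call_rate=0.95, min_maf=0.01,
                   min_hwe_p=1e-4, allowed_labels=None):
    """Return the marker names that pass every quality control threshold.

    `stats_by_marker` maps marker name -> dict with "call_rate", "maf", "hwe_p".
    `allowed_labels`, if given, is a set of marker names from another data set;
    markers not in it are excluded.
    All default thresholds are illustrative assumptions.
    """
    kept = []
    for marker, stats in stats_by_marker.items():
        if allowed_labels is not None and marker not in allowed_labels:
            continue
        if stats["call_rate"] < min_call_rate:
            continue
        if stats["maf"] < min_maf:
            continue
        if stats["hwe_p"] < min_hwe_p:
            continue
        kept.append(marker)
    return kept
```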
The latest version of Golden Helix PBAT incorporates a novel test that assesses the genotyping quality of individual probands in family-based association studies. Published in PLoS Genetics [Fardo, 2009], these tests are “ideally suited as the final layer of quality control filters in the cleaning process of genome-wide association studies.” You can also assess Mendelian errors, Hardy-Weinberg Equilibrium, and call rates per marker.
SVS 7 uses an enhanced version of the Eigenstrat-based principal component analysis (PCA) originally pioneered by the Broad Institute to subtract patterns in data caused by population stratification and batch effects. The working premise is that if population stratification or batch effects are present, the resulting pattern of variation over the samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. The PCA approach in SVS 7 can be used not only to detect these patterns of variation but also to correct for them, thus minimizing or eliminating altogether the influence of population stratification or batch effects. This approach works for both genotypes and numeric values (e.g. copy number log ratios).
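The general idea of finding a shared pattern across samples and subtracting it from every marker can be sketched as follows. This is a deliberately simplified pure-Python illustration, not the SVS or Eigenstrat implementation: it removes only the single leading component, finds it by power iteration, and skips the marker normalization that real tools apply. The function name `remove_top_component` and the row-per-sample layout are assumptions.

```python
import math

def remove_top_component(data, iterations=100):
    """Subtract the leading principal component from numeric marker data.

    `data` is a list of sample rows, each a list of marker values.
    Returns (corrected, component), where `component` has one value per sample.
    """
    n, m = len(data), len(data[0])

    # Center each marker (column) so components describe variation, not means.
    means = [sum(row[j] for row in data) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in data]

    # Power iteration on X X^T: converges to the dominant pattern over samples.
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        w = [sum(x[i][j] * v[i] for i in range(n)) for j in range(m)]
        v_new = [sum(x[i][j] * w[j] for j in range(m)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v_new))
        if norm == 0.0:
            return x, None  # no variation to remove
        v = [a / norm for a in v_new]

    # Regress each marker on the component and keep only the residual.
    loadings = [sum(x[i][j] * v[i] for i in range(n)) for j in range(m)]
    corrected = [[x[i][j] - loadings[j] * v[i] for j in range(m)]
                 for i in range(n)]
    return corrected, v
```

In data where two groups of samples differ by a constant offset across many markers, the leading component captures the group membership, and the corrected values no longer show the offset.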
By outputting chi-squared and expected values when performing association tests, you can create Q-Q plots to compare expected vs. observed values. In general, a good Q-Q plot follows y=x except for true positives at the upper right. Large deviations from the expected y=x line most likely signify batch effects, poorly called genotypes, or some other quality issue.
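If you want to build the expected values yourself, one common approach pairs each sorted observed statistic with the matching quantile of a chi-square distribution. The Python sketch below assumes test statistics with 1 degree of freedom and the plotting position (i - 0.5)/n; both are assumptions for this example, as is the function name `qq_points`, and none of it reflects how SVS computes its expected values.

```python
from statistics import NormalDist

def qq_points(observed_chi2):
    """Pair sorted observed chi-square statistics with expected 1-df quantiles.

    Returns a list of (expected, observed) tuples, ready for a scatter plot.
    """
    n = len(observed_chi2)
    std_normal = NormalDist()
    points = []
    for i, obs in enumerate(sorted(observed_chi2), start=1):
        u = (i - 0.5) / n  # plotting position in (0, 1)
        # A 1-df chi-square variable is a squared standard normal, so its
        # u-quantile is the squared normal quantile at (1 + u) / 2.
        expected = std_normal.inv_cdf((1 + u) / 2) ** 2
        points.append((expected, obs))
    return points
```

Plotting the resulting pairs against the line y=x gives the Q-Q plot described above.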
To validate the results of an association test, it is beneficial to investigate the individual probe intensities of your top hits to ensure they are not artifacts of poorly called genotypes. Once you have A and B allele probe intensities, you can create an XY scatter plot to assess the validity of genotype calls. If you have access to Affymetrix CEL files, SVS 7 provides an option to output an allele probe intensity spreadsheet during import.
SVS 7 makes it easy to check concordance between two data sets. This is useful when you want to check the validity of an assay run on the same set of samples multiple times. SNP concordance is also helpful when comparing called and imputed or inferred genotypes, genotype calls between different platforms, stated vs. imputed gender, and more.
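Conceptually, a concordance check counts how often two data sets agree on the calls they have in common. The following Python sketch shows one way to express that; the function name `concordance`, the `(sample, marker)` keys, and the convention of skipping calls missing in either set are assumptions for illustration rather than a description of the SVS feature.

```python
def concordance(calls_a, calls_b):
    """Compare two call sets keyed by (sample, marker).

    Values are genotype strings, or None for a missing call. Only keys present
    and non-missing in both sets are compared.
    Returns (rate, n_compared, discordant_keys); rate is None if nothing overlaps.
    """
    compared = 0
    discordant = []
    for key, call_a in calls_a.items():
        call_b = calls_b.get(key)
        if call_a is None or call_b is None:
            continue
        compared += 1
        if call_a != call_b:
            discordant.append(key)
    if compared == 0:
        return None, 0, discordant
    rate = (compared - len(discordant)) / compared
    return rate, compared, discordant
```

The list of discordant keys is often more useful than the overall rate, since it points to the specific samples or markers that need a closer look.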
Genomic Control, a somewhat older method pioneered by Devlin and Roeder [Devlin and Roeder 1999], notes that the distribution of chi-square statistics from association tests confounded by stratification will be more “spread out” than it should be. As a result, its median will be higher than the median of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the bottom line is that the distribution will usually be uniformly spread out by a certain “inflation factor” λ. Genomic Control measures λ by taking the median of the chi-square statistics from an actual test done over a set of markers from the study in question, and dividing this median by the median of the corresponding (ideal) chi-square distribution. If the result is less than one, the distribution is considered close enough to ideal and λ is taken to be one.
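The calculation of the inflation factor described above can be sketched in Python as follows. The example assumes test statistics with 1 degree of freedom, so the ideal median is that of a 1-df chi-square distribution (about 0.455); other test types would need a different reference median. The function names are hypothetical, and the rescaling helper shows the usual Genomic Control adjustment of dividing each statistic by λ, which goes one step beyond the measurement described in the text.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (about 0.455),
# derived from the standard normal quantile because chi2_1 = Z ** 2.
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2

def genomic_control_lambda(chi2_stats):
    """Inflation factor: observed median over ideal median, floored at 1."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    return max(lam, 1.0)

def apply_genomic_control(chi2_stats):
    """Rescale statistics by lambda (the usual Genomic Control adjustment)."""
    lam = genomic_control_lambda(chi2_stats)
    return [stat / lam for stat in chi2_stats]
```

A set of statistics whose median is twice the ideal median gives λ = 2, and every statistic is halved by the adjustment.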
© 2009 Golden Helix, Inc. All Rights Reserved