Quality Assurance

High-quality data is critical to high-quality results. To ensure your data is of the highest quality, SVS 7 provides a complete set of quality assurance tools that help you not only assess the quality of your data but also remedy any problems.

SNP Analysis Kit

Check out the following webcast as Dr. Christophe Lambert examines a number of effective data prep and QA methods that will help you get more meaningful results without all the headaches.

View Webcast »

Genotype Statistics by Marker and Sample

Within a separate window or while performing genotype association tests, you can calculate a number of basic genotype statistics for each marker, including call rate, minor allele frequency, Hardy-Weinberg Equilibrium (HWE) P-value, Fisher’s Exact Test for HWE P-value, Signed HWE Correlation R, and allele and genotype counts. Also included in the latest version is a global sample test to detect departures from Hardy-Weinberg Equilibrium within a single proband or case, as in a population-based association study.
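As a rough illustration of three of the per-marker statistics listed above, here is a minimal Python sketch. The genotype encoding ("AA", "AB", "BB", None for a missing call) and the function name are assumptions made for this example, and the HWE P-value uses a plain Pearson chi-square test with 1 degree of freedom rather than Fisher's exact test; it is not SVS's implementation.

```python
import math
from collections import Counter


def marker_stats(genotypes):
    """Call rate, minor allele frequency and chi-square HWE P-value for one marker.

    genotypes: list of "AA", "AB", "BB", or None for a missing call
    (a hypothetical encoding chosen for this sketch).
    """
    counts = Counter(g for g in genotypes if g is not None)
    n_aa, n_ab, n_bb = counts["AA"], counts["AB"], counts["BB"]
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        raise ValueError("marker has no called genotypes")

    call_rate = n_called / len(genotypes)
    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)

    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = {
        "AA": n_called * freq_a ** 2,
        "AB": n_called * 2 * freq_a * (1 - freq_a),
        "BB": n_called * (1 - freq_a) ** 2,
    }
    chi2 = sum((counts[g] - e) ** 2 / e for g, e in expected.items() if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p}
```

A marker with only homozygous calls at equal allele frequencies, for example, yields a very small `hwe_p`, which is the kind of departure a QC filter would flag.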

 

 

Genotype Filtering Window

Quality Assurance Filtering

In addition to calculating genotype statistics, SVS 7 provides single-step data filtering to exclude markers that do not meet specified quality control thresholds for each supported genotype statistic. You can also activate columns or rows in one spreadsheet based on row labels in another spreadsheet. This makes it possible to filter markers based on genotype statistics of another sample set (e.g., HapMap minor allele frequencies) or on custom statistics not automatically generated in the program. Filtering samples based on call rate is also available.
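Threshold-based marker filtering of this kind can be sketched in a few lines of Python. The dictionary layout, function names, and default cutoffs below are illustrative assumptions only; appropriate thresholds depend on the study and are not taken from SVS.

```python
def passes_qc(stats, min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-6):
    """Return True if one marker's statistics meet every threshold.

    The default thresholds are placeholders for this sketch, not recommendations.
    """
    return (
        stats["call_rate"] >= min_call_rate
        and stats["maf"] >= min_maf
        and stats["hwe_p"] >= min_hwe_p
    )


def filter_markers(marker_table, **thresholds):
    """Keep the names of markers whose statistics pass all thresholds.

    marker_table: dict mapping marker name -> dict of statistics.
    """
    return [name for name, stats in marker_table.items()
            if passes_qc(stats, **thresholds)]
```

Because the statistics are passed in as a plain table, the same filter works whether they were computed on the study samples or taken from another sample set.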

 

Family-Based QC

The latest version of Golden Helix PBAT incorporates a novel test that assesses the genotyping quality of individual probands in family-based association studies. Published in PLoS Genetics [Fardo, 2009], these tests are “ideally suited as the final layer of quality control filters in the cleaning process of genome-wide association studies.” You can also assess Mendelian errors, Hardy-Weinberg Equilibrium, and call rates per marker.

Comparison of SNP and CNV principal component analysis (PCA) plots.

Batch Effect and Population Stratification Correction

SVS 7 uses an enhanced version of the Eigenstrat-based principal component analysis approach, originally pioneered by the Broad Institute, to subtract patterns in data caused by population stratification and batch effects. The working premise is that if population stratification or batch effects are present, the resulting pattern of variation across samples should manifest within, or influence to a greater or lesser degree, the data of many different markers. The PCA approach in SVS 7 can be used not only to detect these patterns of variation but also to correct for them, thus minimizing or eliminating the influence of population stratification or batch effects. This approach works for both genotypes and numeric values (e.g., copy number log ratios).
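The idea of subtracting a shared pattern of variation can be sketched in plain Python. This is a simplified illustration, not SVS's enhanced method: it finds only the first principal component over samples (by power iteration on the sample-by-sample Gram matrix) and removes each marker's projection onto it. The function names are hypothetical, and a full Eigenstrat-style correction would typically use several components and adjust the phenotype as well.

```python
import math


def center_columns(matrix):
    """Subtract each marker's (column's) mean; rows are samples."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]


def first_component(centered, iterations=200):
    """Unit-length top principal component over samples, by power iteration."""
    n = len(centered)
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j]))
             for j in range(n)] for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            raise ValueError("data has no variation to decompose")
        vec = [x / norm for x in nxt]
    return vec


def correct_top_component(matrix):
    """Return (residuals, component) after removing the top component."""
    centered = center_columns(matrix)
    comp = first_component(centered)
    n_markers = len(centered[0])
    # Projection of each marker column onto the component.
    loadings = [sum(c * row[j] for c, row in zip(comp, centered))
                for j in range(n_markers)]
    residuals = [[x - c * l for x, l in zip(row, loadings)]
                 for c, row in zip(comp, centered)]
    return residuals, comp
```

If two batches differ by a constant offset across all markers, the top component captures that offset and the residuals no longer carry it.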

Identity by descent matrix

LD Pruning

To obtain better results when running certain tests you can quickly filter (prune) correlated markers prior to analysis.
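One common way to prune correlated markers is a greedy pass that drops any marker whose squared correlation with a recently kept marker exceeds a cutoff. The sketch below assumes genotypes coded as allele counts (0, 1, 2) with no missing values; the threshold, window size, and function names are illustrative assumptions rather than SVS's settings.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker is uncorrelated by convention here
    return sxy * sxy / (sxx * syy)


def ld_prune(markers, threshold=0.5, window=50):
    """Greedy LD pruning.

    markers: list of (name, allele_counts) in genomic order.
    A marker is kept only if its r^2 with each of the last `window`
    kept markers is at or below `threshold`.
    """
    kept = []
    for name, counts in markers:
        recent = kept[-window:]
        if all(r_squared(counts, other) <= threshold for _, other in recent):
            kept.append((name, counts))
    return [name for name, _ in kept]
```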

Identity by Descent and Inbreeding Coefficient

Related individuals wreak havoc on association tests where independence is assumed. Identity by descent (right) and inbreeding coefficient calculations help you control for unknown or cryptic relatedness in your samples.
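For the inbreeding coefficient, a widely used method-of-moments estimate compares a sample's observed heterozygosity with the heterozygosity expected from allele frequencies: F = 1 - (observed heterozygous calls) / (expected heterozygous calls). The sketch below assumes allele-count genotypes (0, 1, 2, or None if missing) and known per-marker allele frequencies; it illustrates the general idea and may differ from the estimator SVS uses. Identity by descent estimation is more involved and is not shown.

```python
def inbreeding_coefficient(sample_genotypes, allele_freqs):
    """Method-of-moments inbreeding coefficient for one sample.

    sample_genotypes: per-marker allele counts (0, 1, 2) or None if missing.
    allele_freqs: per-marker frequency of the counted allele.
    """
    observed_het = 0
    expected_het = 0.0
    for genotype, p in zip(sample_genotypes, allele_freqs):
        if genotype is None:
            continue  # skip missing calls
        if genotype == 1:
            observed_het += 1
        expected_het += 2 * p * (1 - p)
    if expected_het == 0:
        raise ValueError("no informative markers")
    return 1 - observed_het / expected_het
```

Values near 0 are expected for outbred samples; strongly positive values indicate excess homozygosity, and strongly negative values indicate excess heterozygosity.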

Autosome Heterozygosity

Identifying outliers in autosome heterozygosity helps detect contaminated DNA samples (and population stratification in some cases).

Gender Misidentification

Several new methods make it easy to verify that a sample’s reported gender is consistent with its inferred gender. These include X chromosome heterozygosity on genotypes, plotting X versus Y intensity values and averaging log ratio values of the X chromosome (especially helpful for identifying gender anomalies).
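The X chromosome heterozygosity check can be sketched as follows: samples with a single X chromosome should show almost no heterozygous calls on non-pseudoautosomal X markers. The allele-count encoding, the 0.05 cutoff, the "male"/"female" labels, and the function names are all assumptions for this illustration, not SVS defaults.

```python
def infer_sex_from_x(x_genotypes, max_male_het=0.05):
    """Infer sex from the heterozygosity rate of X chromosome genotypes.

    x_genotypes: allele counts (0, 1, 2) or None for missing calls.
    Returns "male", "female", or None if no calls are available.
    """
    called = [g for g in x_genotypes if g is not None]
    if not called:
        return None
    het_rate = sum(1 for g in called if g == 1) / len(called)
    return "male" if het_rate <= max_male_het else "female"


def sex_mismatches(samples):
    """Return IDs whose reported sex disagrees with the inferred sex.

    samples: dict mapping sample ID -> (reported_sex, x_genotypes).
    """
    flagged = []
    for sample_id, (reported, x_genotypes) in samples.items():
        inferred = infer_sex_from_x(x_genotypes)
        if inferred is not None and inferred != reported:
            flagged.append(sample_id)
    return flagged
```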

PCA plot displaying outliers

Quartile Summary Statistics

Calculating the inter-quartile range (IQR) of a numeric distribution is useful for determining outliers for many quality assurance measurements.
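A minimal sketch of IQR-based outlier detection, using Python's standard `statistics.quantiles`. The 1.5 multiplier is the conventional Tukey fence and is an assumption here; the function name is hypothetical.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]
```

Applied to per-sample heterozygosity rates or call rates, the returned indices point at the samples worth a closer look.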

Multidimensional Outlier Detection

This feature extends quartile summary statistics to identify outliers in multiple dimensions, such as samples whose ethnicity does not match that of your study population when examining two or more principal components.

Derivative Log Ratio Spread

Derivative log ratio spread (DLRS) is a measurement of point-to-point consistency or noisiness in log ratio (LR) data. It correlates with low call rates and over/under abundance of identified copy number segments. Samples with higher values of DLRS tend to have poor signal-to-noise properties and are good candidates to exclude from analysis.
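One commonly used robust formulation of DLRS takes the differences between adjacent log ratios, measures their inter-quartile range, and rescales it to a standard-deviation-like value by dividing by 1.349 (the IQR of a unit normal) and by the square root of 2 (because each difference combines two probes). The sketch below follows that formulation as an assumption; the source does not state the exact formula SVS uses.

```python
import math
import statistics


def dlrs(log_ratios):
    """Derivative log ratio spread of log ratios ordered by genomic position."""
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    q1, _, q3 = statistics.quantiles(diffs, n=4, method="inclusive")
    return (q3 - q1) / (1.349 * math.sqrt(2))
```

Because it is based on point-to-point differences, the measure is largely insensitive to genuine copy number segments and responds mainly to probe-level noise.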

Chromosomal Aberration Screening

Detecting large chromosomal aberrations is both a quality assurance step and an analysis step. For example, by averaging log ratios across all autosomal chromosomes you can quickly detect cell line artifacts. You may also be able to detect large aberrations that are instrumental in identifying disease-causing loci.

 

Genomic wave on log ratios

Wave Detection and Correction

Genomic waves (left) are ubiquitous in copy number data and can cause inaccuracies with any copy number detection algorithm. SVS employs the method of Diskin et al. (2008) to help you both detect and correct for genomic waves.

Percentile-Based Winsorizing

Percentile-based winsorizing can be used to prevent segmentation algorithms from being driven by outlier values, resulting in a more accurate determination of regions of copy number variation.
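Percentile-based winsorizing clips values that fall below a lower percentile or above an upper percentile to those percentile values, leaving everything in between unchanged. A minimal sketch follows; the default percentiles, the linear-interpolation percentile rule, and the function names are assumptions for illustration.

```python
def percentile(sorted_values, pct):
    """Linear-interpolation percentile (pct in [0, 100]) of pre-sorted data."""
    rank = (len(sorted_values) - 1) * pct / 100
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the [lower_pct, upper_pct] percentile range."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]
```

Unlike trimming, winsorizing keeps every data point in place, so probe positions stay aligned for the segmentation step that follows.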

 

Cluster plots of A and B allele intensities.

Cluster Plots of Probe Intensities

To validate the results of an association test, it is beneficial to investigate the individual probe intensities of your top hits to ensure they are not artifacts of poorly called genotypes. Once you have obtained A and B allele probe intensities, you can create an XY scatter plot to assess the validity of genotype calls. If you have access to Affymetrix CEL files, SVS 7 provides an option to output an allele probe intensity spreadsheet during import.

» Download SNP Cluster Plots Script

 

SNP Concordance

SVS 7 makes it easy to check concordance between two data sets. This is useful when you want to check the validity of an assay run on the same set of samples multiple times. SNP concordance is also helpful when comparing called and imputed or inferred genotypes, genotype calls between different platforms, stated vs. imputed gender, and more.
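At its simplest, concordance is the fraction of genotype calls that agree between two data sets, counted over the positions called in both. The sketch below assumes each data set is a dictionary keyed by (sample, marker) with None for missing calls; this layout and the function name are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Fraction of matching calls among positions called in both data sets.

    calls_a, calls_b: dicts mapping (sample, marker) -> genotype or None.
    Returns None when the data sets share no called positions.
    """
    shared = [
        key for key in calls_a
        if calls_a[key] is not None and calls_b.get(key) is not None
    ]
    if not shared:
        return None
    matches = sum(1 for key in shared if calls_a[key] == calls_b[key])
    return matches / len(shared)
```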

Genomic Control

This somewhat older method, pioneered by Devlin and Roeder [Devlin and Roeder 1999], notes that when association tests are confounded by stratification, the distribution of their chi-square statistics will be more “spread out” than it should be, so its median will be higher than the median of a true chi-square distribution. Several models exist for how much the distribution should be spread out, depending on the test type, but the bottom line is that the distribution will usually be uniformly spread out by a certain “inflation factor” λ. Genomic Control measures λ by taking the median of the chi-square statistics from an actual test run over a set of markers from the study in question and dividing it by the median of the corresponding (ideal) chi-square distribution. If the result is less than one, the distribution is considered close enough to ideal and λ is taken to be one.
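The calculation described above can be sketched in Python for tests with 1 degree of freedom, which is an assumption of this example. The ideal median (about 0.455) is obtained as the square of the 75th-percentile standard normal quantile. The adjustment function, which divides each statistic by λ, reflects the usual Genomic Control correction; the function names are hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def inflation_factor(chi2_stats):
    """Genomic Control inflation factor, floored at 1."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    return max(1.0, lam)


def genomic_control_adjust(chi2_stats):
    """Divide each statistic by the inflation factor."""
    lam = inflation_factor(chi2_stats)
    return [s / lam for s in chi2_stats]
```

For example, a set of statistics whose median is 0.91 gives λ of roughly 2, so every statistic would be halved before P-values are recomputed.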

© 2012 Golden Helix, Inc