Q&A Surrounding Population-Based DNA Variant Analysis

February 17, 2015

Last month, Dr. Bryce Christensen presented Population-Based DNA Variant Analysis via webcast. The webcast reviewed the fundamentals of population-based variant analysis and demonstrated some of the tools available in SVS for analyzing both common and rare variants, such as the SKAT-O method, along with other functions for annotation, visualization, quality control, and statistical analysis of DNA sequence variants.

Here are questions generated by the attendees. Feel free to reach us at info@goldenhelix.com if you have more questions.

Question: How does CMC combine rare and common variants and prioritize variants for follow-up? Do you use the same p-value for both?

Answer: Initially, minor allele frequency thresholds are set in order to assign each variant to a frequency bin. The demonstration used 5 bins: <1%, 1-2.5%, 2.5-5%, etc. A collapsing test is performed in each bin, and a multivariate statistic is used to assess overall significance across all of the bins. CMC only gives gene-level p-values, so you may need to prioritize variants within the gene for follow-up. To do this, you would run a variant-level association test on the variants within the gene to detect which, if any, have strong associations. You can also look up the annotations to determine the population frequency of each variant, as well as other functional data.
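The binning and collapsing step described above can be sketched in a few lines of Python. This is only an illustration, not the SVS implementation: the function names are hypothetical, only the three bin edges named in the answer are used, and genotypes are assumed to be coded as counts of the minor allele. The multivariate test across bins is not shown.

```python
from bisect import bisect_right

# Upper edges of the first three bins named in the answer (as fractions).
# Anything above the last edge lands in one extra bin here; the remaining
# demonstration bins are not listed in the answer, so they are not modeled.
BIN_EDGES = [0.01, 0.025, 0.05]


def minor_allele_frequency(genotypes):
    """MAF from per-sample allele counts (0, 1 or 2)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def collapse_by_bin(variants, edges=BIN_EDGES):
    """Return {bin index: per-sample 0/1 carrier indicator}.

    Assumes each genotype value counts the minor allele, so any
    value above zero marks the sample as a carrier in that bin.
    """
    n_samples = len(variants[0])
    collapsed = {}
    for genotypes in variants:
        bin_index = bisect_right(edges, minor_allele_frequency(genotypes))
        indicator = collapsed.setdefault(bin_index, [0] * n_samples)
        for i, count in enumerate(genotypes):
            if count > 0:
                indicator[i] = 1
    return collapsed


# Toy gene: 100 samples, three variants.
n = 100
v1 = [1 if i == 0 else 0 for i in range(n)]       # MAF 0.005, bin 0
v2 = [1 if i == 1 else 0 for i in range(n)]       # MAF 0.005, bin 0
v3 = [1 if 2 <= i <= 5 else 0 for i in range(n)]  # MAF 0.02, bin 1
collapsed = collapse_by_bin([v1, v2, v3])
```

Each bin ends up with one indicator per sample, and those indicators are what a multivariate test would then compare between cases and controls.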

Question: SKAT-O results are at gene level. How do you prioritize SNPs within genes?

Answer: Typically, you will not have more than a few dozen variants in a gene, and they are typically not going to be carried by very many individuals, so you need to follow up on all of them. At a minimum, look at all the variants contributing to the test and review their functional data and frequency. From there you can determine which ones might have the greatest impact on the protein structure.

Question: How much overlap is there when different rare variant analysis methods are used?

Answer: In some cases, the methods give very similar results, and in others quite different ones. This has to do with the fact that they are not all testing the same hypothesis. The SKAT family of methods is better when you have bi-directional effects. The burden methods are better when you have unidirectional effects.
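A small toy calculation shows why the two families can disagree. This is a sketch under simplifying assumptions (no covariates, flat weights, hypothetical function names), not the code used in SVS: a burden-style statistic sums the per-variant scores before squaring, while a SKAT-style statistic with a linear kernel squares each score before summing.

```python
def variant_scores(variants, phenotype):
    """Per-variant score: sum of genotype * (phenotype - mean phenotype)."""
    mean_y = sum(phenotype) / len(phenotype)
    return [
        sum(g * (y - mean_y) for g, y in zip(genotypes, phenotype))
        for genotypes in variants
    ]


def burden_statistic(scores):
    """Scores are added first, so opposite signs cancel."""
    return sum(scores) ** 2


def kernel_statistic(scores):
    """Scores are squared first, so direction does not matter."""
    return sum(s * s for s in scores)


# Four cases followed by four controls.
phenotype = [1, 1, 1, 1, 0, 0, 0, 0]
risk_a = [1, 1, 0, 0, 0, 0, 0, 0]      # carried by two cases
risk_b = [0, 0, 1, 1, 0, 0, 0, 0]      # carried by two other cases
protective = [0, 0, 0, 0, 1, 1, 0, 0]  # carried by two controls

same_direction = variant_scores([risk_a, risk_b], phenotype)
opposite_direction = variant_scores([risk_a, protective], phenotype)
```

In this toy example the burden statistic is 4.0 when both variants act in the same direction and drops to 0.0 when they act in opposite directions, while the kernel statistic stays at 2.0 in both scenarios.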

Question: LD-pruning for rare variants; are rare variants in LD?

Answer: The filter used in the demonstration first reduced the data to only variants with an observed frequency greater than 1%. With the sample size we used, that frequency means a large number of people carry each variant, so we should be able to detect LD levels pretty well. Of course, at lower frequencies, variants are rarely going to occur together with anything else, making it difficult to calculate an accurate LD statistic.
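To illustrate why LD estimates become unstable at very low frequencies, here is a minimal sketch that computes r-squared as the squared Pearson correlation of genotype counts. The function name and the toy data are assumptions for illustration; SVS may compute LD differently.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # monomorphic variant: r-squared is undefined
    return cov * cov / (var_x * var_y)


n = 100
# Two singletons carried by different people.
singleton_a = [1 if i == 0 else 0 for i in range(n)]
singleton_b = [1 if i == 1 else 0 for i in range(n)]
# A singleton that happens to sit in the same person as singleton_a.
singleton_c = list(singleton_a)

r2_apart = r_squared(singleton_a, singleton_b)
r2_together = r_squared(singleton_a, singleton_c)
```

With singletons, the estimate jumps between almost zero and one depending on a single carrier, which is why an LD statistic at such low frequencies says very little.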

Question: How do you choose between Burden and Kernel tests? What types of study designs are there?

Answer: Burden and Kernel tests are very similar. As discussed in the webcast, Kernel tests like SKAT may have broader application due to their ability to handle variant effects in opposite directions.

Question: How do you evaluate the quality or power of the KBAC, SKAT and SKAT-O, etc… p-values?

Answer: It is very difficult to evaluate power, simply because there are few, if any, “gold standard” examples available. Any power estimates will usually be based on simulated data and will merely compare various methods under particular simulated scenarios.

Question: Is the SKAT-O method suitable for a continuous variable as the phenotype?

Answer: The SVS implementation of SKAT-O only works with binary phenotypes.

Question: Can we use this analysis procedure to analyze GWAS data?

Answer: Collapsing tests are intended for use with sequence data only.

Question: How do the KBAC, SKAT-O, and CMC methods work when analyzing common variant associations? Which one is the best?

Answer: CMC is the one that is really designed to handle common variants the best. KBAC is strictly a rare variant test. With SKAT, we have not had enough experience to really say how it is going to behave with common variants. The default weighting scheme is unlikely to give much weight to more common variation unless you use an alternate weighting scheme.
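As an illustration of how a frequency-based weighting scheme treats common variants, the sketch below evaluates the Beta(1, 25) density that the original SKAT publication proposes as its default weight function. Whether SVS uses exactly this default is an assumption here, and the function name is hypothetical.

```python
from math import gamma


def beta_density_weight(maf, a=1, b=25):
    """Beta(a, b) density evaluated at the minor allele frequency."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * maf ** (a - 1) * (1 - maf) ** (b - 1)


# Weight values for a rare, a low-frequency, and a common variant.
weights = {maf: beta_density_weight(maf) for maf in (0.01, 0.05, 0.20)}
```

Under these parameters the weight falls off steeply as frequency rises, so a variant at 20% contributes very little compared with one at 1%. A flatter choice, such as a=1 and b=1, weights all frequencies equally.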

Question: How useful and representative is extreme phenotype for rare variants discovery to the general population?

Answer: Most of the methods we talked about here today are designed for binary phenotype outcomes. If you are looking at quantitative traits and extreme values, that is an acceptable use case, especially if you convert the trait into more of a dichotomous outcome that compares the really extreme cases against the general population.
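Converting a quantitative trait into a dichotomous outcome can be as simple as the sketch below. The function name and the 10% cutoff are hypothetical choices for illustration; the right threshold depends on the study.

```python
def dichotomize_extremes(values, top_fraction=0.1):
    """Label the highest values as cases (1) and everyone else as controls (0).

    Ties at the cutoff are all labeled as cases, so the number of
    cases can exceed the requested fraction.
    """
    n_cases = max(1, round(len(values) * top_fraction))
    cutoff = sorted(values, reverse=True)[n_cases - 1]
    return [1 if v >= cutoff else 0 for v in values]


# Toy quantitative trait for 20 samples.
trait = [float(v) for v in range(20)]
labels = dichotomize_extremes(trait, top_fraction=0.1)
```

The resulting 0/1 labels can then serve as the binary phenotype that the collapsing tests expect.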

Question: How do these collapsing tests integrate study designs that involve related family members, e.g. trios or larger families with affected and unaffected members?

Answer: The tests shown today all assume population-based study designs. They are not going to account directly for family-based data collection. That being said, if you look around in the literature, there are alternate versions of some of these tests, like SKAT-FAM, that are designed to incorporate some sort of transmission statistic within families. Within SVS, your best option would be to use MM-KBAC, or mixed model KBAC, where you can incorporate a kinship matrix into the analysis. It will then correctly adjust the tests, recognizing that some samples with the same variant happen to be related to each other, although it will not necessarily incorporate any kind of transmission or linkage information.

Question: What is the variance explained by the 5th PCA component in the Exome vs Affy data? If it is very low, would you be concerned about the very different outcomes for the comparison?

Answer: In the Affy data, the blue groups are the African population. That 5th component is primarily separating the different sub-populations within the African super-population. If you were to look at it as a univariate histogram, you would see that the 5th component forms 5 major clusters. If you looked at the exome data as a univariate histogram, it would be more of a smooth curve that is detecting some other difference that is more uniform across all the samples. They are clearly detecting different effects. For the second half of the question: I wasn't really interested in using the components for anything except to find the population structure.

Question: I would like to use my DNA sequence data to study signatures of selection to see if there are SNPs within the growth hormone gene that the animals or the population selects for. And if it is possible, do association.

Answer: SVS can be used for association testing, and can be used to annotate SNPs by genes. However, there are no particular functions for analyzing selection.

Question: Is the 1KG phase 3 data for all chromosomes available in SVS now?

Answer: That is coming soon! We are working on updating a lot of our annotation sources.

Question: Some of the mutations in my DNA sequence data of the goat growth hormone gene have resulted in amino acid changes. Is there a way that I can use SVS to further analyze these variants and/or to do an association study to associate them with some of the economic traits?

Answer: Yes, SVS can be used for association analysis.

Question: Can SVS do collapsing tests on a pathway level? For instance, on groups of genes in a biological pathway rather than single genes. In such a case, what is your control: randomly generated groups of genes?

Answer: No, SVS currently collapses on individual genes only.

Question: How do I create an input file for SVS for the analysis of DNA sequence data from the Illumina sequencing machine?

Answer: The primary supported input is VCF files. Typically, if you have sequence data, you already have VCF files. SVS also takes PED, BED, and flat text files.
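For readers who have not looked inside a VCF file, the sketch below reads a few VCF lines into per-sample genotype counts using only the Python standard library. It is not how SVS imports data; the function names and the sample names S1 to S3 are made up for the example, and all non-reference alleles are counted together.

```python
def genotype_to_count(gt):
    """'0/1' -> 1, '1|1' -> 2, './.' -> None (missing call)."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def read_vcf_genotypes(lines):
    """Return (sample names, list of (chrom, pos, ref, alt, counts))."""
    samples, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow FORMAT
            continue
        gt_index = fields[8].split(":").index("GT")
        counts = [
            genotype_to_count(sample.split(":")[gt_index])
            for sample in fields[9:]
        ]
        records.append((fields[0], int(fields[1]), fields[3], fields[4], counts))
    return samples, records


example = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2", "S3"]),
    "\t".join(["1", "12345", ".", "A", "G", "50", "PASS", ".",
               "GT:DP", "0/1:20", "1|1:18", "./.:0"]),
]
samples, records = read_vcf_genotypes(example)
```

For real data, a dedicated VCF parser is a safer choice, since this sketch ignores filters, multi-allelic sites, and compressed files.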
