As I’ve mentioned in previous blog posts, one of the great aspects of our scientific community is the sharing of public data. With a mission of providing powerful and accurate tools to researchers, we at Golden Helix especially appreciate the value of having rich and extensive public data to test and calibrate those tools. Public data allow us to do everything from testing the limits of our spreadsheet interface by importing whole genome samples from the 1000 Genomes Project, to providing real-world data for incoming users to try full-feature examples of analysis workflows.
In this blog post we will examine the richness of Complete Genomics’ public samples available from their website. I will lead you through a series of filtering routines to reduce the search space from millions of variants to something more manageable (2,500 variants), all while discussing interesting characteristics of the dataset, perhaps giving you ideas for exploring your next sequence project. We will:
- Discuss filtering techniques based on quality assurance best practices.
- Filter out “common” variants based on population frequency.
- Examine the remaining variants’ alternate allele frequency to assess the bias towards SNVs in public annotations.
- Perform the KBAC method to determine population signals in the data.
- Check out an area of interest in the Genome Browser.
Using Data from Complete Genomics
With the recent release of our SNP & Variation Suite version 7.5, we’ve found the public data provided by Complete Genomics in their recent sequencing of 69 individuals to be perfect for putting our tools through the wringer. Complete Genomics is a sequencing service provider, with an explicit strategy of exclusively specializing in whole genome human DNA resequencing. As you become familiar with the outputs of their secondary analysis pipeline (which you receive as the deliverable for each sample), you may come to appreciate how this narrowed focus results in variant calls and other outputs far ahead of what’s being produced by most pipelines out there. While this gap may be explained by multiple reasons – not the least of which is the tight coupling of the analytics to the chemistry, along with properties of their proprietary sequencer as well as a heavy focus on bioinformatics validated by empirical experiments – it’s safe to say that Complete Genomics’ public data can really test tertiary analysis tools.
For this reason, I’ve chosen to use a subset of the variant data for those 69 public domain samples provided by Complete Genomics in order to demonstrate the richness of variant data while discussing the current best practices for filtering data to perform rare variant analysis as well as association-method techniques for assessing rare variant burden in your sample groups. For the sake of brevity and to focus our results, we’ll just look at chromosome 19 of these 69 samples (download the dataset from the first link on our variant classification tutorial page).
To dig in deeper, you can watch a demonstration of this workflow in our latest webcast event. To follow along at your own pace, try out our latest tutorial: Variant Classification and Visualization Tutorial, which describes how to perform the demonstrated analyses step by step.
Filtering for Quality Assurance
Filtering can be thought of both as a quality assurance process and as analysis in its own right. From a quality perspective, you may want to ensure that variants are within the regions targeted by a selection method, such as amplicon libraries or an exome capture kit. You may also want to distrust variants in genomic regions that are potentially vulnerable to poor alignment, such as segmental duplications, short tandem repeats, alignment gaps and common CNV regions. These types of quality filters will be specific to the sample preparation and selection strategy.
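The region-based quality filters above amount to simple interval arithmetic. Here is a minimal sketch (not SVS code, and with made-up region and variant data) of keeping only variants that fall inside the targeted regions and outside known problem regions:

```python
# Hypothetical sketch of region-based quality filtering.
# Regions are half-open (start, end) intervals; variants are simple
# (position, ref, alt) tuples. Real tools use interval trees and
# chromosome-aware coordinates.

def in_any_region(pos, regions):
    """True if pos falls inside any (start, end) half-open interval."""
    return any(start <= pos < end for start, end in regions)

def quality_filter(variants, targets, problem_regions):
    """Keep variants on-target and outside problem regions."""
    return [v for v in variants
            if in_any_region(v[0], targets)
            and not in_any_region(v[0], problem_regions)]

targets = [(100, 200), (500, 800)]      # e.g., exome capture intervals
problem = [(550, 600)]                  # e.g., a short tandem repeat
variants = [(150, "A", "G"), (560, "C", "T"),
            (700, "G", "A"), (900, "T", "C")]

kept = quality_filter(variants, targets, problem)
# 150 is on-target; 560 overlaps the repeat; 900 is off-target entirely
```

In practice the target and problem region lists come from the capture kit manifest and annotation tracks such as segmental duplications.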
What about filtering on the depth or quality of reads used to make variant calls? Ideally, the place for this type of short-read, context-aware filtering is at the variant caller level. When a variant caller does not have enough confidence to declare a genotype, it should withhold the call or, better yet, explicitly report the genotype as a no-call. In this respect, Complete Genomics takes great care in the variant file to provide a classification for every position in a chromosome. When not enough reads are mapped to a region, a no-call interval will be the output. While the majority of bases are expected to match the reference, this extra definition allows SVS to base every genotype in the variant matrix on the sample’s sequencing reads. In contrast, the matrix of variants imported from a set of VCF files will contain “holes” where a sample does not list a variant present in another sample. Because of these holes, we do not know whether the sample with the missing variant has enough sequencing data at that variant site to support the reference allele genotype (“wild” type) or not (“missing” genotype).
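The distinction between a hole, a reference call, and a missing genotype can be made concrete with a small sketch. The data structures here are hypothetical (not the Complete Genomics file format or SVS internals), but the logic is the same: with per-sample callable intervals, every hole can be resolved into an explicit reference or missing genotype:

```python
# Sketch of resolving VCF-style "holes", with made-up data structures.
# Genotypes are coded as alt-allele counts (0 = reference, 1/2 = variant)
# and -1 for missing.

def fill_matrix(sites, sample_calls, callable_regions):
    """sites: variant positions across all samples.
    sample_calls: {sample: {pos: genotype_code}} -- only variant sites.
    callable_regions: {sample: [(start, end)]} -- fully-called spans."""
    matrix = {}
    for sample, calls in sample_calls.items():
        row = []
        for pos in sites:
            if pos in calls:
                row.append(calls[pos])               # called variant
            elif any(s <= pos < e
                     for s, e in callable_regions[sample]):
                row.append(0)                        # callable: reference
            else:
                row.append(-1)                       # no-call: missing
        matrix[sample] = row
    return matrix

sites = [100, 250, 400]
calls = {"S1": {100: 1}, "S2": {250: 2}}
regions = {"S1": [(0, 300)], "S2": [(0, 500)]}
m = fill_matrix(sites, calls, regions)
# S1: variant at 100, reference at 250, missing at 400 (outside its span)
```

A plain VCF import has no `callable_regions` equivalent, which is exactly why those holes stay ambiguous.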
But back to our data from Complete Genomics: our genotypes are called with high confidence and holes in the variant matrix are filled in properly with wild type or missing genotypes. Moving on to analysis via filtering!
Filtering as Analysis
Let’s consider what would happen if we applied filters based on public annotations of “common” variants to exclude them from analysis. First, if you look at the manifest of samples for these 69 genomes, you’ll notice they all are reference individuals from the HapMap or 1000 Genomes Project. So in theory, we would expect the majority of variants with high allele frequency in this set of 69 samples to also be in the public data annotations for “common” variants.
In fact, given that these individuals were some of the very biological sources used to define these public annotations, any prevalent variant not filtered by these annotations is most likely due to a difference in assay platforms. The Complete Genomics platform and analysis pipeline may be able to make calls on variants that the 1000 Genomes Project did not. This difference could be attributed to the way Complete Genomics does a de novo assembly around each variant site, with its variant caller making a haplotype call for each diploid chromosome separately. Or it may be due to the technical differences between Complete Genomics’ sequencer and the Illumina and SOLiD machines used by the 1000 Genomes Project. Or it could simply be false positives on Complete Genomics’ part.
Either way, it’s definitely interesting, so let’s take a look at what we find:
After filtering out the dbSNP 132 common variants and removing variants with an alternate allele frequency greater than .01 (i.e. “common”), 225,000 variants remain. Here is an alternate allele frequency histogram of the remaining chromosome 19 markers with a call rate > 90% (about 198k of the 225k).
As you can see, the remaining variants are predominantly rare, as we would expect. In fact, of those 198K with a call rate > 90%, 100k are singleton variants (only present in a single sample).
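The per-variant statistics behind that histogram are straightforward to compute. Here is a minimal sketch (illustrative, not SVS code) of call rate, alternate allele frequency, and singleton detection from a genotype row coded as alt-allele counts:

```python
# Sketch of per-variant statistics from one row of a genotype matrix.
# Codes: 0 = reference, 1 = heterozygous, 2 = homozygous alt, -1 = missing.
# For diploid samples, each called genotype contributes two alleles.

def variant_stats(genotypes):
    """Return (call_rate, alt_allele_frequency, carrier_count)."""
    called = [g for g in genotypes if g >= 0]
    call_rate = len(called) / len(genotypes)
    aaf = sum(called) / (2 * len(called)) if called else 0.0
    carriers = sum(1 for g in called if g > 0)
    return call_rate, aaf, carriers

# Ten samples: one heterozygous carrier, one missing genotype
g = [0, 0, 1, -1, 0, 0, 0, 0, 0, 0]
call_rate, aaf, carriers = variant_stats(g)
is_singleton = carriers == 1
# call_rate = 0.9; aaf = 1/18; a singleton variant
```

Applying `call_rate > 0.9` and `carriers == 1` across all rows is what yields counts like the 198k called markers and 100k singletons above.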
So what’s in those bins of variants with higher frequency? Well, of the 14,000 variants with an AAF > 20%, doing a more stringent filter on the whole of dbSNP 132 knocks it down to just under 5,000. And of those 5K, only a measly 29 are Single Nucleotide Variants (SNVs). The rest are a combination of small insertions, deletions, and block substitutions (more than one consecutive allele replaced by one or more consecutive alleles). These results indicate that even our well-studied reference samples have not had their indels and substitutions as thoroughly cataloged as their SNVs. This conclusion makes sense given the legacy of these samples first being studied on SNP microarrays and then by the 1000 Genomes Project, which is still actively developing an InDel pipeline they are happy with.
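The four categories just mentioned can be told apart from the reference and alternate sequences alone. A rough sketch (illustrative only; here an empty string stands for the absent side of an insertion or deletion):

```python
# Rough variant-type classification from ref/alt sequences.
# ref = replaced reference bases, alt = replacement bases; '' means none.

def classify(ref, alt):
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if ref and not alt:
        return "deletion"
    if alt and not ref:
        return "insertion"
    return "block substitution"  # >= 1 bases replaced by >= 1 bases

# classify("A", "G")    -> "SNV"
# classify("AC", "")    -> "deletion"
# classify("", "T")     -> "insertion"
# classify("AT", "GC")  -> "block substitution"
```

Note that VCF uses a different convention (an anchor base on indels), so a real classifier would normalize representations first.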
Filtering (as shown above and spelled out in detail in our tutorial) is common to most sequence analysis workflows. With studies comprising just a few sets of samples, filtering based on the inheritance or presence of variants between family members and their affection status may be the only remaining steps (more on this workflow in future blog posts).
But when more than a handful of samples are available, the next step may be to statistically detect variants or regions of interest based on sample phenotypes. Unlike the SNP-based microarray analysis strategies, we don’t want to rely on common Single Nucleotide Variants (or Single Nucleotide Polymorphisms when described in the context of a population) to have a genome-wide significant correlation with our phenotype. Instead, when looking at potentially rare variants, we would expect the mere presence of the variants collectively within a region of interest to be indicative of potential biological significance.
We have mentioned collapsing methods before, including CAST and the Combined Multivariate and Collapsing (CMC) method. In SVS 7.5, we added the Kernel-Based Adaptive Cluster (KBAC) method and expanded the original specification of CMC and KBAC to utilize a regression framework.
The CMC method is designed to test multiple bins of variants, such as rare variants and common variants, in a multivariate framework. This methodology allows potentially fewer but functionally important rare variants to be considered alongside a possible correction for common variants in the final test statistic. On the other hand, KBAC is designed exclusively to test the potential correlation of rare variants in a region with a phenotype.
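To make the collapsing idea concrete, here is a toy CAST-style sketch (deliberately simplified; this is not the KBAC or CMC implementation in SVS): collapse a gene’s rare variants into a per-sample carrier indicator, then ask via permutation whether carriers are over-represented in one phenotype group. All data are invented for illustration:

```python
import random

# Toy collapsing burden test: does one phenotype group carry more
# rare variants in a gene than chance would suggest?

def carrier_indicator(gene_genotypes):
    """gene_genotypes: per-sample lists of rare-variant allele counts
    within one gene. A sample is a carrier if it has any rare variant."""
    return [1 if any(g > 0 for g in sample) else 0
            for sample in gene_genotypes]

def permutation_pvalue(carriers, phenotype, n_perm=10000, seed=0):
    """One-sided p-value for excess carriers among phenotype == 1."""
    rng = random.Random(seed)
    observed = sum(c for c, p in zip(carriers, phenotype) if p == 1)
    at_least = 0
    perm = list(phenotype)
    for _ in range(n_perm):
        rng.shuffle(perm)  # break genotype/phenotype association
        if sum(c for c, p in zip(carriers, perm) if p == 1) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)

# 8 samples, 2 rare variants in a gene; all 3 carriers are in group 1
genos = [[1, 0], [0, 1], [1, 1], [0, 0],
         [0, 0], [0, 0], [0, 0], [0, 0]]
pheno = [1, 1, 1, 1, 0, 0, 0, 0]  # e.g., European = 1, Non-European = 0
p = permutation_pvalue(carrier_indicator(genos), pheno)
```

KBAC goes further by weighting multi-variant genotype patterns adaptively rather than using a flat carrier indicator, but the collapse-then-test structure is the same.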
The filtering step we left off with (above) had reduced our variant set to 225,000. Although KBAC can use any source of genomic annotations that define regions for collapsing, we are going to use the RefSeq gene annotations. To focus on variations of potential functional significance, let’s also remove the variants that can be classified using our Variant Classification tool as “Synonymous”. This in essence removes variants that, although they represent a nucleotide change in the coding region of a transcript, result in the same amino acid in the transcribed protein product (due to the fact that any given amino acid can be produced by several different codon sequences).
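The synonymous test itself comes down to the genetic code: a coding SNV is synonymous when the altered codon still encodes the same amino acid. A tiny illustration (only a handful of codons are included here; a real classifier uses the full 64-codon table plus transcript coordinates and strand):

```python
# Minimal synonymous/non-synonymous check on a single codon.
# CODON_TABLE is a small excerpt of the standard genetic code.

CODON_TABLE = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "ATG": "Met", "AAA": "Lys", "AAG": "Lys", "GAA": "Glu",
}

def is_synonymous(codon, pos, alt_base):
    """pos: 0-based position within the codon; alt_base: substituted base."""
    mutated = codon[:pos] + alt_base + codon[pos + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]

# CTT -> CTC (third base T->C) still codes Leu: synonymous
# AAA -> GAA (first base A->G) changes Lys to Glu: non-synonymous
```

Filtering out the variants where this test returns true is what removes the “silent” coding changes while keeping the potentially functional ones.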
Now we are left with a set of 2,500 rare variants, so KBAC would be the appropriate method to explore any potential signal in this public data. But what could we use as a phenotype for this exploration? Well, any diverse sample set has a natural grouping based on ethnic boundaries. In this case, the largest single ethnic group would be Europeans from the CEU and CEPH populations, so let’s use the binary phenotype of European versus Non-European.
What would we expect to find when testing for a bias in rare variant burden between Europeans and Non-Europeans? Well, certainly the “first mover” advantage European genomes had in defining the reference genome will play a part. An average European will have 3-4 million variants relative to the reference, compared to the 7-9 million of an ethnic African. But these 69 genomes contain a number of admixed populations such as Puerto Ricans and Mexicans as well. So let’s see what we find!
It looks like out of our thousand or so genes tested, about a dozen have fairly significant P-values. We investigate the DOT1L gene in our tutorial, but as I searched and navigated to each of these genes in our genome browser, I found that SIRT6 exemplifies the story behind this association.
Looking at those variant maps, you can immediately see that even the unfiltered variants have a strong correlation with the European sample group (the grouping along the Y axis is blue for Non-Europeans and green for Europeans). Following the links for SIRT6 in our data console to its NCBI entry, it seems the gene is involved in pathways associated with DNA repair under oxidative stress and other conditions. You could imagine posing hypotheses about the different behavior of this gene between these groups, potentially using RNAseq to investigate gene expression differentiation. This is just one example of the hypothesis generation and experiment planning you can explore from your DNA resequencing analyses.
Of course, when looking at rare conditions in small sample sets, the goal may be to find those small number of candidate causal variants. In this case, a more directed filtering on sets of variants in your related individuals can be the most powerful approach. Whether excluding variants in closely-related but unaffected individuals, or looking for variants inherited by specific genetic models, this approach can vastly reduce the candidate variants to a size that can be inspected by hand with the help of a tool like variant classification to add some prioritization. But more on that workflow in the future.
With this demonstration of analyzing Complete Genomics public data, I hope I’ve conveyed a sense of wonder and excitement in showing the potential of DNA variant data to provide insightful stories of biology. With the right tools and support, more discoveries and more questions are just waiting to be found.
And that’s my two SNPs….