I recently curated the latest population frequency catalog from the 1000 Genomes Project onto our annotation servers, and I had very high hopes for this track. First of all, I applaud 1000 Genomes for the amount of effort they have put into providing the community with the largest set of high-quality whole genome controls available.
My high hopes are well justified.
After all, the 1000 Genomes Phase 1 project completed at the end of 2010, and they have released their catalog of computed variants and corresponding population frequencies at least five times since.
In May 2011, they announced an interim release based only on the low coverage whole genomes. This release was widely used, and one we also curated. Then in October 2011, their official first release was announced – an integrated call set that incorporated their exome data. Following that, version1 was released and re-released three times throughout November and December 2011. In 2012 we saw version2 in February, and finally version3 was released in April.
But as it turns out, simply using the 1000 Genomes Phase1 Variant Set v3 as your population filter will fail to filter out some common and well-validated variants.
Before jumping in, I have to mention that simply stating in a paper or talk “we filtered by 1000 Genomes data” is completely, entirely unreproducible. Not only does that fail to specify a version of the Phase1 dataset, it doesn’t even specify whether you used the Phase 1 release or one of the three preceding pilot datasets. It also doesn’t indicate which, if any, of the sub-population frequencies were used.
For this reason, Golden Helix names all of its tracks with specific source and version information and recommends citing them with those details.
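To make the point concrete, here is a minimal sketch of what a reproducible population-frequency filter looks like: the catalog is loaded from a specifically named and versioned source, and that exact name is what gets cited. The catalog string, file format, and 5% cutoff are hypothetical placeholders, not Golden Helix or 1000 Genomes conventions.

```python
# Hypothetical, versioned catalog name -- cite this exact string, not
# "we filtered by 1000 Genomes data".
CATALOG_NAME = "1kG Phase1 v3 (2012-04) global allele frequencies"

def load_catalog(lines):
    """Parse whitespace-delimited 'chrom pos ref alt af' rows into a
    lookup keyed by site and alternate allele."""
    catalog = {}
    for line in lines:
        chrom, pos, ref, alt, af = line.split()
        catalog[(chrom, int(pos), ref, alt)] = float(af)
    return catalog

def is_common(variant, catalog, max_af=0.05):
    """True if the variant is in the catalog at or above the cutoff;
    variants absent from the catalog are treated as frequency 0."""
    return catalog.get(variant, 0.0) >= max_af

# Toy data: one common variant (32%), one rare variant (1%).
catalog = load_catalog(["1 10000 A G 0.32", "1 10500 C T 0.01"])
candidates = [("1", 10000, "A", "G"), ("1", 10500, "C", "T"), ("2", 500, "G", "A")]
rare = [v for v in candidates if not is_common(v, catalog)]
print(rare)  # the rare and absent variants survive the filter
```

The point is less the filter itself than the provenance: swapping `CATALOG_NAME` for a different release silently changes which variants survive, which is exactly why the version belongs in the methods section.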
But why did the 1000 Genomes Project have so many releases of the same source data?
Well, they had very legitimate reasons.
So Why All These Versions?
In their recent publication in Genome Medicine, Gholson Lyon and Kai Wang discuss the challenges of finding causal variants in a clinical setting. While the challenges range from incomplete phenotyping to lack of decisive functional validation of putative variants, fundamental to any strategy of interpreting a list of variants is knowing whether each variant is novel to the individual or family, or how common it is in the population background.
The paper points out examples where variants were published partly on the notion that they were novel and thus likely to be impactful, only later to be discovered in the NHLBI-ESP project set of controls.
[This] underscores the importance of public databases on allele frequencies from large collections of samples, such as the 1000 Genomes Project and the NHLBI-ESP projects, in helping researchers and clinicians decide on the clinical relevance of specific variants in personal genomes.
So it makes sense that the 1000 Genomes Project revisits its collection of 1,092 samples, with low-coverage whole genomes and high-coverage exomes, to build up-to-date variant call sets that incorporate its latest variant-calling algorithms and quality filters.
Bigger and more accurate sets of controls will result in higher quality research by the community. Especially when the novelty of variants is a key factor in classification as an important functional variant, we want these controls to be as comprehensive as possible.
It is not only the size of your control database that matters, but also the choices you make in calling variants and balancing Type I and Type II errors in the QA filters of those calls.
What It Takes to Build a Variant List
While we probably haven’t seen the last bid to make a better variant calling algorithm, GATK has become the de facto industry best-practice choice. It has no rival in the task of simultaneously considering a thousand samples’ sequence alignments while making a consensus genotype call and associated likelihood score.
Variant callers rarely suffer from a lack of sensitivity: if mismatches line up in even a few reads, they can be interpreted as a heterozygous or homozygous variant at that locus.
The real challenge to a variant caller is dealing with all the ways in which errors (in sequence reads, repeated regions in the reference that confound mapping, small insertion/deletions that disrupt local alignments, and variability in coverage and capture technologies) can result in false-positive variants.
So it’s not surprising that most of the effort the 1000 Genomes Project describes in the read-mes distinguishing each Phase1 variant set from the previous one went into systematically improving variant calling and post-calling filtering to remove these false positives.
In fact, in the case of InDels, they report that through orthogonal validation of their InDel calls and comparison to Complete Genomics public genomes of some shared samples, they discovered a number of systematic errors in the pipeline that produced a significant number of false positives:
This is an updated set from the version 2 (February 2012) of the 20110521 release.
~ 2.38M problematic indels have [been] identified and removed.
See here for more information.
Shaving with Too Sharp a Blade
So when testing this latest v3 variant track from 1000 Genomes, I decided to try filtering down the Complete Genomics 69 sample set to see how it performed.
Complete Genomics not only uses a completely different sequencing technology, it has also developed custom alignment and variant-calling algorithms specialized to that technology.
So I expected that filtering by 1000 Genomes would knock out most Single Nucleotide Variants (SNVs) and many small insertions and deletions, but leave behind more complex variants that can legitimately be described in multiple ways, as well as variants that favor one of the two technologies involved.
Instead, I was struck by this:
This is just one spot among many where 1000 Genomes v3 didn’t list what look like common variants in the Complete Genomics public genomes and their own earlier variant call sets.
How pervasive is this?
Well, there are definitely new variants added to the v3 call set. But the May 2011 version did not yet include InDels. Of the 2.7 million variants unique to version3, 577,895 are insertions, 865,619 are deletions, and 1,301,605 are SNVs. So for SNVs alone, roughly a million were added and roughly a million removed since the earlier release. Those 1.1 million variants removed since v1 may have cut too deep.
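The comparison behind those numbers can be sketched as a simple set difference between releases, classifying each variant by comparing reference and alternate allele lengths. The toy call sets below are illustrative stand-ins, not real 1000 Genomes data, and note the caveat that exact (chrom, pos, ref, alt) matching will miss equivalent alternate representations of the same indel.

```python
def classify(ref, alt):
    """Classify a variant as SNV, insertion, or deletion by allele length."""
    if len(ref) == len(alt) == 1:
        return "SNV"
    return "insertion" if len(alt) > len(ref) else "deletion"

def diff_releases(old, new):
    """Count variants present in `new` but absent from `old`, by class.
    Variants are keyed as (chrom, pos, ref, alt) tuples."""
    counts = {"SNV": 0, "insertion": 0, "deletion": 0}
    for chrom, pos, ref, alt in new - old:
        counts[classify(ref, alt)] += 1
    return counts

# Toy releases: v3 drops one insertion from v1 and adds an SNV and a deletion.
v1 = {("1", 100, "A", "G"), ("1", 200, "C", "CT")}
v3 = {("1", 100, "A", "G"), ("1", 300, "T", "C"), ("1", 400, "GA", "G")}
print(diff_releases(v1, v3))  # {'SNV': 1, 'insertion': 0, 'deletion': 1}
```

Running the same function with the arguments swapped counts the removed variants, which is how the added-versus-removed tallies above line up against each other.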
But here is another important thing to keep in mind. dbSNP, the de facto catalog of all known variants, used the May 2011 interim variant set shown above to set the global allele frequency data for the variants in its catalog. While I’m sure in time they will update their attribute data with the latest frequencies, the pervasive use of dbSNP surely means people will assume the frequencies they see attached to a SNP are the most trustworthy available.
Still Waiting for the One Track to Rule Them All…
I wish I could recommend a single population variant track to serve as the sole source of population variant frequency data for your study.
Like most things in bioinformatics, however, the answer is that the best tool for the job varies by the job.
For exome data, the NHLBI ESP variant set is invaluable. For whole genomes, a combination of this latest 1000 Genomes variant set and the CGI 69 samples variant set is probably a good start.
Of course, for working with small sample sets, augmenting population frequencies with control individuals run on the same platform and secondary analysis pipeline helps to dramatically reduce systematic false-positives from your variant list.
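That two-layer strategy can be sketched as a pair of checks: a candidate survives only if it is rare in the public catalog and never seen in controls run on the same platform and pipeline. The data and the 1% cutoff below are hypothetical placeholders for illustration.

```python
def passes_filters(variant, catalog_af, control_sites, max_af=0.01):
    """Keep a candidate only if it is rare in the public catalog AND
    absent from platform-matched controls; systematic artifacts tend
    to recur in those controls even when public catalogs miss them."""
    if catalog_af.get(variant, 0.0) >= max_af:
        return False  # common in the population background
    if variant in control_sites:
        return False  # likely a platform/pipeline artifact
    return True

# Toy data: one common variant, one recurrent in-house artifact,
# and one genuinely rare candidate.
catalog_af = {("1", 100, "A", "G"): 0.20}
control_sites = {("1", 250, "G", "T")}
candidates = [("1", 100, "A", "G"), ("1", 250, "G", "T"), ("1", 900, "C", "A")]
kept = [v for v in candidates if passes_filters(v, catalog_af, control_sites)]
print(kept)  # only the rare, artifact-free candidate survives
```

The control-site check is what the public catalogs cannot give you: it is specific to your sequencing chemistry, capture kit, and secondary analysis pipeline.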
So while I’m pleased to see the continuous effort to build the one true variant frequency catalog, I’m afraid we aren’t quite there yet.