I’m a believer in the signal. Whole genomes and exomes have lots of signal. Man, is it cool to look at a pile-up and see a mutation as clear as day that you arrived at after filtering through hundreds of thousands or even millions of candidates.
When these signals sit right in the genomic “sweet spot” of mappable regions with high coverage, you don’t need fancy heuristics or statistics to tell you what the genotype is of the individual you’re looking at. In fact, it gives us the confidence to think that at the end of the day, we should be able to make accurate variant calls, and once done, throw away all these huge files of reads and their alignments, and qualities and alternate alignments and yadda yadda yadda (yes I’m talking BAM files).
But we can’t.
Thankfully, many variants of importance do fall in the genomic sweet spot, but there are others, also of potential importance, where the signal is confounded.
Confounded by what? It’s tempting to say the signal is confounded by noise. And in the analog world, a signal with too much noise is a lost cause. Despite what nearly every detective show in the world tells us, you can’t take a grainy image and say “enhance” and get a high-definition rendering of the suspect.
But thankfully, the world of genomics and Next Generation Sequencing is not an analog world. And there are very discrete and tractable reasons why some loci and types of variants cause us so many problems.
And for the most part, we understand these problems!
Let me enumerate the biggest ones:
- Variants being called individually and not in a haplotype aware manner.
- GC rich regions or regions that produce PCR or sequencing artifacts
- Regions of the genome with low mappability or high homology to other regions either large or small.
- Poor homology between the sample’s genome and the reference. For common reasons such as:
- Structural variants like inversions or translocations in the sample.
- Sites with tandem repeats or mobile element insertions.
- Sites where the reference sequence is unable to capture the allelic diversity of the population and the sample (like the MHC region).
Not to mention, when looking at a sampling space so large, very rare events from a statistical perspective will occur. So if you have a 30x genome, you should expect a heterozygote on average to be covered by 15 reads to support the mutant allele. However, there will be loci where a heterozygote legitimately exists but it gets sampled only once or twice. (The chances work out to expecting to see this 1.74 times when calling 4 million variants). Far from a convincing number to distinguish it from a sequencing error.
These issues are not new, and we have even relegated their effect to a minority of variants. In fact, it seems like for years the assumption has been that we are just a small incremental step away from having variant calling licked.
I remember reading a blog post in mid-2011, by the Chief Science Officer at Complete Genomics, pitching that we should be throwing away our read alignment data in most cases. Variant files with quality scores should be good enough. Otherwise go resequence the sample!
But in reality – and I watch this stuff closely – over the last three years we have made very little progress towards handling these systemic issues.
And that’s pretty astounding.
One Cell Line to Rule Them All
Since the products we create at Golden Helix for doing variant analysis and interpretation start after variants have been called, I have become intimately aware of the state of data our users receive from their upstream sequencing provider.
Recently, I decided to do a more thorough comparison of the same samples’ variant calls from a representative set of three sources: Complete Genomes, the Illumina Genome Network, and a typical Core Lab’s pipeline (hosted by the 1000 Genomes folks at the Broad Institute).
By the way, between these comparisons and reviewing the state of algorithms for alignment and variant calling in general, I was easily able to fill a three hour short course at last week’s X-Gen Conference in San Diego. If you were not able to make it, you have two great options. 1) Sign up to attend a 90-minute webinar presentation this Wednesday, which will include much of the same content as the short-course or 2) Sign up for the short course I’ll be giving later this year at NGX in Rhode Island.
Luckily for my purposes, a specific trio from the CEPH extended pedigree (Utah pedigree 1463) has become the de facto standard for benchmarking sequencing platforms and secondary analysis pipelines.
The NA12878 daughter and parents trio was included in the HapMap project, the 1000 Genomes Project and is available to any lab from Coriell as a cell line (for ~$85 a sample).
In fact, NA12878 is easily the most sequenced genome in the world!
Complete Genomics’ data is available on their website, Illumina provided me with a hard drive containing standard IGN deliverables for the trio, and with a little digging, I found a comparable genome at 60x coverage produced by a pipeline that matches the Broad’s GATK best practices (one that many core labs model their deliverables on). The latter I refer to as the “BWA/GATK” pipeline and will reflect many pipelines in production to produce both research and clinical use exomes and genomes.
So with SVS as our workhorse, it was no problem at all to import all three of these source’s variants and do a little comparison.
So here is the big picture:
Note that both IGN and the BWA/GATK pipeline are based on HiSeq 2000 which produced 100bp paired-end reads. So they use the same sequencing technology, but significantly different alignment and variant calling techniques. IGN uses the ELAND aligner and CASAVA variant calling, versus BWA for alignment and then various steps to clean those up (de-duplication, quality score recalibration and local indel re-alignments) before calling variants with GATK.
The Illumina based sources do agree on another 10% of the variants not also called by Complete Genomics. Given that Complete Genomics uses an entirely different sequencing technology as well as alignment and variant calling algorithms, this isn’t that surprising.
What may be surprising is breaking these numbers down by Single Nucleotide Variants (SNVs), which we commonly think of as easy variants to call, and small Insertions/Deletions (InDels) which are known to be inherently more problematic.
To be fair, there are a lot of variants in these sets that would be filtered out as too low depth or poor quality in most analysis scenarios. Many of the variants that are called by only a single source probably fall into that camp.
It’s a bit tricky though to pick filtering thresholds in a consistently fair way between these different sequencing depths and platforms. More importantly, for my purposes, or any benchmarking use case, how does one know which of these three opinions of a variant’s definition is the truth?
My favorite way to find false positives with knowable truths is to look at de Novo mutations.
I didn’t have the full trio for the BWA/GATK pipeline, but you can immediately get a feel for how much “noise” remains in these genomes by looking at the overlap of de Novo candidates between the IGN and Complete Genomics pipeline. Note, I’m narrowly defining de Novo events to just those sites where the child is heterozygous and both parents are homozygous reference.
Since it has been established that the number of true de Novo SNVs in a genome is less than 100, we can expect one or more of the three individual genotypes at each of those ~150k sites in each source to be called incorrectly (i.e. they are false-positives).
Even the ~4000 sites where the two sources agree, the number of events is highly inflated. Although one could argue we can expect a cell line like NA12878 to have a higher mutation rate than fresh germline samples, we are still an order of magnitude off here.
If we were to start to iterate through these ~4000 de Novo sites, we would see many of the enumerated confounders I listed above well exemplified.
Although I will go into this in more detail in Wednesday’s webinar, here is a great example:
But of course, you will also find some real de Novo mutations as well, such as this one.
Don’t Panic, Just Bring Your Towel
By definition, we are looking at the worst and hardest to get right variants in these examples. It’s easy to feel a sense that these are lost causes. Hopeless cases.
But if I could inscribe some words on the cover of your newly minted genomes, they would be: DON’T PANIC.
While galactic hitchhikers may find a towel to be an indispensable utility, genome explorers should always be toting a genome browser as they come out of hyperspace and into a genomic coordinate of interest.
So the good news is, you can learn to spot these suckers a kilobase away with a good browser. (Again, sign up for my webinar for more examples and training to do just that).
And the second piece of good news is that after years of frustratingly slow progress, these problem variants are being tackled by a whole new generation of offensive tactics.
Here are a couple that I’m aware of, maybe you can think of even more. If you can, please list them in the comments section.
- NIST is working on defining a “Genome in a Bottle” reference whole genome variant sets so we have both a gold standard variant set and common strategy in which to describe variations from the reference.
- The Genome Reference Consortium is coming out with GRCh38 this summer, applying roughly 200 patches to fix problematic regions of the genome.
- Additionally, GRCh is advocating thinking about the reference as a “graph” with multiple alternative paths for regions like the MHC with high population allelic diversity.
- The 1000 Genomes Project and others are thinking about classes of variants like multi-nucleotide variants as a vector of improvement.
- The GATK team is building their new variant calling framework in a haplotype aware model (called aptly the HaplotypeCaller).
- Read lengths are increasing and alignment algorithms like BWA-SW are being designed to handle them in ways that should improve detection of InDels.
- Long-read technologies such as Moleculo from Illumina could significantly improve calling of structural variants and other places with large scale divergences of a sample’s consensus sequence from the human reference.
- Start-ups like Real Time Genomics are focusing on the problem of “clinical grade” genomes and doing things like calling trio genotypes in a pedigree aware manner.
I don’t expect any of these individually to be a silver bullet. But in concert, I’m looking forward to some remarkable improvements in the coming year.
In the meantime, keep that genome browser handy and don’t panic.
Note: Thanks to Autumn Laughbaum for helping with the concordance analysis and graph generation.
Note 2: If you are just getting into genome exploration, a good start might be my Hitchhiker’s Guide to Next Generation Sequencing (Parts One, Two and Three). Yes, it may be equally laden with Douglas Adams references.