In the past couple of weeks, the topic of the Filter and Quality fields in the popular ExAC population catalog has come up a number of times. It turns out that unlike the 1000 Genomes project, which filtered its variant list heavily to contain only variants it considers high quality, ExAC chose to include more dubious variants but to flag them with the QC checks they failed.
This is a fine decision on their part, but it requires users of the ExAC catalog to treat any QC-flagged variant with a high level of suspicion. In fact, you should consider any variant in your data that matches a QC-flagged variant as if it were not in the catalog at all. (Edit: See comment below from Dan MacArthur, PI and team lead for ExAC/gnomAD, on a better phrasing of this qualification)
To make this case, I will cover a rare pathogenic variant in my own exome that is truly novel with regard to public population catalogs, but shows up in ExAC as a false-positive, QC-flagged variant.
The Variant That Wasn’t
With the massive new set of clinical assertions added to January’s ClinVar by Illumina, I thought it might be a good time to revisit my own exome’s variant analysis.
Using VarSeq, this is quite a straightforward process. I can just open up my existing personal exome analysis project and follow the notification in the top-right of the project window to update all annotations to their latest versions (including ClinVar).
With the latest annotations, a variant I carry in a heterozygous state caught my attention: NM_002769.4(PRSS1):c.86A>T (p.Asn29Ile) is listed in ClinVar as Pathogenic for Hereditary Pancreatitis, in a gene that OMIM documents as acting in a dominant model. This means my single mutated copy would be sufficient to cause the disease!
Under the Summary Evidence tab in ClinVar, Illumina provided a very thorough clinical analysis of the variant and its relationship with hereditary pancreatitis (abbreviated here):
Across a selection of available literature, the c.86A>T (p.Asn29Ile) variant, also referred to as p.Asn21Ile, has been reported in at least 160 hereditary pancreatitis (HP) patients (Gorry et al. 1997; Ferec et al. 1997; Otsuki et al. 2004; Sahin-Toth et al. 2006; Lee et al. 2011; Wang et al. 2013).
[…] In addition, in a review of a variant database maintained by Leipzig University in Germany, Sahin-Toth et al. (2006) observed that the p.Asn29Ile variant was the second most common PRSS1 variant associated with HP, accounting for approximately 25% of all identified pathogenic alleles in PRSS1.
The p.Asn29Ile variant was absent from 382 controls and is not found in the 1000 Genomes Project, the Exome Sequencing Project, or the Exome Aggregation Consortium.
[…] Based on the collective evidence, the p.Asn29Ile variant is classified as pathogenic for hereditary pancreatitis.
If you carefully examine the GenomeBrowse screenshot above, you will notice that although Illumina claimed this variant was absent from ExAC, it actually showed up with a 47% allele frequency!
A Decoy that Improves Alignments
Zooming out of my GenomeBrowse window made me start to suspect the validity of alignments and variant calls for my exome in this gene.
It is clear from the gene-level perspective that the highlighted variant I was investigating was not the only candidate mutation in what is documented as a highly conserved dominant-model gene.
I will skip the investigative journey and sleuthing steps that followed, but here is the upshot of what is going on:
- The 1000 Genomes consortium added a “decoy” sequence to their GRCh37 human reference that was generated by Heng Li in 2011. The decoy contains real human sequence that was not properly placed in the known-to-be-incomplete GRCh37 human reference. The idea was that by letting reads from this real human sequence align to the decoy, they would not get incorrectly aligned to second-best places along the reference genome.
- The ESP6500 exome project used this “b37+decoy” (aka hs37d5), as did the 1000 Genomes project on all their releases
- The ExAC project did not use this decoy in their alignments
- PRSS1 has a pseudogene paralog that was not fully represented in the primary sequence of GRCh37, but was captured in the decoy
- My exome and ExAC thus called many variants in PRSS1 that are not truly there, but are the best attempt to align reads for a gene that was not fully present in GRCh37
- GRCh38 no longer needs this decoy, and alignments to PRSS1 don’t suffer from these misplaced reads
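If you want to know which situation your own alignments are in, one quick check is to look for the decoy contig in the header of your BAM/SAM file. The following is only a minimal sketch: it assumes the decoy shows up as an `@SQ` line named `hs37d5` (the contig name used in the b37+decoy reference), and the helper names and the example header are made up for illustration.

```python
def reference_contigs(sam_header_text):
    """Return the contig names listed on the @SQ lines of a SAM header."""
    contigs = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; SN holds the contig name.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                contigs.append(field[3:])
    return contigs


def aligned_with_decoy(sam_header_text, decoy_name="hs37d5"):
    """True if the reference used for alignment included the decoy contig.

    decoy_name is an assumption: adjust it if your reference names the decoy differently.
    """
    return decoy_name in reference_contigs(sam_header_text)


# Illustrative header only, not taken from a real file.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:7\tLN:159138663\n"
    "@SQ\tSN:hs37d5\tLN:35477943\n"
)
print(aligned_with_decoy(example_header))
```

The header text itself can be obtained with a tool such as `samtools view -H`. If the decoy contig is missing from a GRCh37 alignment, calls in genes like PRSS1 deserve a closer look.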
As a double-check, here is the same PRSS1 gene in my exome aligned to GRCh38:
And what about ExAC’s QC flags?
This variant is flagged as InbreedingCoeff_Filter both in the online ExAC Browser and in the Filter field of the ExAC annotation source in VarSeq.
I suggest relying on this field to screen your ExAC annotations for false positives, and using it as a prompt to carefully reconsider the validity of the matching variant calls in your own data, as I did here!
And to pre-empt questions about gnomAD (ExAC expanded to 123K exomes): yes, this variant shows up in the gnomAD Browser and has QC flags applied to it there as well.
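Outside of VarSeq, the same screening can be sketched against the FILTER column of a VCF-style record. This is a rough illustration of the rule of thumb from the top of this post (a flagged catalog entry contributes no usable frequency), not ExAC's own tooling: the function names are hypothetical, the example records use placeholder positions, and it assumes the allele frequency is stored under an `AF` key in the INFO column.

```python
def failed_filters(filter_field):
    """Return the list of QC filters a record failed (empty if none)."""
    # "PASS" means every filter passed; "." means no filters were applied.
    if filter_field in ("PASS", "."):
        return []
    # Multiple failed filters are separated by semicolons in VCF.
    return filter_field.split(";")


def trusted_frequency(vcf_line):
    """Return the catalog allele frequency, or None if the record is QC flagged.

    Assumes the standard tab-separated VCF column order and an AF key in INFO.
    """
    cols = vcf_line.rstrip("\n").split("\t")
    filter_field, info = cols[6], cols[7]
    if failed_filters(filter_field):
        return None  # treat a flagged entry as if it were absent from the catalog
    for entry in info.split(";"):
        if entry.startswith("AF="):
            # Use the first value when several alternate alleles are listed.
            return float(entry[3:].split(",")[0])
    return None


# Placeholder records for illustration; positions and qualities are made up.
flagged = "7\t1000\t.\tA\tT\t50\tInbreedingCoeff_Filter\tAF=0.47"
passing = "7\t2000\t.\tG\tC\t50\tPASS\tAF=0.01"

print(trusted_frequency(flagged))
print(trusted_frequency(passing))
```

Returning `None` rather than the flagged frequency keeps a downstream "rare variant" filter from discarding a variant like mine just because a dubious catalog entry claims it is common.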
We are working on curating the very recently released gnomAD exome and genome allele frequency data and will follow up with more on that shortly.