To say the announcement of Dan MacArthur’s group’s release of the Exome Aggregation Consortium (ExAC) data was highly anticipated at ASHG 2014 would be an understatement.
Basically, there were two types of talks at ASHG: those that preceded the official ExAC release talk and referred to it, and those that followed the talk and referred to it.
Why is this a big deal? Well, since they have generously published their data not only in a great online browser but also in a downloadable VCF file, let’s take a look at the data and see for ourselves.
ExAC is the latest addition to a class of key public resources for doing variant analysis: the population catalog.
Population catalogs can have different goals, but they all let you annotate a set of variants from your own samples with the critical information of which variants are rare, which are common, and which are specific to particular ethnicities or origins.
We spend a lot of time curating population catalogs, because knowledge of a variant’s prevalence is a very powerful filtering criterion. Converting the raw data from these catalogs into well-formatted fields makes it easy to meaningfully bin your own variants, whether it’s on allele frequencies, germline versus somatic origins, or clinical significance.
While ExAC has just arrived, the established variant catalogs have not been standing still. Here is a review of the existing catalogs and their recent updates. Note, all of these sources are available in the Golden Helix public annotation repository accessible from GenomeBrowse, VarSeq, and SVS.
NHLBI EVS/ESP (6500)
It doesn’t seem that long ago that I remember the buzz at ASHG 2011 when the NHLBI was releasing 2,600 exomes through the EVS server (http://evs.gs.washington.edu/EVS/). Since then, multiple releases have brought the total number of exomes in this well-annotated catalog to 6,500. These exomes are now included in ExAC! (Update: Dan MacArthur clarified that only a subset are included in ExAC due to dbGaP restrictions.)
1000 Genomes Phase 3
While it has the comparatively small set of 2,500 individuals in its variant catalog, the 1kG project does sequence and call variants across whole genomes, bringing the total variant sites called in its latest Phase 3 release to 79 million! The earlier Phase 1 variant set was called from 1,094 individuals and contained 39.5 million unique variant sites.
COSMIC is a population catalog, but instead of cataloging germline mutations found in the population, it catalogs somatic variants discovered in cancer samples. With the integration of variants from The Cancer Genome Atlas, version 71 is the most comprehensive catalog of somatic mutations ever created and provides details about the tumor types and histology in which they occur. We also provide COSMIC’s gene-level annotations, with relevant summary and curated oncology details.
Providing the critical service of enumerating unique variants with accession numbers in the form of RS IDs, the NCBI dbSNP database ends up integrating most other germline variant catalogs. The allele frequency information in the database is provided directly from the 1000 Genomes Phase 1 release, but dbSNP contains variants from many other sources as well (some of more dubious quality, but many flags are provided to weed these out).
With monthly data releases, NCBI ClinVar is an up-to-date repository of annotations about the clinical significance of variants as they relate to specific diseases. Many labs have contributed their archive of these classifications, and continue to submit on a regular basis their findings from ongoing clinical diagnoses. Nearly every gene with a known clinical trait association is well represented in ClinVar, and NCBI does an excellent job cross referencing each variant/disease classification with relevant clinical resources.
While not strictly a population catalog, the database for Non-Synonymous Function Predictions (dbNSFP) does exhaustively enumerate every single-letter substitution in the coding regions of the human genome that alters the protein sequence, and annotates each one with the output of many functional prediction and conservation score algorithms. The latest version has 9 functional prediction algorithms and 7 conservation scores applied to 90 million non-synonymous variants.
Not just for GRCh37
It’s impressive to see how quickly public annotations have become available for GRCh38, released just this last December. Care must be taken when creating population catalogs for GRCh38, because variant sets generated from NGS data are aligned and called against a specific reference assembly.
While 1000 Genomes has no plans (last time I had a chance to ask them) to re-align and call variants in GRCh38 coordinates, dbSNP has remapped all its cataloged variants (with the 1000 Genomes frequencies attached) since build 141. Similarly, the NHLBI variant set and dbNSFP have been “lifted over” to GRCh38 in their latest releases.
An Order of Magnitude Bump
But what is crazy about the ExAC variant set is the absolute magnitude of their sample set. A whole order of magnitude larger than the previously largest exome catalog, it promises to expand our list of known variants dramatically… and it absolutely does.
The first thing this size of a cohort affords is a diversity of distinct population sub-groups. In this case, the bioinformatics team used PCA to stratify samples into the 7 population groups of:
- AFR: African & African American
- AMR: American
- EAS: East Asian
- FIN: Finnish
- NFE: Non-Finnish European
- SAS: South Asian
- OTH: Other
For each of these groups, there is a set of fields for each variant that provides the following informative counts:
- Alt Allele Counts: Counts for each observed alt allele
- Chromosome Counts: Total chromosomes called (usually 2xSampleCount)
- Homozygous Alt Counts: Number of samples with homozygous genotype
- Heterozygous Alt Counts: Number of samples with a heterozygous genotype
What this means is that you can, for example, ask if there are any variants in NGLY1 that are homozygous in Americans.
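As a rough illustration of how these per-population counts can be used, here is a minimal Python sketch that derives an alt allele frequency and checks for homozygotes in one population group. The INFO key names (`AC_AMR`, `AN_AMR`, `Hom_AMR`, `Het_AMR`) and the example record are assumptions made for illustration; check the header of the VCF you download for the actual field names.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def population_summary(info, pop):
    """Summarize one population group for a single VCF record.

    The key pattern AC_<pop>, AN_<pop>, Hom_<pop> is an assumption here,
    not something confirmed against the ExAC header.
    """
    fields = parse_info(info)
    alt_counts = [int(x) for x in fields[f"AC_{pop}"].split(",")]
    chrom_count = int(fields[f"AN_{pop}"])
    hom_counts = [int(x) for x in fields[f"Hom_{pop}"].split(",")]
    # Allele frequency = alt allele count / total chromosomes called.
    alt_freqs = [ac / chrom_count if chrom_count else 0.0 for ac in alt_counts]
    return {
        "alt_freqs": alt_freqs,
        "hom_counts": hom_counts,
        "has_homozygote": any(h > 0 for h in hom_counts),
    }


# Made-up record: 3 alt alleles on 1000 chromosomes, 1 homozygote, 1 heterozygote.
example_info = "AC_AMR=3;AN_AMR=1000;Hom_AMR=1;Het_AMR=1"
summary = population_summary(example_info, "AMR")
print(summary)
```

Restricting this to the records that fall inside a gene of interest would answer the kind of question posed above.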
The last field of the 2.9GB compressed ExAC VCF is the very large “CSQ” field containing consequence predictions from the Ensembl Variant Effect Predictor (VEP) for each variant. This field is a comma-delimited list of predictions, one per alternate-allele/transcript pair, with each field of a prediction delimited by ‘|’ and list items within those fields delimited by ‘&’. Got that?
With such great content in there, such as per-transcript SIFT/PolyPhen predictions and the VEP-provided sequence ontology and HGVS nomenclature, we thought it would be worth doing the work to make this dense, triple-delimited field usable as an annotation and filtering source.
So that’s what we did. Alongside the ExAC Variant Frequencies source, which contains all the frequencies and counts I described above, we are going to release a second ExAC VEP Annotations source that breaks out all 53 million alternate-allele/transcript pairs into well-formatted and “listified” fields with hyperlinks and filterable categories. This essentially allows you to have VEP annotations very cheaply for all known variants.
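To make the three levels of delimiters concrete, here is a small Python sketch that unpacks a CSQ value into one record per prediction. The column names and the example value are hypothetical and shortened for illustration; in a real file the column order is declared in the CSQ line of the VCF header, and this is not the code we use to build our annotation source.

```python
def parse_csq(csq_value, field_names):
    """Unpack a CSQ string into a list of dicts, one per prediction.

    ',' separates predictions, '|' separates fields within a prediction,
    and '&' separates list items within a field. Empty fields become
    empty lists so every value has the same type.
    """
    records = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        record = {}
        for name, value in zip(field_names, values):
            record[name] = value.split("&") if value else []
        records.append(record)
    return records


# Hypothetical, shortened column list; real files declare many more columns.
columns = ["Allele", "Gene", "Feature", "Consequence"]
example_csq = (
    "T|GENE1|TRANSCRIPT1|missense_variant&splice_region_variant,"
    "T|GENE1|TRANSCRIPT2|synonymous_variant"
)

predictions = parse_csq(example_csq, columns)
for prediction in predictions:
    print(prediction["Feature"], prediction["Consequence"])
```

Keeping every field as a list, even when it holds a single item, is a design choice that makes later filtering uniform: a check such as `"missense_variant" in prediction["Consequence"]` works the same way for every record.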
Yes, we provide all these annotations through our own algorithms and from other population catalogs directly, but as always in research, it never hurts to have multiple independent data points.
VarSeq Chomps on ExAC for Lunch
As I promised at the beginning, let’s actually do some analysis on ExAC to see what the most expansive variant population catalog in the world looks like.
If you haven’t heard before now, Golden Helix just launched VarSeq 1.0, a variant annotation and filtering tool designed to scale to thousands of exomes. Naturally, I fired it up and imported the ExAC variant set. Note that I did this whole analysis, from import to annotation, filtering and some investigation, on my desktop computer over the course of my lunch yesterday.
Keep in mind that the uncompressed VCF representing all the data in this table, which I’m able to scroll through effortlessly on my desktop, is 20GB in size!
From the histogram of the Alt Allele Frequency column created by simply clicking on it, we are not surprised to find that very rare variants dominate the dataset.
Using VarSeq’s algorithm catalog, I ran the Variant Type algorithm and in less than 30 seconds, had a breakdown of the types of variants in ExAC.
And finally, to really get a sense of how many of these millions of variants we have seen in the population catalogs I mentioned above, I annotated against each one and then added filters to the VarSeq filter chain to get a list of all variants seen in any of these. Note that each annotation took 2-3 minutes and this all occurred on my modest desktop without exceeding 1.5GB of RAM. This is laptop-duty work for VarSeq.
I added two thresholds to the frequency based population catalogs to point out that nearly all the annotated variants are below the 5% frequency threshold. dbSNP contains the most known variants.
But what is amazing about this analysis is that 6,637,439 variants in ExAC have not been cataloged in any population catalog.
In essence, ExAC has just tripled the number of known variants in coding regions of the human genome!
Finally, I thought it would be interesting to investigate those 5,670 variants in ExAC that have a Pathogenic clinical significance annotated in ClinVar. After sorting on allele frequency, it might be surprising to find that some variants marked as “Pathogenic” in ClinVar have frequencies close to 1. This finding highlights the importance of not assuming that the reference allele is always the clinically neutral allele!
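One way to see why a frequency close to 1 matters: when the alternate allele is that common, the reference allele is the rare one at that site. The Python sketch below makes that flip explicit for a simple bi-allelic site. The function name and the example values are made up for illustration and are not taken from ExAC or ClinVar.

```python
def minor_allele(ref, alt, alt_freq):
    """Return (allele, frequency) for the rarer allele at a bi-allelic site.

    Assumes exactly one alternate allele, so the reference allele
    frequency is simply 1 - alt_freq.
    """
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must be between 0 and 1")
    if alt_freq > 0.5:
        # The alternate allele is the common one; the reference is rare.
        return ref, 1.0 - alt_freq
    return alt, alt_freq


# Made-up site where the alternate allele is seen on 98% of chromosomes.
allele, freq = minor_allele("A", "G", 0.98)
print(f"minor allele: {allele}, frequency about {freq:.2f}")
```

For a site like this, any rare-variant filter applied to the alternate allele frequency alone would miss that it is the reference allele that is uncommon in the population.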
Invaluable in Unknowable Ways
I can’t imagine what this catalog means to all the researchers out there studying human genetics.
I know that while I was sitting in the ASHG session when ExAC went live, I quickly forwarded the gene detail page on NGLY1 to Matt Might. After having his son diagnosed with NGLY1 deficiency, Matt initiated an amazing global network of family advocates for the disorder, has drawn the attention of the national press, and was even plugged in the ASHG presidential address. Constantly in search of a better understanding of this disease’s prevalence, Matt was amazed to see the details of the loss-of-function variant he carries and passed on to his son.
And that’s my two SNPs.