There is nothing cooler than having something arrive that you have been excitedly waiting for: last week I got an email notification that my 23andMe exome results were ready.
Actually, I got 3 emails that my exome results were ready.
You see, I lucked out.
It all began two years ago on DNA day when Hacker News reported that 23andMe was running a special deal on their personal SNP-array genotyping and interpretation service. Looking across the room at my 7 month pregnant wife, I smiled and pulled out my credit card.
I then proceeded to enter in its numbers – 3 times.
Thankfully 23andMe allowed for returning your spit-in-a-tube DNA up to 6 months from the purchase of your order. Given one of my spit providers was my yet-to-be-born son, this was a very fortuitous policy.
Minus the frustrations of getting a newborn to provide what seems like a million little droplets of spit, the 23andMe customer experience turned out to be really quite entertaining and useful.
For example, based on roughly 1 million SNPs dispersed across the genome, 23andMe can predict fun traits like earwax type and the ability to taste bitter food. They can provide useful pharmacogenomic assessments like your sensitivity to Warfarin and how fast you metabolize caffeine. Finally, they can predict your lifetime risk of contracting common genetically-linked diseases such as Coronary Heart Disease, Type 2 Diabetes, and certain types of cancers.
But as I discussed in my recent post, the research behind 23andMe’s predictions is seriously limited in how much of the real genetic risk for these complex diseases it can account for using the common SNP-based genotyping the service provides.
There will probably be many ways in which further research and sequencing techniques will account for this missing genetic risk, but a promising research direction is examining rare variants in and around the protein-coding genes (the exome) that have the potential to directly influence biological function.
Sequencing human exomes has recently become affordable with Next Generation Sequencing. More importantly, this technology has been shown in many cases to be an effective tool for diagnosing rare, serious, and sometimes deadly genetic disorders.
By sticking to the common SNPs genotyped on a custom microarray, 23andMe has been keeping to the previously researched and well-annotated pieces of genomic information.
In contrast, exomes not only have the potential to uncover rare and clinically relevant variants, but in nearly every case are also likely to uncover damaging variants of unknown significance.
Stage Left: Your Exome
It is not something I saw through an advertisement or other direct promotion by 23andMe, but at some conference or talk I heard from within the genomic community: 23andMe will sequence your exome for $1,000!
Immediately, I thought: “That’s cool, but my exome is probably pretty boring… What good would an exome be without some interesting angle to investigate?”
Then I realized, there are two exomes I would be very excited to analyze: my wife’s and my son’s.
You see, my wife was diagnosed as a teenager with a relatively rare autoimmune disease. While the symptoms and progression of her disease are managed with the amazing advances of biologic-based therapies like Enbrel, the genetic architecture of immune disorders is complex and suspected to be highly influenced by rare and private mutations.
And, like many fathers, I have an insatiable curiosity about everything that has to do with my son.
What could be more indulgent of that curiosity than being able to browse his DNA? I imagined, at times, that by watching lists of variants and genes scrolling by, I’d get insights into the fascinating, complex creature I watch with loving obsession each day.
So then the idea came to me, and I couldn’t let it go.
I have a trio I could analyze!
Third Sample Is the Charm
There are good reasons why clinical researchers who diagnose rare Mendelian diseases almost always ensure they can sequence not only the affected child but the father and mother as well (and preferably have the option to include living grandparents as needed).
A trio (a father, mother, and child sample set) gives you a lot of unique analytic capabilities. For example, you can use the fact that all variants in the child should either be inherited from a variant in the mother or the father to filter out spurious or low-quality variant sites in the child.
Well, unless the variant actually occurred de novo. But given the natural mutation rate of humans from generation to generation, we only expect to see, on average, about 1 de novo mutation in our exomes per generation.
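To make the trio filter concrete, here is a minimal sketch of a Mendelian consistency check. It is my own illustration, not how 23andMe or any particular tool does it. The function names are hypothetical, genotypes are represented as simple pairs of alleles, and it assumes an autosomal site with confident calls in all three samples.

```python
def is_mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be formed by taking
    one allele from the mother and one allele from the father.

    Each genotype is a pair of alleles, e.g. ("C", "T").
    Assumption: autosomal site, so each parent contributes exactly one allele.
    """
    first, second = child
    return (first in mother and second in father) or (
        first in father and second in mother
    )


def classify_site(child, mother, father):
    """Label a site as inherited, or as a de novo candidate / likely error."""
    if is_mendelian_consistent(child, mother, father):
        return "inherited"
    # Inconsistent sites are either genuine de novo mutations (rare)
    # or, far more often, spurious or low-quality calls.
    return "de_novo_or_error"


# Child is C/T, mother is C/C, father is C/T: the T can come from the father.
print(classify_site(("C", "T"), ("C", "C"), ("C", "T")))
# Child is C/T but both parents are C/C: the T has no parental source.
print(classify_site(("C", "T"), ("C", "C"), ("C", "C")))
```

Since only about one true de novo mutation is expected per exome, nearly every site flagged as inconsistent is a candidate for filtering rather than a discovery.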
A trio can also be used to analyze the inheritance of variants, such as when two carrier parents pass on their risk allele to a child.
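As a small worked example of that carrier scenario, the sketch below enumerates the possible offspring genotypes when each parent passes on one of their two alleles with equal probability. The allele labels ("A" for the working copy, "a" for the risk allele) and the function name are placeholders I made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotype_probabilities(mother, father):
    """Enumerate every (maternal allele, paternal allele) combination
    and return the probability of each unordered offspring genotype.

    Assumption: simple Mendelian segregation at an autosomal site.
    """
    outcomes = Counter(tuple(sorted(pair)) for pair in product(mother, father))
    total = sum(outcomes.values())
    return {genotype: Fraction(count, total) for genotype, count in outcomes.items()}


# Two carrier parents, each with one working copy and one risk allele.
probs = offspring_genotype_probabilities(("A", "a"), ("A", "a"))
for genotype, probability in sorted(probs.items()):
    print(genotype, probability)
```

For two carriers this gives a 1 in 4 chance of a child with two risk alleles, a 1 in 2 chance of another carrier, and a 1 in 4 chance of a child with no risk allele at all.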
As you can imagine, this opens the door for new and exciting things to discover and speculate about.
Given that the 23andMe exome pilot project by default limited you to one exome per account, I was very excited when they agreed to sequence my whole trio. (When I talked to Brian from 23andMe at the TCGC conference recently, I learned that mine was one of just two trios in their pilot!)
I was even more excited when the data arrived last week.
A Sneak Peek at my Results
While a more detailed analysis will be forthcoming, I’ve already had some fun diving into my exome.
Each sample arrives in an encrypted bundle, with the decryption keys stored in your 23andMe account. That bundle of data, weighing in at around 10GB, is not intended for casual use.
It contains three things:
- A PDF report, giving you a high level overview of the variants called on your exome, and a small report on the rare variants that fall within genes with known Mendelian disorders associated with them.
- A VCF file of your variants (weighing in at a measly 7MB)
- A BAM file of your aligned sequence data (roughly 10GB), generated by a HiSeq2000 and the Agilent exome capture kit.
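For anyone curious what is inside that small VCF file, here is a rough sketch of reading one with plain Python. VCF is a tab-separated text format in which lines starting with `#` are headers. The two records below are made up for illustration (the first reuses a position discussed later in this post, the second position is arbitrary, and the QUAL, FILTER, and INFO values are placeholders); they are not copied from my actual file. The helper names are hypothetical.

```python
import io

# Illustrative VCF content only; not real output from 23andMe.
SAMPLE_VCF = """##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE
chr10\t50680422\t.\tC\tT\t50\tPASS\t.\tGT\t0/1
chr10\t50680500\t.\tG\tA\t50\tPASS\t.\tGT\t1/1
"""


def read_variants(handle):
    """Yield one dict per variant record, skipping header and blank lines.

    Assumption: a single-sample VCF, so the sample sits in the 10th column.
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        # FORMAT names the per-sample keys; pair them with the sample values.
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        yield {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "gt": sample.get("GT")}


def is_heterozygous(genotype):
    """True when the two called alleles differ, e.g. 0/1 or 0|1."""
    alleles = genotype.replace("|", "/").split("/")
    return len(set(alleles)) > 1


variants = list(read_variants(io.StringIO(SAMPLE_VCF)))
hets = [v for v in variants if is_heterozygous(v["gt"])]
print(len(variants), "variants,", len(hets), "heterozygous")
```

For a real file you would pass `open("your_file.vcf")` instead of the in-memory string, and a dedicated VCF library would be a more robust choice than this hand-rolled parser.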
Dr. Jung Choi recently blogged about the 23andMe variant overview PDF, so I’ll refer to his description of this report.
I’m sure many other recipients have been browsing through their report and a few may even be diving into their raw variant list.
But something I know I wanted to do for this post, something I know others receiving their exome data simply cannot, was to utilize that largest component of my exome data download: the actual sequence data in the BAM file.
Stage Right: Enter Golden Helix GenomeBrowse™
The Product Development team at Golden Helix has spent a good chunk of the last year heads down on something we think might just change the game when it comes to visualizing genomic data.
It was built from the ground up on the principles of superb user experience and on a foundation of solid engineering.
So, naturally, the first thing I did when I saw that BAM file show up in my data folder was to double-click the file and watch Golden Helix GenomeBrowse™ start rendering my exome.
If you have ever installed Google Earth, you may have experienced the addictive thrill of taking the view of the entire earth as a globe and smoothly scrolling your mouse wheel while you strategically adjust the exact point at which you fly from outer space down to your own back yard.
Let me tell ya, it’s more addictive when you’re flying from a list of chromosomes, to a single arm of chromosome 10, down past the chromosome banding stains, through a gene cluster and into an exon of your own gene that has an interesting highlighted variant. And then, intuitively, you click and grab that plot to pan around.
It’s just… fun.
But just like staring at a globe and wondering where to zoom next, you quickly decide it might make sense to type in the address you want to investigate instead of panning around the whole city looking for a street name to pop up.
In my exome PDF report, there were 14 variants that were rare heterozygous nonsynonymous mutations in genes with known clinical implications.
Here is an example of the first variant (and in fact, one of the most interesting):
What’s immediately important about this variant is that it changes one amino acid in the protein this gene encodes, it’s quite rare, and it occurs in a gene linked with severe genetic disorders.
Here is what it looks like when I type in chr10:50680422 into the GenomeBrowse location bar:
Reading up on ERCC6 in the OMIM database (a repository of publications related to Mendelian disease and the genes implicated), it looks like ERCC6 is an important gene involved in DNA repair and gene regulation. It has been associated with Age Related Macular Degeneration (ARMD), UV light sensitivity, and a severe rare disease causing skeletal and developmental issues called Cockayne syndrome.
By looking into the details of these published findings, it’s clear that ERCC6 is haplosufficient, as there were a number of reports of carriers of the studied mutations being unaffected.
But how likely is it that this C to T mutation actually does “knock out” or inactivate one of my two copies of ERCC6? Well, I used SVS to annotate this variant with various public databases and tools to find out.
|1kG Overall Freq||European Freq||Asian Freq||African Freq||NHLBI Freq||NHLBI 6500 Genotype Counts|
Frequency information for chr10:50680422 C/T
|HGVS Protein||PolyPhen2 Score||PolyPhen2 Prediction||GERP++ RS||PhyloP||PhastCons|
Functional Prediction and Conservation of chr10:50680422 C/T
It turns out this variant is not only rare, but present only in Europeans. Its frequency is even lower in the NHLBI Exome Sequencing Project than in the 1000 Genomes sample set. Out of the 6,503 individuals from the NHLBI Exome Sequencing Project, not a single one has this variant in both copies of ERCC6 (as a homozygous variant) and only 19 have it in one copy of ERCC6.
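Those genotype counts translate into an allele frequency with a little arithmetic, sketched below. It assumes all 6,503 individuals were successfully genotyped at this site, and the expected homozygote count additionally assumes Hardy-Weinberg proportions. Both are simplifications of mine, and the function names are hypothetical.

```python
def allele_frequency(het_count, hom_alt_count, n_individuals):
    """Alternate allele frequency from genotype counts.

    Each individual carries two copies of the gene, so the denominator
    is twice the number of individuals.
    """
    return (het_count + 2 * hom_alt_count) / (2 * n_individuals)


def expected_homozygotes(freq, n_individuals):
    """Expected number of homozygous carriers under Hardy-Weinberg."""
    return freq ** 2 * n_individuals


# 19 heterozygous carriers and 0 homozygous carriers out of 6,503 individuals.
freq = allele_frequency(het_count=19, hom_alt_count=0, n_individuals=6503)
print(f"allele frequency: {freq:.5f}")
print(f"expected homozygotes: {expected_homozygotes(freq, 6503):.3f}")
```

That works out to an allele frequency of roughly 0.15%, and at that frequency we would expect far fewer than one homozygous individual in a sample of this size, so seeing none is unsurprising on its own.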
Besides it just being rare, there are tools such as PolyPhen2 to measure how likely a given mutation is to damage or make inoperable the protein encoded by a gene. These tools mine databases of protein structure information and also look at how well conserved a given amino acid is across all the species that share a given gene. GERP++ Rejected Substitutions scores, PhyloP, and PhastCons are all measurements of conservation as well.
In summary, there is a good chance ERCC6 p.Arg975Gln is indeed a damaging mutation, and it’s probably a good thing my wife is not also a carrier of this variant.
Rare Variant? Rare in Which Population?
When annotating my variants in SVS, I noticed something about the 14 variants 23andMe had prioritized. Many of them, like chr10:50680422 C/T, were unique to Europeans.
In fact, if I were to use the allele frequency within just the 379 Europeans from the 1000 Genomes Project, half of the 14 “rare” variants have an allele frequency greater than or equal to 1%.
I once heard a population geneticist say at a conference: All variants are common in some sub-population somewhere on the globe.
So, when prioritizing the variants in my exome to investigate, I ranked them by their European allele frequencies and secondly by the annotations from SIFT, PolyPhen2, GERP++, and PhyloP. Luckily, all of these sources are aggregated in the dbNSFP 2.0 track we just added to SVS.
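The ranking itself is simple enough to sketch in a few lines. This is not how SVS does it internally, just an illustration of the two-level sort. The variant labels and every number below are invented placeholders, the 1% cutoff mirrors the threshold mentioned above, and I use only two of the scores to keep it short (higher PolyPhen2 meaning more likely damaging, higher GERP++ RS meaning more conserved).

```python
from dataclasses import dataclass


@dataclass
class Variant:
    label: str              # placeholder name, not a real variant
    european_af: float      # allele frequency among Europeans
    polyphen2_score: float  # 0 to 1, higher = more likely damaging
    gerp_rs: float          # higher = more conserved position


def prioritize(variants, max_european_af=0.01):
    """Keep variants that are rare in Europeans, then rank them:
    first by European allele frequency (rarest first), and second
    by the functional annotations (most damaging and conserved first).
    """
    rare = [v for v in variants if v.european_af < max_european_af]
    return sorted(
        rare,
        key=lambda v: (v.european_af, -v.polyphen2_score, -v.gerp_rs),
    )


# Invented example data, for illustration only.
candidates = [
    Variant("variant_a", 0.020, 0.99, 5.1),  # too common in Europeans
    Variant("variant_b", 0.001, 0.10, 1.2),
    Variant("variant_c", 0.001, 0.95, 4.8),
    Variant("variant_d", 0.004, 0.80, 3.0),
]
ranked = [v.label for v in prioritize(candidates)]
print(ranked)
```

Sorting on a tuple keeps the frequency as the primary criterion and lets the functional scores break ties among equally rare variants.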
After ERCC6, a variant in the gene STAR came up next.
Reading through the OMIM summary, it’s clear that like ERCC6, STAR is haplosufficient such that carriers of a single mutation are asymptomatic.
Reading about which very specific organs this gene is found to be active in, and what vital hormones it’s involved in regulating, I found myself a bit unsettled in my decision to be blogging about it. Let’s just say my manhood is hostage to those 20 measly reads that show I have at least one working copy of this gene.
In fact, the assumption that my C>T mutation is damaging to STAR is far from certain. The PolyPhen2 prediction is that the mutation is “Benign”, yet it is at a highly conserved locus.
In a clinical diagnostic setting, this bioinformatic information would simply motivate a more definitive orthogonal test if we had worked down to a short list of putative causal variants. For example, a western blot assay could be done to validate the reduction or annihilation of the STAR protein product in the source tissue.
So much information!
While it’s fun browsing my exome, as I knew going into this, there is only so much analysis to dream up with a single healthy individual.
Over the next couple of weeks, when I have a few hours to spare, I’ll be taking a deeper dive into my full trio.
With the possibility of finding recessively inherited variants, looking for de novo mutations, or playing with ways to interpret the genetic model of a complex autoimmune disease in the context of a single exome, I’m sure to have many interesting things to report.