The VarSeq clinical platform is built on a strong foundation of data curation and annotation algorithms to ensure the variants identified have all the information required to make the correct clinical assessments.
It’s easy to make light of “variant annotation”, but the details run very deep into the roots of how we represent genomic data, how public data is aggregated, stored and retrieved, and how the molecular biology changes with human variation in the genome.
If you want to understand these details better yourself, look no further than the Golden Helix blog. Here we lay out our research, comparisons, and strategies, often taking a deep dive into the annotation algorithms themselves as well as the data curation procedures that go into making VarSeq, SVS, and GenomeBrowse.
Here are the top 10 blog posts to read if you’re interested in mastering clinical grade genomic variant annotation:
10- Between Two Bases: Coordinate Representations for Describing Variants
This post provides the fundamentals of the coordinate systems used to describe both variants and the genomic features they interact with. If you’ve ever cracked open a VCF file and scratched your head, this should help clear things up and make you feel confident navigating any genomic database.
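As a small illustration of why coordinate conventions matter, here is a minimal Python sketch that converts between the 1-based, inclusive positions used by VCF and the 0-based, half-open intervals used by BED. The function names are hypothetical and chosen only for this example; they are not taken from the post or from any Golden Helix product.

```python
def one_based_closed_to_zero_based_half_open(start, end):
    """Convert a 1-based, inclusive interval (VCF/GFF style) to a
    0-based, half-open interval (BED style)."""
    return start - 1, end


def zero_based_half_open_to_one_based_closed(start, end):
    """Convert a 0-based, half-open interval (BED style) back to a
    1-based, inclusive interval (VCF/GFF style)."""
    return start + 1, end


def vcf_record_to_bed_interval(pos, ref):
    """Return the BED-style interval covered by the REF allele of a
    VCF record, where pos is the 1-based position of REF's first base."""
    bed_start = pos - 1
    return bed_start, bed_start + len(ref)


# A three-base REF allele starting at VCF position 100 covers bases 100..102,
# which in BED terms is the interval [99, 102).
interval = vcf_record_to_bed_interval(100, "ACG")
```

In the half-open form the interval length is simply end minus start, which is one reason that convention is popular for storing genomic features.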
9- The New Human Genome Reference and Clinical Grade Annotations: It’s All About the Coordinates
A case can be made for and against using the new human reference genome. This post reviews how the human genome reference is fundamental to your annotation procedures, and what you need to know to make the choice that is right for your lab.
8- Variant Normalization: Underappreciated Critical Infrastructure
Before variants can be annotated for clinical interpretation, we must first agree on how to represent them. This blog post is great if you have ever wondered why a VCF file has a different start position for an InDel than dbSNP, or what the difference is between a multiple-nucleotide-polymorphism and a single-nucleotide-polymorphism.
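To make the idea concrete, here is a minimal Python sketch of the commonly used "left-align and trim" style of normalization. It is an illustration under stated assumptions, not the algorithm used in VarSeq: the function name `normalize_variant` is hypothetical, the reference is a toy in-memory string, and positions are 1-based as in VCF.

```python
def normalize_variant(pos, ref, alt, reference):
    """Left-align and trim a single REF/ALT pair.

    pos is the 1-based position of the first REF base and reference is
    the (toy) chromosome sequence as a plain string.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical; not a variant")

    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pad both with the preceding reference base.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend past the sequence start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


toy_reference = "GGGCACACACAGGG"
# A "CA" deletion written in the middle of the CA repeat...
shifted = normalize_variant(7, "ACA", "A", toy_reference)
# ...and a padded single-base substitution.
trimmed = normalize_variant(4, "CAC", "CGC", toy_reference)
```

With this toy reference, the deletion shifts to the left edge of the repeat as `(3, "GCA", "G")` and the padded substitution trims to `(5, "A", "G")`, which shows how two tools can report the same variant with different start positions.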
7- Supercentenarian Variant Annotation: Complex to Primitive
Variant representation and normalization are best understood through examples. This blog post covers the complexity of transforming single-sample variants, sequenced and called using the Complete Genomics platform, into a normalized and merged population catalog that can be used to compare your variants to some fascinating genomes.
6- Accurate Annotations: Updates to the NHLBI Exome Sequencing Project Variant Catalog
An interesting thing about the human reference genome is that the “reference” allele is not always the common or “major” allele in the human population. Read on to understand the difference between the MAF and the AAF and how you may want to treat “ref is minor” cases in your analysis.
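The distinction can be sketched in a few lines of Python. This is a simplified illustration that assumes a biallelic site; the function names and the allele counts are hypothetical and not drawn from the ESP catalog.

```python
def alt_allele_frequency(alt_count, total_alleles):
    """AAF: fraction of observed alleles that carry the alternate allele."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return alt_count / total_alleles


def minor_allele_frequency(aaf):
    """MAF at a biallelic site: frequency of whichever allele is rarer."""
    return min(aaf, 1.0 - aaf)


def ref_is_minor(aaf):
    """True when the reference allele is the rarer one in the population."""
    return aaf > 0.5


# Hypothetical counts: the alternate allele is seen on 9,000 of 10,000 chromosomes.
aaf = alt_allele_frequency(9000, 10000)
maf = minor_allele_frequency(aaf)
```

Here the AAF is 0.9 while the MAF is about 0.1, so a filter that treats "MAF below 1%" as rare behaves very differently from one applied to the AAF, and a sample that is homozygous for the reference allele is the one carrying the uncommon genotype.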
5- Annotating with gnomAD: Frequencies from 123,136 Exomes and 15,496 Genomes
The gnomAD project will dramatically improve the understanding of the landscape of variants in humans, and the extra diversity represented compared to ExAC has direct utility for many clinical testing labs. Yet it remains important, as this post outlines, to be aware of the bioinformatics artifacts embedded in this resource so as not to misinterpret the data.
4- What’s in a Name: The Intricacies of Identifying Variants
While we think of a “gene” as one functional unit of the genome, the reality is there are many isoforms or “transcripts” that are formed in different biological contexts. Which one should you use for reporting a variant’s change to a protein sequence? This blog post reviews the situation and explains our heuristics for choosing a Clinically Relevant Transcript to report.
3- Looking Beyond the Exons: Splice Altering Variants
Underrepresented in most bioinformatics pipelines, splice site variants need their own specialized algorithms to predict and classify their impact on gene function. This post reviews their importance as well as strategies to include likely splice-altering variants in your prioritized list of putative variants.
2- Updating VarSeq’s Transcript Annotation along with NCBI RefSeq Genes Interim Release
One of the most critical inputs to annotating variations and their function is knowing where they are in genes and how they change the coding and protein sequences. This requires mapping our databases of cataloged RNAs back to the DNA of the human genome. This post covers the important edge cases that arise from imperfect mappings and how those are improved in the latest updated RefSeq gene annotations we curated.
1- The State of Variant Annotation: A Comparison of AnnoVar, snpEff and VEP
Widely cited, this post takes a deep dive into the edge cases of classifying variants’ impacts on gene transcripts. Although it compares these three tools at a specific period in time, the difficulties uncovered remain pertinent today. Golden Helix has since implemented its own variant classification and HGVS reporting algorithm that takes the best of VEP and snpEff into account.
I hope you find these articles to be a great resource for further understanding clinical grade genomic variant annotation. If you have any specific questions or thoughts on this topic, I encourage you to share these in the comment section below! Happy reading.