When researchers realized they needed a way to report genetic variants in scientific literature using a consistent format, the Human Genome Variation Society (HGVS) mutation nomenclature was developed and quickly became the standard method for describing sequence variations. Increasingly, HGVS nomenclature is being used to describe variants in genetic variant databases as well. There are some practical issues that researchers should keep in mind when using HGVS notation with databases or any other sort of automated tool. I was recently involved in a project that attempts to automatically match DNA variations in a sample against a database of known pathogenic protein variants stored in HGVS notation. What my colleagues and I found is that matching against variants represented in HGVS notation can be very tricky.
The problem is that a given sequence variant can potentially be represented in many ways in HGVS nomenclature. Researcher A may choose one description of a pathogenic variant and add it to a clinical database. Researcher B may encounter the same variant in a patient but describe it differently from A. So when B searches the database, they fail to find potentially vital information.
I should point out that I’m not trying to criticize HGVS. Their nomenclature was designed to produce concise, accurate, and human-readable descriptions of variants for publication in academic journals, and I consider it a success in this regard. But the nomenclature is starting to be used in ways that I don’t think the authors originally intended, namely automated analysis and database queries. In my opinion, the nomenclature is still too ambiguous to be used in any sort of automated analysis.
Some Examples of Ambiguities
In some cases, HGVS clearly states how to resolve ambiguities. In other cases, the guidelines are murky. Consider the following short genomic DNA sequences:
Ref: GAAC
Alt: GTTC
At first glance this appears to be a 2 nucleotide substitution, described as “g.2_3delinsTT.” However, note that TT is the reverse complement of AA, so it could also be represented as an inversion: “g.2_3inv.” Which is correct? The guidelines don’t say. The notation hints at what mechanism caused this variation (inversion or substitution), but the reality is that we can’t actually know the mechanism behind some mutations. So, researchers are forced to choose arbitrarily. Insertions are also tricky. Consider:
Ref: GAAAC
Alt: GAAAAC
The guidelines do clearly state that duplication is preferred over insertion, so we might describe this as “g.4dup.” However, noting the long run of adenine nucleotides, a researcher might prefer to describe this as a repeated sequence: “g.2A.” Both notations accurately describe the variant, so which is preferred? In practice every database and tool I’ve seen so far describes this as either a ‘dup’ or an ‘ins’. This raises the question: when, if ever, should a researcher use the repeat notation? The HGVS guidelines aren’t entirely clear on the issue.
Further complicating the issue is that many databases and software tools (dbSNP, for example) ignore the recommendation to favor ‘dup’ over ‘ins,’ reporting this as “g.4_5insA.” The reason is that in order to differentiate between ‘dup’ and ‘ins,’ software must load and examine the flanking reference sequence around each variant, which slows performance considerably for software that has to process millions of variants.
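To make the ‘dup’ versus ‘ins’ distinction concrete, here is a minimal Python sketch of the check. The function name classify_insertion and its arguments are hypothetical, not taken from any real tool. It assumes the insertion has already been placed at its most 3′ position and that the whole reference sequence is available as a string, which is exactly the lookup that costs time at scale.

```python
def classify_insertion(ref: str, pos: int, inserted: str) -> str:
    """Describe an insertion between 1-based positions pos and pos+1 of ref.

    Hypothetical helper: returns a 'dup' description when the inserted bases
    repeat the reference bases that end at pos, otherwise an 'ins' description.
    Assumes the insertion is already at its most 3' position.
    """
    n = len(inserted)
    upstream = ref[max(0, pos - n):pos]  # the n reference bases ending at pos
    if upstream == inserted:
        start = pos - n + 1
        return f"g.{start}dup" if n == 1 else f"g.{start}_{pos}dup"
    return f"g.{pos}_{pos + 1}ins{inserted}"


print(classify_insertion("GAAAC", 4, "A"))  # inserted A repeats the A at position 4
print(classify_insertion("GAAAC", 4, "T"))  # inserted T does not match the flank
```

Without access to the reference string, the function has nothing to compare against, so a tool in that position can only ever emit the ‘ins’ form.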
Another issue is that HGVS guidelines state that insertions and deletions should be reported at the most 3′ (rightmost) possible position. However, many variant callers report insertions at the 5′ (leftmost) position. As such, software which converts variant calls to HGVS notation often puts the insertions in the “wrong” place according to the HGVS guidelines. We have this dilemma with SVS. Because our software sits downstream from variant callers, we have to make a difficult decision. Do we honor the variant caller or honor the HGVS guidelines? Perhaps it would be more “correct” to honor the HGVS guidelines, but what about confusing customers who are trying to figure out why their HGVS insertion positions don’t agree with their VCF file?
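The 3′ rule can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how SVS or any particular variant caller works: the names shift_insertion_3prime and apply_insertion are hypothetical, the reference is a plain string, and only simple insertions are handled.

```python
from typing import Tuple


def apply_insertion(ref: str, pos: int, inserted: str) -> str:
    """Return the sequence produced by inserting between 1-based pos and pos+1."""
    return ref[:pos] + inserted + ref[pos:]


def shift_insertion_3prime(ref: str, pos: int, inserted: str) -> Tuple[int, str]:
    """Move an insertion as far 3' (rightward) as possible.

    While the reference base just after the insertion point equals the first
    inserted base, the insertion can slide one base to the right (rotating the
    inserted bases) without changing the resulting sequence.
    """
    while pos < len(ref) and ref[pos] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        pos += 1
    return pos, inserted


# A left-aligned call places the extra A between positions 1 and 2 of GAAAC.
pos, ins = shift_insertion_3prime("GAAAC", 1, "A")
print(pos, ins)  # the insertion now sits between positions 4 and 5
```

Both placements yield the same altered sequence, GAAAAC, which is why neither the caller’s position nor the HGVS position is wrong in any biological sense. They are simply different conventions.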
There are also ambiguities when trying to decide the correct notation for protein sequence changes. Consider a hypothetical gene with the following short coding DNA sequence:
Ref: ATG GAT ACA AGA CTT TAG
Ref: Met Asp Thr Arg Leu *
If a deletion of 6 nucleotides at positions 10 through 15 occurs (c.10_15del), we get an altered protein with the last two amino acids removed, which is described as “p.Arg4_Leu5del”:
Alt1: ATG GAT ACA TAG c.10_15del
Alt1: Met Asp Thr * p.Arg4_Leu5del
Now, if instead the A at position 10 gets changed to a T (c.10A>T), we get an early stop codon:
Alt2: ATG GAT ACA TGA CTT TAG c.10A>T
Alt2: Met Asp Thr * p.Arg4*
Note that this produces the exact same protein as the deletion case. How do we describe this change? The HGVS guidelines state that “Descriptions at protein level should describe the changes observed on protein level and not try to incorporate any knowledge regarding the change at DNA-level.” So it seems to me it should have the exact same description “p.Arg4_Leu5del.” However, HGVS examples suggest that “p.Arg4*” is preferred. The HGVS spec appears to be in conflict with itself in this regard. By representing the same protein change in two different ways, they are by definition leaking through some additional information about the underlying DNA-level change. HGVS frameshift notation suffers from the same issue. It’s not possible to determine if a frameshift occurs without examining the underlying DNA mutation, which in turn leaks some information about the underlying DNA change.
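The claim that both DNA changes yield the same protein is easy to check. The Python sketch below uses a deliberately partial codon table covering only the codons in this hypothetical gene; the names CODONS and translate are my own for illustration.

```python
from typing import List

# Partial codon table: only the codons that appear in this example.
CODONS = {
    "ATG": "Met", "GAT": "Asp", "ACA": "Thr", "AGA": "Arg",
    "CTT": "Leu", "TAG": "*", "TGA": "*",
}


def translate(cds: str) -> List[str]:
    """Translate a coding sequence codon by codon, stopping at the first stop."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        residue = CODONS[cds[i:i + 3]]
        protein.append(residue)
        if residue == "*":
            break
    return protein


ref = "ATGGATACAAGACTTTAG"
alt1 = ref[:9] + ref[15:]        # c.10_15del
alt2 = ref[:9] + "T" + ref[10:]  # c.10A>T

print(translate(alt1) == translate(alt2))  # both give Met Asp Thr *
```

A tool that sees only the two translated proteins has no basis for choosing between “p.Arg4_Leu5del” and “p.Arg4*”. The choice can only come from the DNA-level change.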
Discrepancies Turn into Bugs
I could go on listing more examples of ambiguities, but I think you get the idea. I realize that these ambiguities and discrepancies aren’t a big deal when trying to describe a variant in a journal article (which is what HGVS nomenclature was designed to do). However, when converting guidelines into working software, these ambiguities and discrepancies turn into bugs. One programmer chooses one way to resolve the ambiguity, and some other programmer working on another project might choose a different alternative. As a result, they produce software that appears to be compatible, but in reality will produce incorrect results when used together. This is the reason I get nervous whenever I need to work with variants represented in HGVS format.
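Here is a small Python sketch of how that plays out, using the GAAC to GTTC example from earlier. Both describer functions are hypothetical stand-ins for two programmers’ choices. Each produces a defensible description, yet a database populated by one is invisible to a query generated by the other.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def describe_as_delins(ref: str, alt: str, start: int, end: int) -> str:
    """Programmer 1: always report a multi-base change as a delins."""
    return f"g.{start}_{end}delins{alt[start - 1:end]}"


def describe_preferring_inv(ref: str, alt: str, start: int, end: int) -> str:
    """Programmer 2: report an inversion when the bases are a reverse complement."""
    if alt[start - 1:end] == reverse_complement(ref[start - 1:end]):
        return f"g.{start}_{end}inv"
    return describe_as_delins(ref, alt, start, end)


# Project 1 populates the database; project 2 generates the query.
database = {describe_as_delins("GAAC", "GTTC", 2, 3)}
query = describe_preferring_inv("GAAC", "GTTC", 2, 3)

print(query in database)  # the same variant is not found by string comparison
```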
Again, I hope I don’t sound overly critical towards the HGVS. Describing variants is a tricky subject, and the HGVS has gone a long way in improving the accuracy and consistency of variant reporting. But there is still work to be done. For every case where a variant can be represented in multiple ways, there should be a well-defined canonical form that is preferred over the others. Even better would be if the HGVS could define the exact process that a person (or computer) has to follow when converting a sequence variation into HGVS format. I’ve been working towards better HGVS compliance in our software, but I’m having a difficult time deciding how to handle all of the use cases we encounter. HGVS provides many helpful examples of their notation, but a large collection of examples won’t resolve all ambiguities. Having a well-defined algorithm for generating HGVS notation would help ensure that various software implementations all arrive at the same result.
So, what should researchers who encounter HGVS notation keep in mind? A single variant may have many possible descriptions using HGVS notation, and it isn’t always clear which is preferred. Even in the cases where the guidelines are clear about the preferred notation, not all software and databases follow the guidelines. Furthermore, the guidelines are still evolving, so HGVS descriptions that are considered correct today may not be so in the future. I advise researchers to be especially cautious when comparing variant nomenclature from two different sources.