There are many approaches that one might use to define a variant as potentially deleterious. For example, we often see analysis workflows based on rare, non-synonymous variants, perhaps incorporating additional annotation sources that capture known or predicted consequences of coding variants. Annotations for coding regions of the genome are relatively abundant and familiar to genome scientists. We are comfortable in our ability to interpret SIFT scores, for example. Despite our collective comfort with exome analysis, there is no question that many of the secrets of the genome are found elsewhere. VarSeq users often ask about annotation sources to assist in the interpretation of non-coding regions. dbscSNV is a useful annotation source that may help you start to move beyond the coding exome by considering some of its nearest neighbors: splice site variants.
dbscSNV captures data for splicing consensus regions: the conserved sequences around each intron/exon boundary that guide the process of splicing exons together to create mRNA. The database covers an 11-base region near the 5’ end of each intron (the “donor” site) and a 14-base region near the 3’ end (the “acceptor” site). Variations in the conserved sequence motifs at splice sites may result in abnormal gene splicing, leading to gene silencing or other detrimental effects. dbscSNV is a database of pre-computed prediction scores for all possible SNVs that may occur in splice consensus regions, giving an indication of whether each variant is expected to affect the splicing of the gene.
dbscSNV provides two scores for each variant, called the “Ada” and “RF” scores. These are ensemble scores, derived from the outputs of several machine learning algorithms. Both are scaled from 0 to 1, and higher values indicate a greater probability that the variant will alter the splicing of the gene. The developers suggest using 0.6 as the threshold for a binary classification of variants as splice-altering or not.
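The thresholding described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how VarSeq implements it: the function names are hypothetical, treating `.` or an empty field as a missing score is an assumption about the input, and so is the choice of a strict `>` comparison at the 0.6 cutoff.

```python
from typing import Dict, Optional

# Cutoff suggested by the dbscSNV developers, as quoted in the text above.
SPLICE_THRESHOLD = 0.6


def parse_score(raw: str) -> Optional[float]:
    """Parse a raw score field; '' and '.' are treated as missing (an assumption)."""
    raw = raw.strip()
    if raw in ("", "."):
        return None
    return float(raw)


def is_splice_altering(score: Optional[float],
                       threshold: float = SPLICE_THRESHOLD) -> bool:
    """Return True when a score is present and strictly above the threshold.

    Using '>' rather than '>=' is an assumption of this sketch.
    """
    return score is not None and score > threshold


def classify(ada_raw: str, rf_raw: str) -> Dict[str, bool]:
    """Apply the same cutoff to both ensemble scores and report each call."""
    ada = is_splice_altering(parse_score(ada_raw))
    rf = is_splice_altering(parse_score(rf_raw))
    return {"ada_damaging": ada, "rf_damaging": rf, "either": ada or rf}


print(classify("0.95", "0.41"))  # Ada above the cutoff, RF below it
print(classify(".", "."))        # variant outside the scored regions
```

A variant with no score is never called damaging here, which mirrors the fact that only variants in the splicing consensus regions receive scores at all.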
I used VarSeq to apply dbscSNV annotations to variants from the whole-genome sequence for the standard reference sample NA12878. Out of nearly 4.9M variants in the sample, 3384 variants occur in the splicing consensus regions and have Ada scores from dbscSNV. The Ada score predicts that 151 of these will be damaging based on the suggested threshold. The predicted damaging variants tend to be rare, which may be expected given the conserved nature of splice site sequences. Overall, about 12% of scored variants have an allele frequency below 0.005 in the 1000 Genomes Project, but 43% of variants classified as damaging are similarly rare. Most of the prediction scores tend to be near 0 or 1, as seen in Figure 1.
Figure 1: VarSeq screenshot showing distribution of Ada splice-altering prediction scores for rare variants (1000 Genomes frequency < 0.005) in splice consensus regions.
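For readers who want to reproduce the rarity comparison outside VarSeq, the calculation might look like the sketch below. The record layout (`ada_score`, `af_1kg`) and the toy values are hypothetical, and treating a variant that is absent from 1000 Genomes as having a frequency of 0 is an assumption you may want to change.

```python
from typing import Dict, List, Optional

Variant = Dict[str, Optional[float]]


def rare_fraction(variants: List[Variant], max_af: float = 0.005) -> float:
    """Fraction of variants with a 1000 Genomes allele frequency below max_af."""
    if not variants:
        return 0.0
    # Assumption: a missing frequency means the variant was not observed (AF = 0).
    rare = sum(1 for v in variants if (v.get("af_1kg") or 0.0) < max_af)
    return rare / len(variants)


def summarize(variants: List[Variant], threshold: float = 0.6) -> Dict[str, float]:
    """Compare rarity among all Ada-scored variants and among predicted-damaging ones."""
    scored = [v for v in variants if v.get("ada_score") is not None]
    damaging = [v for v in scored if v["ada_score"] > threshold]
    return {
        "n_scored": len(scored),
        "n_damaging": len(damaging),
        "rare_among_scored": rare_fraction(scored),
        "rare_among_damaging": rare_fraction(damaging),
    }


# Toy records, invented purely for illustration.
toy = [
    {"ada_score": 0.99, "af_1kg": 0.001},
    {"ada_score": 0.02, "af_1kg": 0.30},
    {"ada_score": 0.75, "af_1kg": 0.12},
    {"ada_score": None, "af_1kg": 0.0005},  # not in a consensus region
    {"ada_score": 0.10, "af_1kg": 0.002},
    {"ada_score": 0.05, "af_1kg": 0.45},
]
print(summarize(toy))
```

Run against a real export of the annotated variants, the two fractions would correspond to the 12% and 43% figures quoted above.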
The RF scores are quite similar to the Ada scores. A total of 158 variants are classified as damaging by at least one of the two methods. Figure 2 shows a VarSeq filter chain that is configured to select just those variants. The figure also shows the classification distribution of those variants as determined by VarSeq. Many of the variants are in categories that VarSeq considers to be missense or loss-of-function (LOF) variants; these variants would likely be identified by standard filtering workflows. But nearly half of the variants are in categories not usually considered to be loss-of-function, including intronic, intergenic, synonymous and splice-region variants. These variants might be overlooked in standard filtering workflows.
Figure 2: This VarSeq screenshot shows that 158 splice region variants are predicted to be damaging by either the RF (n=121) or Ada (n=151) method. Of those variants, 22 are classified as synonymous, intronic, or intergenic, and would probably not be included in standard filtering workflows.
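The counts in the Figure 2 caption also say how much the two methods agree. Assuming the three numbers describe the same set of variants, inclusion-exclusion gives the overlap; the helper below is a hypothetical sketch of that arithmetic.

```python
from typing import Dict


def overlap_counts(n_a: int, n_b: int, n_union: int) -> Dict[str, int]:
    """Split a union of two sets into 'both', 'a_only' and 'b_only' counts.

    Uses |A and B| = |A| + |B| - |A or B| and rejects inconsistent inputs.
    """
    n_both = n_a + n_b - n_union
    if n_both < 0 or n_both > min(n_a, n_b):
        raise ValueError("counts are not consistent with two overlapping sets")
    return {"both": n_both, "a_only": n_a - n_both, "b_only": n_b - n_both}


# Counts taken from the Figure 2 caption: Ada = 151, RF = 121, either = 158.
counts = overlap_counts(151, 121, 158)
print(counts)  # {'both': 114, 'a_only': 37, 'b_only': 7}
```

Under that assumption, 114 variants are flagged by both methods, 37 by Ada alone, and 7 by RF alone, which fits the observation that the two scores are quite similar.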
The dbscSNV annotations can help VarSeq users identify potentially deleterious variants that won’t always be captured in exome filtering workflows. The VarSeq filter chain may be configured to select splice-altering variants as an endpoint (similar to the example in Figure 2), or it may be configured to capture them together with other categories of variants (as seen in Figure 3). As always, the user has ultimate control of how the filters are configured.
Figure 3: Screenshot of a VarSeq filter chain, configured to select splice-altering variants together with rare functional variants. Of the 4,876,259 starting variants, 4,657,491 pass the read depth filter (filter details are hidden). Those variants are then passed on to a set of three parallel filters to assess potential deleterious effects: 1) 136 splice variants are predicted damaging by Ada; 2) 106 splice variants are predicted damaging by the RF method; and 3) 2109 variants pass the filter for rare missense or LOF variants. A total of 2219 variants are selected by at least one of the three filters, and are passed on to the ensuing Zygosity filter (filter details are hidden).
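The shape of the Figure 3 chain, a sequential filter followed by three parallel filters combined with OR, can be mimicked in plain Python. This is a rough sketch and not VarSeq's own mechanism: the field names, the effect labels, and the toy records are hypothetical, and because the read depth filter's settings are hidden in the figure, the minimum depth used below is an arbitrary placeholder.

```python
from typing import Any, Callable, Dict, List, Tuple

Variant = Dict[str, Any]
Filter = Callable[[Variant], bool]


def score_above(key: str, threshold: float = 0.6) -> Filter:
    """Keep variants whose score under `key` is present and above the threshold."""
    return lambda v: v.get(key) is not None and v[key] > threshold


def min_depth(depth: int) -> Filter:
    """Keep variants with read depth of at least `depth`."""
    return lambda v: v["depth"] >= depth


def rare_functional(max_af: float = 0.005) -> Filter:
    """Keep rare missense or LOF variants (effect labels are hypothetical)."""
    return lambda v: v["effect"] in {"missense", "lof"} and v["af_1kg"] < max_af


def any_of(*filters: Filter) -> Filter:
    """Combine parallel filters: a variant passes if at least one accepts it."""
    return lambda v: any(f(v) for f in filters)


def run_chain(variants: List[Variant],
              stages: List[Tuple[str, Filter]]) -> Tuple[List[Variant], List[Tuple[str, int]]]:
    """Apply stages in order and record how many variants survive each one."""
    counts = []
    current = list(variants)
    for name, keep in stages:
        current = [v for v in current if keep(v)]
        counts.append((name, len(current)))
    return current, counts


toy = [  # invented records for illustration only
    {"id": "v1", "depth": 35, "ada_score": 0.98, "rf_score": 0.91, "effect": "synonymous", "af_1kg": 0.001},
    {"id": "v2", "depth": 40, "ada_score": None, "rf_score": None, "effect": "missense", "af_1kg": 0.002},
    {"id": "v3", "depth": 5, "ada_score": 0.99, "rf_score": 0.99, "effect": "intronic", "af_1kg": 0.0},
    {"id": "v4", "depth": 50, "ada_score": 0.10, "rf_score": 0.20, "effect": "intronic", "af_1kg": 0.25},
    {"id": "v5", "depth": 28, "ada_score": 0.45, "rf_score": 0.72, "effect": "splice_region", "af_1kg": 0.01},
]
stages = [
    ("read depth", min_depth(10)),  # 10 is an arbitrary placeholder value
    ("splice or rare functional",
     any_of(score_above("ada_score"), score_above("rf_score"), rare_functional())),
]
selected, counts = run_chain(toy, stages)
print([v["id"] for v in selected], counts)
```

Because the parallel stage is a union, a variant flagged by more than one filter is counted once, which is why the 2219 total in Figure 3 is smaller than the sum of the three individual counts.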
Golden Helix devotes a great amount of time to curating and maintaining genomic annotations for use in our VarSeq, SVS and GenomeBrowse software packages. We hope that you find them helpful in your work. We expect that the annotation library will continue to grow. Please contact us if you have questions about any of our current annotation sources or if there is a data source that you would like to see included in the repository.