
We are excited to announce the addition of CADD 1.7 as an annotation track in VarSeq. The latest release of CADD incorporates several new annotations into its model, which improves variant scoring in two main areas: protein-coding variants, through the integration of Meta ESM scores, and non-coding variants.
Meta AI Evolutionary Scale Model
One of the most exciting updates in CADD 1.7 is the integration of scores from ESM-1v, a member of Meta’s Evolutionary Scale Model (ESM) family of protein language models built for variant effect prediction. The model was trained directly on protein sequence databases and is used to capture the functional effects of several mutation types, including amino acid substitutions, in-frame insertions/deletions, and loss-of-function mutations. It is designed to understand protein function at a molecular-to-atomic level, making it one of the most accurate tools available for variant effect prediction. By incorporating ESM-1v scores, CADD 1.7 significantly enhances its ability to assess the functional impact of protein-coding variants.
Improved Coverage of Non-Coding Variants
CADD 1.7 also delivers improved scores for non-coding mutations, including variants in intergenic regions, introns, and untranslated regions (UTRs). This improvement is driven by the integration of several new annotations:
- Zoonomia: This resource provides conservation data across more than 200 mammalian species, offering insight into functional regions of the genome. Because conservation can be measured even where little else is known, it is one of the most important predictors in sparsely annotated non-coding regions such as introns and intergenic regions.
- APARENT2: A deep learning model that quantifies the potential of 3’ UTR variants to disrupt alternative polyadenylation, a key mechanism in post-transcriptional gene regulation.
- RegSeq: A convolutional neural network (CNN) model trained on open chromatin sequences as a proxy for regulatory regions of the genome.
The inclusion of these annotations greatly improves CADD’s performance in scoring non-coding variants. Such variants are often overlooked, yet they can have significant impacts on gene regulation and splicing. With CADD 1.7, researchers can better prioritize and interpret them, which makes it an essential tool for whole-genome sequencing (WGS) analysis.
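For readers who want to see what score-based prioritization looks like outside of VarSeq, here is a minimal Python sketch. It assumes a tab-separated file in the layout used by standalone CADD score files (columns `#Chrom`, `Pos`, `Ref`, `Alt`, `RawScore`, `PHRED`). The function name `rank_by_phred`, the example rows, and the cutoff of 15 are hypothetical choices for illustration only, not a recommended threshold and not how VarSeq applies the track internally.

```python
import csv
import io

# Hypothetical excerpt in the CADD score-file layout; all values are invented.
EXAMPLE_TSV = """\
## Example data, not real CADD output
#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED
1\t10001\tT\tA\t0.12\t4.1
1\t20500\tG\tC\t2.85\t22.7
2\t30420\tC\tT\t1.40\t15.3
"""


def rank_by_phred(handle, min_phred=15.0):
    """Return variants with PHRED >= min_phred, highest score first."""
    # Drop '##' comment lines; keep the '#Chrom' header line but strip its '#'.
    lines = (line.lstrip("#") for line in handle if not line.startswith("##"))
    reader = csv.DictReader(lines, delimiter="\t")
    hits = [row for row in reader if float(row["PHRED"]) >= min_phred]
    return sorted(hits, key=lambda row: float(row["PHRED"]), reverse=True)


ranked = rank_by_phred(io.StringIO(EXAMPLE_TSV))
for row in ranked:
    print(row["Chrom"], row["Pos"], row["Ref"], row["Alt"], row["PHRED"])
```

A higher PHRED-scaled score means the variant ranks closer to the most deleterious end of all possible substitutions, so sorting in descending order puts the strongest candidates first. Any cutoff should be chosen to suit the analysis at hand.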
Performance Improvements
Overall, CADD 1.7 performs very similarly to CADD 1.6, but it shows clear advantages for missense and UTR variants thanks to the incorporation of specialized annotations. For these variants, CADD 1.7 outperforms not only its predecessor but also domain-specific models such as ESM-1v and APARENT2.
Conclusion
With the integration of CADD 1.7 into VarSeq’s annotation library, users gain a powerful tool for variant prioritization. This update is particularly valuable for identifying potentially clinically relevant non-coding variants, which are often overlooked in traditional analyses. If you have any questions about CADD 1.7 or how to incorporate this annotation into your VarSeq projects, please reach out to us.