Evaluating DeepMind’s AlphaMissense Classifier

         October 20, 2023

Last month, researchers at Google DeepMind announced the release of AlphaMissense, a new missense prediction algorithm that leverages the protein structure prediction model AlphaFold to distinguish between benign and pathogenic missense variants (Cheng et al., 2023). AlphaFold predicts protein structures from amino acid sequences; during the development of AlphaMissense, the model was fine-tuned to identify variants commonly seen in human and primate populations. Variants predicted to be common are classified as benign, while variants predicted to be novel are classified as pathogenic. The authors report that AlphaMissense achieves 90% precision when used to classify a large subset of the ClinVar database.

While these results are impressive, we wanted to see for ourselves how this new algorithm stacks up against the other in-silico prediction algorithms currently available in VarSeq's vast annotation library. We compared the predictions provided by AlphaMissense to those of 17 other missense prediction algorithms, using a subset of the ClinVar database as a benchmark dataset. In our experiments, AlphaMissense achieves an impressive balanced accuracy of 91.5% but is outperformed by BayesDel across all metrics.

Experimental Design

We compared AlphaMissense to the following in-silico prediction algorithms currently available in VarSeq:

  1. BayesDel (Feng, 2017)
  2. REVEL (Ioannidis et al., 2016)
  3. MetaSVM (Dong et al., 2015)
  4. MetaLR (Dong et al., 2015)
  5. MetaRNN (Li et al., 2022)
  6. DEOGEN2 (Raimondi et al., 2017)
  7. PROVEAN (Choi et al., 2012)
  8. FATHMM (Shihab et al., 2013)
  9. FATHMM-XF (Rogers et al., 2018)
  10. FATHMM-MKL (Shihab et al., 2015)
  11. PolyPhen-2 HVAR (Adzhubei et al., 2013)
  12. LRT (Chun et al., 2009)
  13. SIFT (Ng et al., 2003)
  14. LIST S2 (Malhis et al., 2020)
  15. PrimateAI (Sundaram et al., 2018)
  16. MutationTaster (Schwarz et al., 2010)
  17. CADD (Kircher et al., 2014)

For AlphaMissense, we classified variants as pathogenic if the score exceeded 0.564, and classified variants as benign if the score was below 0.34, as recommended by the authors. For CADD, variants were classified based on the raw scores using a threshold of 5 for a pathogenic classification and a threshold of 2 for a benign classification. For REVEL, we used a threshold of 0.5, which was the threshold that produced the best performance, according to Ioannidis et al. (2016). For all other prediction algorithms, we used the thresholds specified in the dbNSFP Function Predictions annotation source (Liu et al., 2020).
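As a concrete illustration, the AlphaMissense threshold logic described above might be sketched as follows. This is a minimal sketch: the function name and the "ambiguous" label for scores falling between the two cutoffs are our own, not part of the published tool.

```python
# Author-recommended AlphaMissense cutoffs: scores above 0.564 are
# classified as pathogenic, scores below 0.34 as benign, and scores
# in between are left unclassified (labeled "ambiguous" here).
PATHOGENIC_CUTOFF = 0.564
BENIGN_CUTOFF = 0.34

def classify_alphamissense(score: float) -> str:
    """Map an AlphaMissense score to a classification label."""
    if score > PATHOGENIC_CUTOFF:
        return "pathogenic"
    if score < BENIGN_CUTOFF:
        return "benign"
    return "ambiguous"
```

The same pattern applies to the other algorithms, with each using its own pathogenic and benign cutoffs.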

The performance of each algorithm was evaluated using the ClinVar repository of variant interpretations (Landrum et al., 2016). Variant classifications in ClinVar are contributed by both clinical and research laboratories, as well as by expert panels of medical professionals. Variants in ClinVar are classified using a five-tier system, and each variant is assigned a review status indicating the depth of evidence behind its interpretation, denoted by values ranging from 1 to 4 stars:

  1. The submission comes from a single submitter or multiple submitters with conflicting interpretations.
  2. Concurring interpretations are provided by at least two submitters with accompanying evidence.
  3. The interpretations have been reviewed by an expert panel.
  4. The ClinGen Steering Committee has evaluated and confirmed the evidence and assertion criteria, establishing its alignment with practice guidelines.

We restricted our analysis to high-quality ClinVar missense variants with a review status of at least 2 stars that had been classified as either Pathogenic (P), Likely Pathogenic (LP), Benign (B), or Likely Benign (LB). The truth set for ClinVar was defined as follows:

  • Positives: Variants that are classified as Pathogenic (P) or Likely Pathogenic (LP)
  • Negatives: Variants that are classified as Benign (B) or Likely Benign (LB)
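This filtering step can be sketched with a few lines of Python. Note that the field names and example records below are hypothetical; they simply illustrate the selection criteria described above.

```python
# Build the ClinVar truth set: keep 2+ star missense variants with a
# P/LP/B/LB classification. P/LP variants are positives (1) and
# B/LB variants are negatives (0); everything else is excluded.
POSITIVE = {"Pathogenic", "Likely Pathogenic"}
NEGATIVE = {"Benign", "Likely Benign"}

def build_truth_set(variants):
    truth = {}
    for v in variants:
        if v["stars"] < 2 or not v["is_missense"]:
            continue
        if v["classification"] in POSITIVE:
            truth[v["id"]] = 1
        elif v["classification"] in NEGATIVE:
            truth[v["id"]] = 0
    return truth

# Hypothetical example records:
variants = [
    {"id": "v1", "stars": 2, "is_missense": True, "classification": "Pathogenic"},
    {"id": "v2", "stars": 3, "is_missense": True, "classification": "Likely Benign"},
    {"id": "v3", "stars": 1, "is_missense": True, "classification": "Benign"},  # excluded: 1 star
    {"id": "v4", "stars": 2, "is_missense": True, "classification": "Uncertain Significance"},  # excluded
]
```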

Each classification provided by the prediction algorithms was categorized as either a True Positive, False Positive, True Negative, or False Negative as follows:

  • True Positive (TP): Classified as P/LP in ClinVar and classified as P by the algorithm.
  • False Positive (FP): Classified as B/LB in ClinVar but classified as P by the algorithm.
  • True Negative (TN): Classified as B/LB in ClinVar and classified as B by the algorithm.
  • False Negative (FN): Classified as P/LP in ClinVar but classified as B by the algorithm.
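The categorization above reduces to a small tally over (truth, prediction) pairs. In this sketch, truth labels are 1 for P/LP and 0 for B/LB, and predictions are "P" or "B"; the function name is our own.

```python
from collections import Counter

# Tally confusion-matrix categories from (truth, prediction) pairs,
# where truth is 1 (P/LP in ClinVar) or 0 (B/LB), and the algorithm
# predicts "P" (pathogenic) or "B" (benign).
def tally_confusion(pairs):
    counts = Counter()
    for truth, pred in pairs:
        if truth == 1:
            counts["TP" if pred == "P" else "FN"] += 1
        else:
            counts["FP" if pred == "P" else "TN"] += 1
    return counts
```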

The number of variants falling into these categories was then used to compute the sensitivity, specificity, and balanced accuracy for each algorithm as follows:

  Sensitivity = TP / (TP + FN)
  Specificity = TN / (TN + FP)
  Balanced Accuracy = (Sensitivity + Specificity) / 2

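As a sanity check, these metrics can be computed directly from the confusion counts; plugging in the LIST S2 row from Table 2 reproduces its reported figures.

```python
# Standard confusion-matrix metrics.
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2

# LIST S2 counts from Table 2: TP=11,899  FN=1,728  TN=14,475  FP=7,201
tp, fn, tn, fp = 11_899, 1_728, 14_475, 7_201
print(f"{sensitivity(tp, fn):.1%}")                # 87.3%
print(f"{specificity(tn, fp):.1%}")                # 66.8%
print(f"{balanced_accuracy(tp, fn, tn, fp):.1%}")  # 77.0%
```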
Specific details about the ClinVar benchmark dataset are shown in the table below, including the total number of variants with each classification. This dataset was constructed using the ClinVar 2023-10-05 annotation track and contains a total of 35,468 variants.

  Total Variants in ClinVar 2023-08-03                      2,254,932
  Variants with Classification of P/LP/B/LB                 1,070,385
  2+ Star Variants with Required Classification               191,149
  Missense Variants with Required Classification/Status        35,468
      Pathogenic                                        4,315 (12.2%)
      Likely Pathogenic                                 9,309 (26.2%)
      Likely Benign                                    14,017 (39.5%)
      Benign                                            7,827 (22.1%)

Table 1: ClinVar Benchmark Dataset

The sensitivity, specificity, and balanced accuracy statistics for each algorithm are shown in Table 2, with all algorithms sorted by balanced accuracy.

  Algorithm   TP       FN      TN       FP      Sensitivity  Specificity  Balanced Accuracy
  LIST S2     11,899   1,728   14,475   7,201   87.3%        66.8%        77.0%

Table 2: Benchmark Metrics for All Algorithms

Using ClinVar as a benchmark, BayesDel achieves the highest balanced accuracy (95.5%), while performing extremely well in terms of both sensitivity and specificity, with all metrics exceeding 94%. AlphaMissense has the second-highest balanced accuracy at 91.5%, while also maintaining high sensitivity and specificity metrics of 88.6% and 94.4%, respectively. REVEL also performed well, achieving the third-highest balanced accuracy, with all metrics exceeding 90%. While AlphaMissense achieves slightly higher specificity and balanced accuracy than REVEL, the overall performance between these two algorithms is comparable.


In this blog post, we compared AlphaMissense to 17 other in-silico prediction algorithms, using ClinVar as a benchmark dataset. Of these algorithms, BayesDel, AlphaMissense, and REVEL demonstrated the highest overall performance, with all three algorithms achieving high balanced accuracy without sacrificing either sensitivity or specificity. These results are consistent with the findings of Tian et al., who also compared REVEL and BayesDel to a number of other prediction algorithms and found that these two algorithms outperformed competing methods in terms of both positive and negative predictive value (Tian et al., 2019).

While AlphaMissense demonstrated excellent performance, it was outperformed by BayesDel across all metrics. The superior performance of BayesDel is likely attributable to the algorithm’s incorporation of population frequency into its model, which allows it to more easily identify benign variants. The performance of AlphaMissense relative to BayesDel is impressive, given that it does not directly incorporate any population frequency information.

Golden Helix recently released a new annotation track containing the complete set of AlphaMissense predictions for all single nucleotide missense variants from 19,000 protein-coding genes for both GRCh37 and GRCh38 coordinates. These missense predictions make an excellent addition to VarSeq’s already extensive collection of functional prediction scores. I hope this blog post helped to give you an idea of how the AlphaMissense predictions compare to other in-silico prediction algorithms. If you have any questions about functional predictions in VarSeq, please don’t hesitate to reach out to us at support@goldenhelix.com.


  1. Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T., … & Avsec, Ž. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, eadg7492.
  2. Feng, B. J. (2017). PERCH: a unified framework for disease gene prioritization. Human Mutation, 38(3), 243-251.
  3. Ioannidis, N. M., Rothstein, J. H., Pejaver, V., Middha, S., McDonnell, S. K., Baheti, S., … & Sieh, W. (2016). REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. The American Journal of Human Genetics, 99(4), 877-885.
  4. Dong, C., Wei, P., Jian, X., Gibbs, R., Boerwinkle, E., Wang, K., & Liu, X. (2015). Comparison and integration of deleteriousness prediction methods for nonsynonymous SNVs in whole exome sequencing studies. Human Molecular Genetics, 24(8), 2125-2137.
  5. Li, C., Zhi, D., Wang, K., & Liu, X. (2022). MetaRNN: differentiating rare pathogenic and rare benign missense SNVs and InDels using deep learning. Genome Medicine, 14(1), 115.
  6. Raimondi, D., Tanyalcin, I., Ferté, J., Gazzo, A., Orlando, G., Lenaerts, T., … & Vranken, W. (2017). DEOGEN2: prediction and interactive visualization of single amino acid variant deleteriousness in human proteins. Nucleic Acids Research, 45(W1), W201-W206.
  7. Choi, Y., Sims, G. E., Murphy, S., Miller, J. R., & Chan, A. P. (2012). Predicting the functional effect of amino acid substitutions and indels. PLoS One, 7(10), e46688.
  8. Shihab, H. A., Gough, J., Cooper, D. N., Stenson, P. D., Barker, G. L., Edwards, K. J., … & Gaunt, T. R. (2013). Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. Human Mutation, 34(1), 57-65.
  9. Rogers, M. F., Shihab, H. A., Mort, M., Cooper, D. N., Gaunt, T. R., & Campbell, C. (2018). FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics, 34(3), 511-513.
  10. Shihab, H. A., Rogers, M. F., Gough, J., Mort, M., Cooper, D. N., Day, I. N., … & Campbell, C. (2015). An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics, 31(10), 1536-1543.
  11. Adzhubei, I., Jordan, D. M., & Sunyaev, S. R. (2013). Predicting functional effect of human missense mutations using PolyPhen-2. Current Protocols in Human Genetics, 76(1), 7-20.
  12. Chun, S., & Fay, J. C. (2009). Identification of deleterious mutations within three human genomes. Genome Research, 19(9), 1553-1561.
  13. Ng, P. C., & Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Research, 31(13), 3812-3814.
  14. Malhis, N., Jacobson, M., Jones, S. J., & Gsponer, J. (2020). LIST-S2: taxonomy based sorting of deleterious missense mutations across species. Nucleic Acids Research, 48(W1), W154-W161.
  15. Sundaram, L., Gao, H., Padigepati, S. R., McRae, J. F., Li, Y., Kosmicki, J. A., … & Farh, K. K. H. (2018). Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics, 50(8), 1161-1170.
  16. Schwarz, J. M., Rödelsperger, C., Schuelke, M., & Seelow, D. (2010). MutationTaster evaluates disease-causing potential of sequence alterations. Nature Methods, 7(8), 575-576.
  17. Kircher, M., Witten, D. M., Jain, P., O'Roak, B. J., Cooper, G. M., & Shendure, J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics, 46(3), 310-315.
  18. Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations for human nonsynonymous and splice-site SNVs. Genome Medicine, 12(1), 1-8.
  19. Landrum, M. J., Lee, J. M., Benson, M., Brown, G., Chao, C., Chitipiralla, S., … & Maglott, D. R. (2016). ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Research, 44(D1), D862-D868.
