A DNA language model based on multispecies alignment predicts the effects of genome-wide variants
- PMID: 39747647
- DOI: 10.1038/s41587-024-02511-w
Abstract
Protein language models have demonstrated remarkable performance in predicting the effects of missense variants, but DNA language models have not yet shown a competitive edge for complex genomes such as that of humans. This limitation is particularly evident when dealing with the vast complexity of noncoding regions that comprise approximately 98% of the human genome. To tackle this challenge, we introduce GPN-MSA (genomic pretrained network with multiple-sequence alignment), a framework that leverages whole-genome alignments across multiple species while taking only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC and OMIM), experimental functional assays (deep mutational scanning and DepMap) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and noncoding variants. We provide precomputed scores for all ~9 billion possible single-nucleotide variants in the human genome. We anticipate that our advances in genome-wide variant effect prediction will enable more accurate rare disease diagnosis and improve rare variant burden testing.
© 2025. The Author(s), under exclusive licence to Springer Nature America, Inc.
Conflict of interest statement
Competing interests: The authors declare no competing interests.
Update of
GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv [Preprint]. 2024 Apr 6:2023.10.10.561776. doi: 10.1101/2023.10.10.561776. Update in: Nat Biotechnol. 2025 Dec;43(12):1960-1965. doi: 10.1038/s41587-024-02511-w. PMID: 37873118. Free PMC article.
