Review
Bioinform Biol Insights. 2025 Sep 2;19:11779322251358314. doi: 10.1177/11779322251358314. eCollection 2025.

Language Modelling Techniques for Analysing the Impact of Human Genetic Variation

Megha Hegde et al. Bioinform Biol Insights. 2025.

Abstract

Interpreting the effects of variants within the human genome and proteome is essential for analysing disease risk, predicting medication response, and developing personalised health interventions. Because of the intrinsic similarities between the structure of natural languages and genetic sequences, natural language processing techniques have demonstrated great applicability in computational variant effect prediction. In particular, the advent of the Transformer has led to significant advancements in the field. However, transformer-based models are not without their limitations, and a number of extensions and alternatives have been developed to improve results and enhance computational efficiency. This systematic review examines more than 50 language modelling approaches to computational variant effect prediction from the past decade, analysing the main architectures and identifying key trends and future directions. Benchmarking of the reviewed models remains unachievable at present, primarily due to the lack of shared evaluation frameworks and data sets.

Keywords: Variant effect prediction; evolution of language models; genomics; large language models; small language models.

Conflict of interest statement

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
Illustration of coding vs non-coding DNA, and an SNP in a promoter region, for a eukaryotic cell. Non-coding DNA contains regulatory elements, such as promoters and transcription factor binding sites. Promoters drive the initiation of transcription. Other cis-regulatory elements (CREs) include enhancers and silencers, which positively and negatively regulate gene expression, respectively. Insulators are an additional type of CRE, which interact with nearby CREs and can block distal enhancers or regulate chromatin interactions. Source: Created in BioRender. Hegde (2024) https://BioRender.com/e16b233.
Figure 2.
Generic language modelling pipeline, including the main categories of tasks covered in this review. The DNA, RNA, or protein sequences are tokenised before being input to the model. The model is initially pre-trained on a large corpus of data, and then fine-tuned on a data set specific to the planned downstream tasks, eg, variant pathogenicity classification. Source: Icons from BioRender https://app.biorender.com/.
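
As a minimal sketch of the pipeline in Figure 2 (the sequence, k-mer size, vocabulary, and model dimensions below are illustrative assumptions, not taken from any reviewed paper), a DNA sequence can be tokenised into overlapping k-mers and passed through a small transformer encoder with a classification head for a task such as variant pathogenicity prediction:

import torch
import torch.nn as nn

def kmer_tokenise(seq, k=6):
    # Split a DNA sequence into overlapping k-mers, one common tokenisation scheme for DNA.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy sequence and vocabulary; a real pipeline would use a fixed, corpus-wide vocabulary.
sequence = "ATGCGTACGTTAGCATG"
tokens = kmer_tokenise(sequence)
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = torch.tensor([[vocab[t] for t in tokens]])  # shape: (1, n_tokens)

d_model = 32
encoder = nn.Sequential(
    nn.Embedding(len(vocab), d_model),  # stands in for a large pre-trained encoder
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2),
)
head = nn.Linear(d_model, 2)  # fine-tuned head, e.g. benign vs pathogenic logits

hidden = encoder(token_ids)         # (1, n_tokens, d_model)
logits = head(hidden.mean(dim=1))   # mean-pool over tokens, then classify

In practice, the encoder would be pre-trained on a large unlabelled sequence corpus and only then fine-tuned, together with the head, on labelled variants.
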
Figure 3.
Timeline of models from 1980 until the development of the transformer. Classical ML refers to classical machine learning techniques such as support vector machines and Naive Bayes. FFNN, feed-forward neural network; CNN, convolutional neural network; LSTM, long short-term memory. Markov models are often used to construct grammars.
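
For the pre-transformer era shown here, an n-gram (Markov) language model over nucleotides can be written in a few lines; this is a toy illustration of the general idea rather than any specific model from the review:

from collections import Counter, defaultdict

def train_ngram(seq, n=3):
    # Estimate P(next base | previous n-1 bases) by counting n-grams in the sequence.
    counts = defaultdict(Counter)
    for i in range(len(seq) - n + 1):
        context, nxt = seq[i:i + n - 1], seq[i + n - 1]
        counts[context][nxt] += 1
    return {ctx: {base: c / sum(ctr.values()) for base, c in ctr.items()}
            for ctx, ctr in counts.items()}

model = train_ngram("ATGCGATGCATGCGT", n=3)
print(model.get("AT", {}))  # {'G': 1.0}: every 'AT' in the toy sequence is followed by 'G'
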
Figure 4.
Timeline of developments in NLP since 2017.
Figure 5.
Comparison of the self-attention mechanism and alternatives. (A) Scaled dot-product attention, as shown in Avsec et al. The attention mechanism is applied simultaneously to a set of queries Q, with keys K and values V. Hence, the output matrix is computed as: Attention(Q, K, V) = softmax(QK^T / √d_k)V. MatMul = matrix multiplication. The Mask between the Scale and Softmax is used only in the decoder to preserve the auto-regressive property, by preventing the flow of data from right to left. (B) Multi-head attention, as shown in Avsec et al. The presence of h heads indicates that h attention layers run in parallel. (C) Hyena operator of order N, as shown in Nguyen et al. Combinations of dense layers and convolutions are applied to the input; the resulting projections are then fed to the element-wise gate layers. An MLP is used to implicitly parameterise the long convolutions, hence producing the convolutional filters. x indicates the input. (D) Mamba operator, adapted from Gu and Dao. The Mamba operator combines a state space model (SSM) with an MLP. x indicates the input. For the activation function σ, either a sigmoid linear unit or Swish is used.
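
For reference, the scaled dot-product attention in panel (A) can be implemented directly from the formula above; this is a generic sketch rather than the implementation used in any particular reviewed model:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with an optional decoder-style mask.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5          # (batch, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))   # block right-to-left information flow
    return F.softmax(scores, dim=-1) @ V

# Toy usage: one sequence of 5 tokens with d_k = 8.
Q = K = V = torch.randn(1, 5, 8)
causal = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)  # True above the diagonal
out = scaled_dot_product_attention(Q, K, V, mask=causal)             # (1, 5, 8)

Multi-head attention in panel (B) simply runs h such attention layers in parallel on learned projections of Q, K, and V and concatenates the results.
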
Figure 6.
Transformer architectures. (A) High-level representation of the encoder-decoder architecture comprising the vanilla transformer architecture. The encoder encodes the input sequence into a representation, which is stored as a latent state. The decoder decodes this representation into an output sequence. This is passed into the linear and softmax layers to produce the output predictions. (B) Detailed transformer architecture, adapted from Vaswani et al. The multi-head attention modules consist of multiple self-attention modules used in parallel. These are stacked with fully connected layers to create an encoder-decoder model as shown in (A). (C) Encoder-only transformer architecture, adapted from DNABERT. (D) Decoder-only transformer architecture, adapted from GPT-1.
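
The practical difference between the encoder-only and decoder-only variants in panels (C) and (D) is largely the attention mask; the sketch below illustrates this with a generic PyTorch encoder and is not the actual DNABERT or GPT-1 implementation:

import torch
import torch.nn as nn

d_model, n_tokens = 32, 10
x = torch.randn(1, n_tokens, d_model)  # already-embedded toy input

# Encoder-only (DNABERT-style): every position attends to every other position.
layers = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
bidirectional_out = layers(x)

# Decoder-only (GPT-style): a causal mask limits attention to earlier positions,
# which supports auto-regressive, left-to-right generation.
causal_mask = torch.triu(torch.full((n_tokens, n_tokens), float("-inf")), diagonal=1)
autoregressive_out = layers(x, mask=causal_mask)

The two families also differ in training objective: masked-token prediction for encoder-only models versus next-token prediction for decoder-only models.
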
Figure 7.
Input sequence length, number of parameters, and training time for models that have reported these statistics in the original papers. (A) Maximum input sequence length (x-axis) and number of parameters (y-axis) as reported in the original papers for each model. The model names are indicated on the chart. There is no clear trend over time. Compared with the majority of transformer-based models, Caduceus, a Mamba-based model, has far fewer parameters and can handle longer input sequences. (B) Training time in GPU hours for state-of-the-art LLMs. GPU hours = number of hours × number of GPUs. In general, the training time required for LLMs has increased over the years. However, DNABERT and ESM-1b are outliers with very high training times, likely because both are foundation models trained on very large data sets. GPN-MSA is another outlier, with a particularly low training time, likely due to the use of retrieval augmented processing to increase computational efficiency.
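
As a worked example of the GPU-hours metric used in panel (B) (the figures below are made up for illustration, not taken from the reviewed papers):

def gpu_hours(wall_clock_hours, n_gpus):
    # GPU hours = wall-clock training time in hours x number of GPUs used.
    return wall_clock_hours * n_gpus

print(gpu_hours(24, 8))       # 8 GPUs for one day  -> 192 GPU hours
print(gpu_hours(24 * 7, 64))  # 64 GPUs for one week -> 10752 GPU hours
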
Figure 8.
Heatmap adapted from that produced in Luo et al using the Deep SHAP method. Evaluation was done on 5 independent data sets, each for a different cell line. The y-axis denotes the data set, while the x-axis denotes the nucleotide position. The colours indicate the importance of the nucleotide position towards the predicted class label; the legend is shown on the right-hand side. 1 and −1 respectively indicate a significant positive or negative contribution. A key element of the CRISPR-Cas9 DNA editing system is the single-guide RNA sequence consisting of a 20-nucleotide protospacer and a 3-nucleotide protospacer adjacent motif (PAM) sequence. The ‘N’, ‘G’, and ‘G’ positions represent the PAM sequence, which consists of any 1 nucleotide (N) followed by 2 guanines (GG).
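
A Deep SHAP attribution of this kind can be produced with the shap package's DeepExplainer; the model, input shapes, and random data below are placeholders and do not reproduce the model or cell-line data of Luo et al:

import torch
import torch.nn as nn
import shap

# Placeholder CNN over one-hot-encoded 23-nt guide+PAM sequences, shape (batch, 4, 23).
model = nn.Sequential(
    nn.Conv1d(4, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 23, 2),
)
background = torch.randn(50, 4, 23)  # reference set used by Deep SHAP
test_seqs = torch.randn(5, 4, 23)    # sequences to explain

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(test_seqs)  # per-class, per-position attributions
# Summing the absolute SHAP values over the 4 one-hot channels yields one importance
# score per nucleotide position, which can be drawn as a heatmap like the one above.
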
Figure 9.
Analysis of the number of published papers and the number of annual citations for the highest-impact papers. (A) Number of papers published per year on language models for variant effect prediction, as described in Tables 1, 2, and 4. Neural LM refers to neural language models (Table 1); LLM refers to both transformer-based and post-transformer models (Tables 2 and 4). During the period 2018 to 2024, the overall number of papers per year generally increased, with a slight decrease from 2023 to 2024. The number of LLM papers has far exceeded the number of neural LM papers each year. (B) Number of citations per year for the most impactful papers. The number of citations per year for these papers has steadily increased since their publication.
