Brief Bioinform. 2024 Mar 27;25(3):bbae178.
doi: 10.1093/bib/bbae178.

Scoring alignments by embedding vector similarity

Sepehr Ashrafzadeh et al. Brief Bioinform.

Abstract

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but they have the drawback of being fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being context dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning to leverage enormous amounts of unlabelled data and generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
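As an illustration of the definition above, the E-score of two residues can be sketched as the cosine similarity of their embedding vectors, and a matrix of such scores can then stand in for a fixed substitution matrix in a standard dynamic-programming alignment. The sketch below is not the authors' implementation: the function names, the toy two-dimensional vectors, the linear gap penalty and its value of -0.25 are all assumptions made for illustration. Real embeddings would come from a protein language model such as ProtT5 and would have many more dimensions.

```python
import math


def e_score(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def e_score_matrix(emb_x, emb_y):
    """Pairwise E-scores: one row per residue of x, one column per residue of y."""
    return [[e_score(u, v) for v in emb_y] for u in emb_x]


def global_alignment_score(scores, gap=-0.25):
    """Needleman-Wunsch score over a position-specific score matrix.

    Assumption: a linear gap penalty is used here for brevity; the
    gap model of the published tool is not stated in the abstract.
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + scores[i - 1][j - 1],  # align residue i with residue j
                dp[i - 1][j] + gap,                       # gap in sequence y
                dp[i][j - 1] + gap,                       # gap in sequence x
            )
    return dp[n][m]


# Toy per-residue embeddings (hypothetical values, two residues per sequence).
toy_x = [[1.0, 0.0], [0.0, 1.0]]
toy_y = [[1.0, 0.0], [0.0, 1.0]]
toy_scores = e_score_matrix(toy_x, toy_y)
toy_alignment_score = global_alignment_score(toy_scores)
```

Because each residue's embedding depends on the sequence it occurs in, the score matrix is built per pair of sequences rather than looked up per pair of amino acid types, which is the contextual dependence the abstract describes.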

Keywords: alignment distance; amino acid scoring matrices; sequence alignment; sequence similarity; word embedding.

Figures

Figure 1
Cosine similarity of two vectors is the cosine of the angle between them: (A) similar vectors: angle close to 0°, cosine close to 1; (B) orthogonal (independent) vectors: angle of 90°, cosine 0; (C) opposite vectors: angle of 180°, cosine -1.
Figure 2
Heatmaps of (A) the BLOSUM45 matrix (rescaled) and three aligned matrices of average E-scores for the NBD_sugar-kinase_HSP70_actin MSA: (B) ProtT5-score, (C) ESM2-score and (D) ProtAlbert-score.
Figure 3
Pearson correlations between 5 BLOSUM matrices and 12 E-score matrices. For each embedding method considered, the aligned and unaligned matrices for the NBD_sugar-kinase_HSP70_actin MSA are included.
Figure 4
(A) The ratio between the performance of BLOSUM45 and ProtT5-score, measured as the distance from each method's alignment to the reference. The results are sorted by increasing MSA length. A ratio higher than 1 (above the line at 1) indicates that ProtT5-score is better. (B) The percentage of cases in which ProtT5-score and BLOSUM45, respectively, produce the better result, sorted by increasing MSA length. Results for the same distance are plotted in the same colour, with a solid line for ProtT5-score and a dashed line for BLOSUM45.
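
The Pearson correlations between scoring matrices mentioned for Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it flattens both matrices entry by entry (the captions do not say whether all entries or only one triangle were used), and the function names and the small 2x2 matrices are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def matrix_correlation(a, b):
    """Correlate two same-shaped matrices entry by entry (assumed flattening)."""
    flat_a = [value for row in a for value in row]
    flat_b = [value for row in b for value in row]
    return pearson(flat_a, flat_b)


# Hypothetical 2x2 matrices; real scoring matrices would be 20x20.
m1 = [[1.0, 2.0], [3.0, 4.0]]
m2 = [[2.0, 4.0], [6.0, 8.0]]   # m1 scaled by 2
m3 = [[4.0, 3.0], [2.0, 1.0]]   # m1 with entries in reverse order
corr_scaled = matrix_correlation(m1, m2)
corr_reversed = matrix_correlation(m1, m3)
```

Pearson correlation is unchanged by a positive rescaling of either matrix, so a BLOSUM matrix can be compared with an E-score matrix without first bringing the two onto the same range.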
