Proc Natl Acad Sci U S A. 2023 Feb 28;120(9):e2211823120.
doi: 10.1073/pnas.2211823120. Epub 2023 Feb 24.

Improved global protein homolog detection with major gains in function identification

Mesih Kilinc et al. Proc Natl Acad Sci U S A. 2023.

Abstract

There are several hundred million known protein sequences, but existing homolog detection methods do not fully capture the relationships among them. An improved method is needed to push homolog detection to lower levels of sequence identity. The method used here relies on a language model to represent each protein numerically as a matrix (an embedding) and uses discrete cosine transforms to compress the data, extracting the most essential part and significantly reducing the data size. This PRotein Ortholog Search Tool (PROST) is significantly faster, with linear runtimes, and, most importantly, computes the distances between pairs of protein sequences to yield homologs at significantly lower levels of sequence identity than was previously possible. The extent of allosteric effects in proteins points to the importance of global aspects of structure and sequence. PROST excels at global homology detection but not at detecting local homologs. Results are validated by strong similarities between the corresponding pairs of structures. The number of remote homologs detected increased significantly, pushing effective sequence matches more deeply into the twilight zone. For human protein sequences that presently have no assigned function, PROST finds significant numbers of putative homologs in 93% of cases and structurally verified assigned functions for 76.4% of these cases. The data compression enables massive searches for homologs with short search times while yielding significant gains in the numbers of remote homologs detected. The method is sufficiently efficient to permit whole-genome/proteome comparisons. The PROST web server is accessible at https://mesihk.github.io/prost.
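The compress-then-compare idea in the abstract can be illustrated with a short sketch. This is not the PROST implementation: the function names, the size of the retained coefficient block, and the use of summed absolute differences as the distance are assumptions made for illustration, and random matrices stand in for language-model embeddings.

```python
import numpy as np

def dct_matrix(n):
    """Return the orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # position index (columns)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] /= np.sqrt(2.0)  # rescale the constant row to unit norm
    return basis

def compress(embedding, rows, cols):
    """2D DCT of a (length, dim) embedding, keeping the low-frequency block.

    The output always has shape (rows, cols), so proteins of different
    lengths become directly comparable. Shorter inputs are zero-padded.
    """
    length, dim = embedding.shape
    coeffs = dct_matrix(length) @ embedding @ dct_matrix(dim).T
    out = np.zeros((rows, cols))
    block = coeffs[:rows, :cols]
    out[:block.shape[0], :block.shape[1]] = block
    return out

def l1_distance(a, b):
    """Sum of absolute element-wise differences (an assumed distance)."""
    return float(np.abs(a - b).sum())

# Hypothetical stand-ins for per-residue embeddings of two proteins.
rng = np.random.default_rng(0)
protein_a = rng.normal(size=(120, 32))
protein_b = rng.normal(size=(95, 32))

rep_a = compress(protein_a, rows=5, cols=8)
rep_b = compress(protein_b, rows=5, cols=8)
print(rep_a.shape, rep_b.shape, l1_distance(rep_a, rep_b))
```

Because every protein is reduced to a fixed-size matrix, comparing a query against a database costs the same per entry regardless of sequence length, which is consistent with the linear runtimes reported above.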

Keywords: function identification; homolog; protein language models; proteins; sequence search.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
The PROST architecture and parameter optimization. (A) The PROST architecture. A protein sequence is fed into the ESM1b language model to obtain embeddings, which are then reduced to maximize accuracy for remote homolog detection. Two representations are carried through: every protein is represented by two different matrices at two different compression levels, chosen during optimization (SI Appendix). (B) An example: a PROST search for the HPO30 protein (which has no previously known human homolog) against the SwissProt database. The PROST distance distribution for nonhomologs is similar to a normal distribution. Putative homologous proteins are outliers in this distribution. Robust z-scores with Bonferroni corrections are used to calculate the expectation value of finding such a homolog by chance. The CLRN3 protein was found as a putative homolog for HPO30. (C) Sequence alignment and structures of HPO30 and CLRN3, showing validation by similarity in structures. Global alignment with the ProtSub matrix (12) yields a 21.5% sequence identity. Helices are colored red; beta sheets are colored blue. Structures are from AlphaFold2 predictions (29), with HPO30 on the right and CLRN3 on the left. (D) Visualizations of PROST representations for HPO30, CLRN3, and the differences between them. The sum of all elements in the difference matrices gives the PROST distance, in this case 3975, as shown in panel B.
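The scoring step in panel B (robust z-scores followed by a Bonferroni correction) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the median/MAD form of the robust z-score, the one-sided normal tail, the function names, and the synthetic distances are all assumptions.

```python
import math
import numpy as np

def robust_z(distances):
    """Median/MAD z-scores; 1.4826 makes the MAD match sigma for normal data."""
    d = np.asarray(distances, dtype=float)
    med = np.median(d)
    mad = np.median(np.abs(d - med))
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    return (d - med) / (1.4826 * mad)

def e_values(distances):
    """Lower-tail normal p-values scaled by the number of database entries.

    Small distances mean similar proteins, so the lower tail is used.
    Multiplying by the number of comparisons is the Bonferroni step.
    """
    z = robust_z(distances)
    p = np.array([0.5 * math.erfc(-zi / math.sqrt(2.0)) for zi in z])
    return p * len(z)

# Synthetic database: 1001 background distances plus one unusually close hit.
background = 5000.0 + 300.0 * np.linspace(-2.0, 2.0, 1001)
distances = np.append(background, 2000.0)

e = e_values(distances)
best = int(np.argmin(e))
print(best, e[best] < 1e-3)
```

Using the median and MAD rather than the mean and standard deviation keeps the background estimate stable even when true homologs sit far out in the lower tail.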
Fig. 2.
ROC plots for the max50 benchmarking dataset. This dataset contains proteins with at most 50 undefined amino acids between the defined regions, constraining the homology test to global homology detection. The plots show the overall performance of the tested methods as true-positive and false-positive rates. Curves are ranked by their performance over the first 1,000 false positives, as measured by the AUC1000 metric. ROC plots for each database, Pfam (A), Gene3D (B), and Superfamily (C), are shown separately. Overall, PROST is the best tool on this dataset, producing the highest AUC1000 scores. Comparing the ESM1b results against the optimized PROST results demonstrates the importance of the optimization developed here.
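A truncated ROC area of the kind described above can be computed as in the following sketch. It assumes the common ROC_n definition (the average fraction of true positives ranked ahead of each of the first n false positives); whether the benchmark uses exactly this normalization is an assumption, and the function name and toy data are hypothetical.

```python
def roc_n(distances, labels, n):
    """Area under the ROC curve up to the first n false positives.

    distances: smaller means more similar, so hits are ranked ascending.
    labels: truthy for true homolog pairs, falsy otherwise.
    Ties in distance are broken by input order.
    """
    total_pos = sum(1 for label in labels if label)
    if total_pos == 0 or total_pos == len(labels):
        raise ValueError("need at least one positive and one negative")
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    tp = fp = area = 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
            area += tp  # true positives ranked ahead of this false positive
            if fp == n:
                break
    # Normalize by the false positives actually seen (fewer than n if the
    # list runs out) and by the number of positives, giving a value in [0, 1].
    return area / (fp * total_pos)

print(roc_n([0.1, 0.2, 0.3, 0.4], [1, 0, 1, 0], n=2))  # 0.75
```

Restricting the area to the first n false positives rewards methods that place true homologs at the very top of the ranking, which is the part of the list a user actually inspects.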
Fig. 3.
Examples of putative homolog prediction by PROST, validated by structural similarity. Several cases of human proteins that currently have no GO annotations (Top structure in each box) together with hits found by PROST (Bottom). Structures are AlphaFold2 (29) predictions unless otherwise noted. Structural alignments were done with the TM-Align tool. A TM-Score ≥ 0.5 indicates the same structural fold (38). Sequence alignments were done using the ProtSub substitution matrix (12) with a gap opening penalty of 5 and an extension penalty of 1. Identical residues are shown in bold. Helices are colored red; beta sheets are colored blue. (A) Human EOLA2 and Zymomonas mobilis subsp. ASCH domain-containing ribonuclease (PDB ID: 5GUQ). (B) Human C14orf28 and Pongo abelii Ubl carboxyl-terminal hydrolase 18. (C) Human methyltransferase-like 26 and Cereibacter sphaeroides phosphatidylethanolamine N-methyltransferase. (D) Human chromosome 20 open reading frame 204 (C20orf204) and Protopterus annectens somatotropin. (E) Alignment of human EOLA2 and Zymomonas mobilis subsp. ASCH domain-containing ribonuclease proteins. Sequence identity is 20.5%. (F) Alignment of human C14orf28 and Pongo abelii Ubl carboxyl-terminal hydrolase 18 proteins. Sequence identity is 21.7%. (G) Alignment of human methyltransferase-like 26 and Cereibacter sphaeroides phosphatidylethanolamine N-methyltransferase proteins. Sequence identity is 23.7%. (H) Alignment of human C20orf204 and somatotropin. Sequence identity is 22.3%.
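The sequence identities quoted in panels E through H are percentages derived from a pairwise alignment. A minimal sketch of that calculation follows. It assumes identity is counted over all alignment columns, which is only one of several conventions in use, so the denominator may differ from the one behind the figures above. The function name and the toy aligned sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Both inputs are rows of one pairwise alignment (equal length, gaps
    written as `gap`). Columns that are gaps in both rows are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b)
        if not (x == gap and y == gap)
    ]
    if not columns:
        raise ValueError("alignment has no usable columns")
    identical = sum(1 for x, y in columns if x == y and x != gap)
    return 100.0 * identical / len(columns)

# Toy alignment: 5 identical columns out of 8.
print(percent_identity("MKT-LVAG", "MRTALV-G"))  # 62.5
```

Identities in the low 20% range, as reported for these pairs, fall in the twilight zone where sequence alignment alone is weak evidence, which is why the caption pairs them with structural alignment scores.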
Fig. 4.
Two example cases: one with no clear structural similarity (A) and one with low sequence identity (B). Human chromosome 11 open reading frame 53 (C11orf53) was an uncharacterized protein in March 2022. A recent paper (39) showed experimentally that it is a transcriptional coactivator of POU2F3 and plays a role in the generation of the tuft cell lineage. The putative homolog found by PROST is POU2AF1, a transcriptional coactivator that associates with POU2F1 and POU2F2. C11orf53 has a function similar to that of POU2AF1, validating the homology relationship. PROST also identifies a putative homolog for human MKRN2 opposite strand protein (H3BPM6) with a 16.4% sequence identity. (A) Structures predicted by AlphaFold2 (29) for C11orf53 (Left) and POU2AF1 (Right) show no clear similarity. (B) Structures predicted by AlphaFold2 (29) for H3BPM6 (Left) and Q8BVA2 (Right). The alignment of these structures has a FATCAT p-value of 2.91e-03 and a TM-Score of 0.47. (C) Sequence alignment of C11orf53 and its putative homolog POU2AF1, with 22.2% sequence identity. (D) Sequence alignment of human MKRN2 opposite strand protein (H3BPM6) and Q8BVA2, with 16.4% sequence identity.

References

    1. Needleman S. B., Wunsch C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
    2. Smith T. F., Waterman M. S., Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
    3. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J., Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
    4. Pearson W. R., Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63–98 (1990).
    5. Rost B., Twilight zone of protein sequence alignments. Protein Eng. Des. Sel. 12, 85–94 (1999).
