Proc Natl Acad Sci U S A. 2021 Apr 13;118(15):e2016239118.
doi: 10.1073/pnas.2016239118

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences


Alexander Rives et al. Proc Natl Acad Sci U S A. 2021.

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
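
As a concrete illustration of the unsupervised objective described in the abstract, the following is a minimal sketch of a masked language modeling training step for protein sequences in PyTorch. The toy vocabulary, masking rate, and Transformer hyperparameters are illustrative stand-ins, not the configuration used in the paper.

    import torch
    import torch.nn as nn

    # Toy vocabulary: the 20 standard amino acids plus padding and mask tokens.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    PAD = len(AMINO_ACIDS)
    MASK = len(AMINO_ACIDS) + 1
    VOCAB_SIZE = len(AMINO_ACIDS) + 2

    class ProteinMaskedLM(nn.Module):
        def __init__(self, d_model=128, nhead=4, num_layers=2):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, d_model, padding_idx=PAD)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

        def forward(self, tokens):
            hidden = self.encoder(self.embed(tokens))  # contextual representations
            return self.lm_head(hidden), hidden        # logits over residues, features

    def masked_lm_loss(model, tokens, mask_rate=0.15):
        # Corrupt a random subset of residues and predict them from their context.
        targets = tokens.clone()
        maskable = tokens != PAD
        masked = (torch.rand(tokens.shape) < mask_rate) & maskable
        if not masked.any():  # guard for tiny toy batches
            masked = maskable
        corrupted = tokens.masked_fill(masked, MASK)
        logits, _ = model(corrupted)
        return nn.functional.cross_entropy(logits[masked], targets[masked])

    def encode(seq, length):
        ids = [AMINO_ACIDS.index(a) for a in seq]
        return ids + [PAD] * (length - len(ids))

    # Toy usage: one batch of two padded sequences.
    batch = torch.tensor([encode("MKTAYIAK", 12), encode("GAVLIPFM", 12)])
    model = ProteinMaskedLM()
    loss = masked_lm_loss(model, batch)
    loss.backward()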

Keywords: deep learning; generative biology; protein language model; representation learning; synthetic biology.

Conflict of interest statement

Competing interest statement: A.R., J. Meier, S.G., D.G., M.O., C.L.Z., J. Ma, and R.F. are coinventors on a US patent application relating to the work of this manuscript.

Figures

Fig. 1.
Biochemical properties of amino acids are represented in the Transformer model’s output embeddings, visualized here with t-SNE. Through unsupervised learning, residues are clustered into hydrophobic, polar, and aromatic groups and reflect overall organization by molecular weight and charge. Visualization of 36-layer Transformer trained on UniParc.
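
A rough sketch of how a visualization like this could be reproduced with the publicly released esm package, used here as a stand-in for the 36-layer UniParc Transformer in the figure; the embed_tokens attribute and the t-SNE settings are assumptions, not details given in the caption.

    import torch
    import esm
    from sklearn.manifold import TSNE

    # Pretrained ESM-1b model from the public facebookresearch/esm package.
    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    model.eval()

    amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
    token_ids = torch.tensor([alphabet.get_idx(aa) for aa in amino_acids])

    # Embedding vectors for the 20 standard amino-acid tokens
    # (embed_tokens is an assumption about the model's internals).
    with torch.no_grad():
        embeddings = model.embed_tokens(token_ids).numpy()

    # Project to 2D with t-SNE; perplexity must stay below the 20 points.
    coords = TSNE(n_components=2, perplexity=5, init="pca",
                  random_state=0).fit_transform(embeddings)
    for aa, (x, y) in zip(amino_acids, coords):
        print(f"{aa}\t{x:.2f}\t{y:.2f}")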
Fig. 2.
Protein sequence representations encode and organize biological variations. (A) Each point represents a gene, and each gene is colored by the orthologous group it belongs to (dimensionality is reduced by t-SNE). Orthologous groups of genes are densely clustered in the trained representation space. By contrast, the untrained representation space and unigram representations do not reflect strong organization by evolutionary relationships. (B) Genes corresponding to a common biological variation are related linearly in the trained representation space. Genes are colored by their orthologous group, and their species are indicated by a character label. PCA recovers a species axis (horizontal) and orthology axis (vertical) in the trained representation space but not in the untrained or unigram spaces. Representations are from the 36-layer Transformer model trained on UniParc.
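
A minimal sketch of the kind of analysis in this figure, assuming a per-sequence vector is formed by averaging per-residue representations and then projected with PCA; the arrays below are random stand-ins for representations extracted from the trained model.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # Stand-in data: 10 sequences of varying length, 1280-dim residue features.
    per_residue_reps = [rng.normal(size=(rng.integers(50, 300), 1280))
                        for _ in range(10)]

    # Mean-pool over residues to obtain one vector per gene/sequence.
    seq_reps = np.stack([r.mean(axis=0) for r in per_residue_reps])

    # Two principal components, e.g. candidate species and orthology axes.
    coords = PCA(n_components=2).fit_transform(seq_reps)
    print(coords.shape)  # (10, 2)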
Fig. 3.
Final representations from trained models implicitly align sequences. Cosine similarity distributions are depicted for the final representations of residues from sequences within Pfam family PF01010. The differences between the aligned (dark blue) and unaligned (light blue) distributions imply that the trained Transformer representations are a powerful discriminator between aligned and unaligned positions in the sequences. In contrast, representations prior to training do not separate the aligned (dark red) and unaligned positions (light red). (A) Overall distribution; distribution under constraint that residue pairs have (B) same amino acid identity; or (C) different amino acid identities. AUCs across 128 Pfam families are reported in SI Appendix, Table S1.
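
A minimal sketch of the underlying measurement, assuming aligned and unaligned residue pairs are taken from a Pfam multiple sequence alignment: score each pair by the cosine similarity of its residue representations and summarize discrimination with AUC. The arrays and pair lists below are toy stand-ins.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    rng = np.random.default_rng(0)
    reps_a = rng.normal(size=(100, 1280))  # residue representations, sequence A
    reps_b = rng.normal(size=(120, 1280))  # residue representations, sequence B

    aligned_pairs = [(i, i) for i in range(50)]                 # from the MSA (toy)
    unaligned_pairs = [(i, (i + 37) % 120) for i in range(50)]  # background pairs (toy)

    scores = [cosine(reps_a[i], reps_b[j]) for i, j in aligned_pairs + unaligned_pairs]
    labels = [1] * len(aligned_pairs) + [0] * len(unaligned_pairs)
    print("AUC:", roc_auc_score(labels, scores))  # about 0.5 on random stand-in data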
Fig. 4.
Secondary structure (linear projections). Example predictions for held-out folds. Unsupervised pretraining encodes secondary structure into representations. Following pretraining, linear projections recover secondary structure (left column). Without pretraining, little information is recovered (right column). (A) d1nt4a_ Phosphoglycerate mutase-like fold; (B) d3wr7a_ Acyl-CoA N-acyltransferases fold. Colors indicate secondary structure class identified by the projection: helix (red), strand (green), and coil (blue). Color intensities indicate confidence. Representations from ESM-1b Transformer are used.
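
A minimal sketch of such a linear projection, with a multinomial logistic regression standing in for the probe and random arrays standing in for per-residue ESM-1b representations and three-class secondary-structure labels; in the paper, evaluation uses structurally held-out folds rather than a random split.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(5000, 1280))  # per-residue representations (stand-in)
    y_train = rng.integers(0, 3, size=5000)  # 0 = helix, 1 = strand, 2 = coil
    X_test = rng.normal(size=(1000, 1280))
    y_test = rng.integers(0, 3, size=1000)

    # The "linear projection": a single linear map from features to classes.
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", probe.score(X_test, y_test))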
Fig. 5.
Residue–residue contacts (linear projections). (Left) Top-L predictions for fold level held-out example d1n3ya_, with vWA-like fold. True positives in blue, false positives in red, superimposed on ground truth contact map in gray. ESM-1b Transformer projections below the diagonal, CCMpred predictions above the diagonal. (Right) Precision distribution (top-L long-range) comparing ESM-1b projections with CCMpred across all domains in the five test partitions with structural holdout at the fold level. Visualized domain marked by ×.
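
A minimal sketch of the evaluation metric named here, top-L long-range precision, assuming the usual definition of long range as sequence separation of at least 24 residues; the predicted and true contact maps are random stand-ins.

    import numpy as np

    def top_l_long_range_precision(pred, contacts, min_sep=24):
        # Rank long-range residue pairs by predicted score and take the top L.
        L = pred.shape[0]
        i, j = np.triu_indices(L, k=min_sep)
        order = np.argsort(-pred[i, j])[:L]
        return contacts[i[order], j[order]].mean()

    rng = np.random.default_rng(0)
    L = 150
    pred = rng.random((L, L))                             # predicted contact scores
    contacts = (rng.random((L, L)) < 0.02).astype(float)  # true contact map (toy)
    print("top-L long-range precision:", top_l_long_range_precision(pred, contacts))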
Fig. 6.
Relationship between the language modeling objective and structure learning. Eight-class secondary structure prediction accuracy (Left) and contact prediction top-L long-range precision (Right) both as a function of pretraining ECE. Performance is evaluated on held-out folds. Linear projections are fit using model checkpoints over the course of pretraining on UR50/S. The linear relationship for each model indicates that for a given model and pretraining dataset, the language modeling ECE is a good proxy for the structural content of the representations. Improvement of the model’s ECE leads to an increase in information about structure. This establishes a link between the language modeling objective and unsupervised structure learning.
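
For reference, ECE (exponentiated cross entropy) is the exponential of the model's mean per-residue language modeling loss: a perfect model approaches 1, and a uniform predictor approaches the size of the amino-acid vocabulary. A minimal sketch of the computation:

    import torch
    import torch.nn.functional as F

    def ece(logits, targets):
        # logits: (num_masked_positions, vocab); targets: (num_masked_positions,)
        nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood (nats)
        return torch.exp(nll).item()

    # Sanity check: uniform logits over a 25-token vocabulary give ECE of about 25.
    logits = torch.zeros(1000, 25)
    targets = torch.randint(0, 25, (1000,))
    print(ece(logits, targets))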
Fig. 7.
Representation learning enables state-of-the-art supervised prediction of the quantitative effect of mutations. (Left) Envision dataset (65). (Right) DeepSequence dataset (26). Transformer representations (34-layer, UR50/S) are compared to the LSTM bidirectional language model (large model, UR50/S). The result of fivefold cross validation is reported for each protein. For each partition, supervised fine-tuning is performed on 80% of the mutational data for the protein, and results are evaluated on the remaining 20%. Transformer representations outperform baseline LSTM representations on both datasets. State-of-the-art methods are also shown for each dataset. Gray et al. (65) is a supervised method using structural, evolutionary, and biochemical features, trained with the same protocol as used for the Transformer. Riesselman et al. (26) is an unsupervised method trained on the MSA of each protein. Mean and SD across the five partitions for Transformer model and LSTM baseline.
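
A minimal sketch of the cross-validation protocol described in the caption, with a ridge regression on sequence representations standing in for the paper's fine-tuning procedure and Spearman correlation used as the score; the feature matrix and measured effects are random stand-ins.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1280))  # representation of each mutant sequence (stand-in)
    y = rng.normal(size=500)          # measured effect of each mutation (stand-in)

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])     # train on 80%
        rho, _ = spearmanr(model.predict(X[test_idx]), y[test_idx])  # score on 20%
        scores.append(rho)

    print("mean Spearman rho:", np.mean(scores), "+/-", np.std(scores))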

References

    1. Yanofsky C., Horn V., Thorpe D., Protein structure relationships revealed by mutational analysis. Science 146, 1593–1594 (1964).
    2. Altschuh D., Lesk A. M., Bloomer A. C., Klug A., Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol. 193, 693–707 (1987).
    3. Altschuh D., Vernet T., Berti P., Moras D., Nagai K., Coordinated amino acid changes in homologous protein families. Protein Eng. 2, 193–199 (1988).
    4. Göbel U., Sander C., Schneider R., Valencia A., Correlated mutations and residue contacts in proteins. Proteins 18, 309–317 (1994).
    5. Harris Z. S., Distributional structure. Word 10, 146–162 (1954).
