Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints

doi:10.1038/s41467-019-11994-0

Nat Commun. 2019 Sep 4;10(1):3977.

Joe G Greener^{1,2}, Shaun M Kandathil^{1,2}, David T Jones^{3,4}

Affiliations

¹ Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK.
² The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK.
³ Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK. d.t.jones@ucl.ac.uk.
⁴ The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK. d.t.jones@ucl.ac.uk.

PMID: 31484923
PMCID: PMC6726615
DOI: 10.1038/s41467-019-11994-0

Joe G Greener et al. Nat Commun. 2019.

Abstract

The inapplicability of amino acid covariation methods to small protein families has limited their use for structural annotation of whole genomes. Recently, deep learning has shown promise in allowing accurate residue-residue contact prediction even for shallow sequence alignments. Here we introduce DMPfold, which uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces more accurate models than two popular methods for a test set of CASP12 domains, and works just as well for transmembrane proteins. Applied to all Pfam domains without known structures, confident models for 25% of these so-called dark families were produced in under a week on a small 200 core cluster. DMPfold provides models for 16% of human proteome UniProt entries without structures, generates accurate models with fewer than 100 sequences in some cases, and is freely available.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Examples of DMPfold models. In each case the model is shown in orange and the native structure, if available, is shown in blue. a CASP12 FM domains. b A membrane protein from the FILM3 set. c Pfam families with available structures, used as a validation set. d CASP13 FM target T1010-D1, where DMPfold produced the best model at CASP13 (native structure not public). e Models displaying novel folds for Pfam families without structures

**Fig. 2**
DMPfold results on CASP12 FM domains compared to existing methods. a Distribution of TM-scores for the best of the top 5 models for each CASP12 FM domain. b Comparison of DMPfold and CONFOLD2 best of top 5 models. The dashed line indicates the point of equal quality models between the two methods, which both use CNS. c Similar to b but for Rosetta. d The change in TM-score and absolute distance error with DMPfold iterations for each domain. Domains are ordered by decreasing iteration 3 TM-score
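
The "best of the top 5" TM-score used in panels a to c can be illustrated with a minimal sketch. It assumes the standard TM-score definition, with per-residue distances already measured after superposition; a real TM-score program also searches for the superposition that maximises the score, which is omitted here. The function names and the 0.5 Å floor on d0 are illustrative choices, not taken from the DMPfold code.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the native (target) structure.
    """
    if target_length > 15:
        # Length-dependent scale; the 0.5 floor is an assumed guard for short chains.
        d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    else:
        d0 = 0.5
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def best_of_top(scores, n=5):
    """Best score among the first n ranked models."""
    return max(scores[:n])
```

For example, `best_of_top([0.31, 0.62, 0.48, 0.55, 0.40, 0.90])` returns `0.62`, because the sixth model falls outside the top 5.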

**Fig. 3**
Performance of DMPfold on transmembrane proteins. a Distribution of TM-scores for the FILM3 TMP dataset. One model is generated for each of the 28 proteins. The FILM3 results are the final refined models from the FILM3 paper. b The per-protein correlation of TM-scores. The dashed line indicates the point of equal quality models between the two methods

**Fig. 4**
DMPfold run on Pfam families. a Number of Pfam families at each stage of the analysis. Each set is a subset of the previous set. b The TM-scores using TM-align of generated models to the native structure for the validation set of Pfam families with available structures not used for DMPfold training. c Overlap of high confidence models provided by DMPfold with two other studies that generated models for Pfam families. d Comparison of models after refinement provided by ref. with our models where a native structure is available. e Comparison of high confidence models provided by ref. with our models where a native structure is available. These are not the same families as in d

**Fig. 5**
DMPfold predictions are robust to variations in MSA composition and sequence length. Evaluations are made on the Pfam validation set. a Correlation of TM-align score with alignment depth and effective sequence count N_eff, defined in the Methods section. The Pearson correlation coefficients are shown. DMPfold is able to generate accurate models for some Pfam families with fewer than 100 sequences in the sequence alignment. b There is little correlation of model accuracy with target sequence length. c In order to compare with ref. , we calculated the N_f with their criteria and plotted the mean model accuracy of models in bins of N_f values. N_f values were calculated as described in ref. , where an 80% identity threshold was used for clustering. Values were read off the graph in Fig. 2 of ref. and added here. It is important to note that the proteins used to obtain our values were different to theirs. It can be seen that DMPfold is effective at lower effective sequence counts
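
As a rough illustration of an effective sequence count with an 80% identity threshold, the sketch below weights each sequence by the inverse of the number of sequences at or above the threshold and, for N_f, divides by the square root of the alignment length. Both the weighting scheme and the normalisation are assumptions based on common practice; the exact definitions are in the Methods section and the cited reference, neither of which is reproduced here.

```python
import math


def identity(a, b):
    """Fraction of aligned columns in which two equal-length sequences match.

    Gap characters are compared like any other symbol (a simplification).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def effective_count(msa, threshold=0.8):
    """Sum of 1 / (number of sequences at or above the identity threshold)."""
    total = 0.0
    for s in msa:
        # The count includes s itself, so it is never zero.
        neighbours = sum(identity(s, t) >= threshold for t in msa)
        total += 1.0 / neighbours
    return total


def n_f(msa, threshold=0.8):
    """Effective count normalised by sqrt(alignment length) (assumed definition)."""
    return effective_count(msa, threshold) / math.sqrt(len(msa[0]))
```

With the toy alignment `["AAAAA", "AAAAC", "CCCCC"]`, the first two sequences share 80% identity and count as one cluster, so `effective_count` gives 2.0.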

**Fig. 6**
An example of model accuracy increasing after iterations. The model is for Pfam family PF13642. a Distance maps of the initial prediction, the prediction after iterations and the native structure. In each case the value at i, j is the centre of the distance bin with the maximum likelihood between residues i and j. b The change of the absolute error in distance from the initial prediction to the prediction after iterations. A negative value indicates an improvement with the iterations. c The improvement is shown on the structure. The native structure (PDB ID 2L6O) is in blue, the initial model is in orange and the model after iterations is in green. The loop region indicated in red throughout, and the following helix, are closer to the native structure in the prediction after iterations than the initial prediction
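
The two quantities in panels a and b can be sketched as follows: the distance map takes the centre of the most likely bin for each residue pair, and the error change subtracts the initial absolute error from the absolute error after iterations. The bin centres and the nested-list layout are hypothetical and chosen only for illustration.

```python
def distance_map(probs, bin_centres):
    """probs[i][j] holds per-bin likelihoods; return the most likely bin's centre."""
    return [
        [bin_centres[max(range(len(cell)), key=cell.__getitem__)] for cell in row]
        for row in probs
    ]


def error_change(initial, refined, native):
    """|refined - native| - |initial - native| per pair; negative means improvement."""
    n = len(native)
    return [
        [
            abs(refined[i][j] - native[i][j]) - abs(initial[i][j] - native[i][j])
            for j in range(n)
        ]
        for i in range(n)
    ]
```

A pair predicted at 8 Å initially and 6 Å after iterations, with a native distance of 5.5 Å, gives a change of 0.5 − 2.5 = −2.0 Å.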

**Fig. 7**
Coverage of proteomes by DMPfold models. Left: Pie chart showing the fraction of Pfam-annotated amino acid residues in a number of proteomes for which templates or PDB matches are available (blue). Of the remaining residues, the fractions covered by high-confidence DMPfold models (red) and lower-confidence models (orange) are marked. Right: Pie chart of UniProt entries in several proteomes. Entries are first split into whether or not they are (at least partially) covered by PDB matches or templates. Within each split, we then assess the fraction of entries that either have or do not have high-confidence DMPfold models available. The green fraction indicates entries for which de novo models provide the only structural information currently available. Data for each fraction of each pie chart are summed over several proteomes (full data in Supplementary Tables 2 and 3)

**Fig. 8**
Overview of the DMPfold pipeline. Initially inter-residue Cβ distances, H-bonds and torsion angles are predicted from DMP inputs. These are used to generate models with CNS, and a single model is used as additional input to refine the distances and H-bonds. After 3 iterations a final set of models is returned
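
The control flow of the loop described above can be sketched as below. This is not the DMPfold implementation: `predict`, `build_models` and `select_model` are hypothetical callables standing in for the neural network predictors, model generation with CNS, and the choice of a single model to feed back. How the three iterations are counted relative to the initial prediction is also an assumption.

```python
def iterate_constraints(features, predict, build_models, select_model, n_iterations=3):
    """Alternate constraint prediction and model building.

    predict(features, prior_model) -> constraints (prior_model is None at first)
    build_models(constraints)      -> list of candidate models
    select_model(models)           -> the single model fed back as extra input
    """
    prior = None
    models = []
    for _ in range(n_iterations):
        constraints = predict(features, prior)  # distances, H-bonds, torsion angles
        models = build_models(constraints)      # stands in for the CNS step
        prior = select_model(models)            # one model informs the next round
    return models
```

Passing the steps in as callables keeps the sketch runnable with simple stubs and makes explicit that only one model from each round is fed back to the predictors.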

**Fig. 9**
DMPfold model architectures. DMPfold uses three predictors, all of which are deep, fully convolutional residual networks. Each uses a total of 18 residual blocks, comprising convolutional layers with a mixture of standard and dilated 5 × 5 filters. Where numbers are included in parentheses, these are the dimensions of the tensor output by the respective layer. For the iterative versions of the distance and H-bond predictors, the input tensor includes an extra feature channel composed of values taken from structures in the prior iteration (for a total of 502 channels). See the Methods section for full details

Cited by

Fast and accurate Ab Initio Protein structure prediction using deep learning potentials.
Pearce R, Li Y, Omenn GS, Zhang Y. PLoS Comput Biol. 2022 Sep 16;18(9):e1010539. doi: 10.1371/journal.pcbi.1010539. PMID: 36112717.
A Novel Protein from Ectocarpus sp. Improves Salinity and High Temperature Stress Tolerance in Arabidopsis thaliana.
Rathor P, Borza T, Stone S, Tonon T, Yurgel S, Potin P, Prithiviraj B. Int J Mol Sci. 2021 Feb 17;22(4):1971. doi: 10.3390/ijms22041971. PMID: 33671243.
DEMO2: Assemble multi-domain protein structures by coupling analogous template alignments with deep-learning inter-domain restraint prediction.
Zhou X, Peng C, Zheng W, Li Y, Zhang G, Zhang Y. Nucleic Acids Res. 2022 Jul 5;50(W1):W235-W245. doi: 10.1093/nar/gkac340. PMID: 35536281.
A Vaccine Construction against COVID-19-Associated Mucormycosis Contrived with Immunoinformatics-Based Scavenging of Potential Mucoralean Epitopes.
Naveed M, Ali U, Karobari MI, Ahmed N, Mohamed RN, Abullais SS, Kader MA, Marya A, Messina P, Scardina GA. Vaccines (Basel). 2022 Apr 22;10(5):664. doi: 10.3390/vaccines10050664. PMID: 35632420.
AI in health and medicine.
Rajpurkar P, Chen E, Banerjee O, Topol EJ. Nat Med. 2022 Jan;28(1):31-38. doi: 10.1038/s41591-021-01614-0. Epub 2022 Jan 20. PMID: 35058619. Review.

References

1. de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat. Rev. Genet. 2013;14:249–261. doi: 10.1038/nrg3414.
2. Monastyrskyy B, D’Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. New encouraging developments in contact prediction: Assessment of the CASP11 results. Proteins Struct. Funct. Bioinf. 2015;84:131–144. doi: 10.1002/prot.24943.
3. Michel M, et al. PconsFold: improved contact predictions improve protein models. Bioinformatics. 2014;30:i482–i488. doi: 10.1093/bioinformatics/btu458.
4. Bender BJ, et al. Protocols for molecular modeling with Rosetta3 and RosettaScripts. Biochemistry. 2016;55:4748–4763. doi: 10.1021/acs.biochem.6b00444.
5. Kosciolek T, Jones DT. De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS ONE. 2014;9:e92197. doi: 10.1371/journal.pone.0092197.
