Nat Commun. 2019 Sep 4;10(1):3977.
doi: 10.1038/s41467-019-11994-0.

Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints

Joe G Greener et al. Nat Commun.

Abstract

The inapplicability of amino acid covariation methods to small protein families has limited their use for structural annotation of whole genomes. Recently, deep learning has shown promise in allowing accurate residue-residue contact prediction even for shallow sequence alignments. Here we introduce DMPfold, which uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces more accurate models than two popular methods for a test set of CASP12 domains, and works just as well for transmembrane proteins. Applied to all Pfam domains without known structures, confident models for 25% of these so-called dark families were produced in under a week on a small 200 core cluster. DMPfold provides models for 16% of human proteome UniProt entries without structures, generates accurate models with fewer than 100 sequences in some cases, and is freely available.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Examples of DMPfold models. In each case the model is shown in orange and the native structure, if available, is shown in blue. a CASP12 FM domains. b A membrane protein from the FILM3 set. c Pfam families with available structures, used as a validation set. d CASP13 FM target T1010-D1, where DMPfold produced the best model at CASP13 (native structure not public). e Models displaying novel folds for Pfam families without structures
Fig. 2
DMPfold results on CASP12 FM domains compared to existing methods. a Distribution of TM-scores for the best of the top 5 models for each CASP12 FM domain. b Comparison of DMPfold and CONFOLD2 best of top 5 models. The dashed line indicates the point of equal quality models between the two methods, which both use CNS. c Similar to b but for Rosetta. d The change in TM-score and absolute distance error with DMPfold iterations for each domain. Domains are ordered by decreasing iteration 3 TM-score
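
The TM-scores compared in this figure can be illustrated with a minimal sketch. It assumes the standard TM-score definition (a sum of 1 / (1 + (d/d0)^2) over aligned residues, normalised by the native length) and takes the per-residue distances from an already chosen superposition as input; the real TM-score additionally searches for the superposition that maximises the score, which is not done here. The function names and the 0.5 Å floor on d0 for short chains are choices made for this sketch, not taken from the paper.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition (no superposition search).

    distances: distances in angstroms between aligned residue pairs
    target_length: number of residues in the native structure
    """
    if target_length > 21:
        # Length-dependent scale that makes scores comparable across sizes.
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        # Guard for short chains (an assumption of this sketch).
        d0 = 0.5
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


def best_of_top5(scores):
    """Best score among the first five ranked models."""
    return max(scores[:5])
```

A model that matches the native structure exactly scores 1.0, and only the first five ranked models contribute to the "best of the top 5" value plotted for each domain.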
Fig. 3
Performance of DMPfold on transmembrane proteins. a Distribution of TM-scores for the FILM3 TMP dataset. One model is generated for each of the 28 proteins. The FILM3 results are the final refined models from the FILM3 paper. b The per-protein correlation of TM-scores. The dashed line indicates the point of equal quality models between the two methods
Fig. 4
DMPfold run on Pfam families. a Number of Pfam families at each stage of the analysis. Each set is a subset of the previous set. b The TM-scores using TM-align of generated models to the native structure for the validation set of Pfam families with available structures not used for DMPfold training. c Overlap of high confidence models provided by DMPfold with two other studies that generated models for Pfam families. d Comparison of models after refinement provided by ref. with our models where a native structure is available. e Comparison of high confidence models provided by ref. with our models where a native structure is available. These are not the same families as in d
Fig. 5
DMPfold predictions are robust to variations in MSA composition and sequence length. Evaluations are made on the Pfam validation set. a Correlation of TM-align score with alignment depth and effective sequence count Neff, defined in the Methods section. The Pearson correlation coefficients are shown. DMPfold is able to generate accurate models for some Pfam families with fewer than 100 sequences in the sequence alignment. b There is little correlation of model accuracy with target sequence length. c In order to compare with ref. , we calculated the Nf with their criteria and plotted the mean model accuracy of models in bins of Nf values. Nf values were calculated as described in ref. , where an 80% identity threshold was used for clustering. Values were read off the graph in Fig. 2 of ref. and added here. It is important to note that the proteins used to obtain our values were different to theirs. It can be seen that DMPfold is effective at lower effective sequence counts
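
The idea of an effective sequence count with an 80% identity threshold can be sketched as follows. This is an illustration under stated assumptions, not the definition from the Methods section or from the cited study: each sequence is down-weighted by the number of alignment members at least 80% identical to it (itself included), and the normalised value divides the sum of weights by the square root of the alignment length, which is one common convention. The function names are hypothetical, and gap characters are treated like any other symbol for simplicity.

```python
import math


def identity(a, b):
    """Fraction of aligned columns in which two equal-length rows match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def effective_count(msa, threshold=0.8):
    """Sum of per-sequence weights 1 / (number of rows within the threshold)."""
    total = 0.0
    for a in msa:
        # Every row counts itself, so the neighbour count is at least 1.
        neighbours = sum(1 for b in msa if identity(a, b) >= threshold)
        total += 1.0 / neighbours
    return total


def normalised_count(msa, threshold=0.8):
    """Effective count divided by sqrt(alignment length) (assumed convention)."""
    return effective_count(msa, threshold) / math.sqrt(len(msa[0]))
```

Under this weighting, two identical sequences contribute as much as one, so a deep but redundant alignment can have a low effective count.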
Fig. 6
An example of model accuracy increasing after iterations. The model is for Pfam family PF13642. a Distance maps of the initial prediction, the prediction after iterations and the native structure. In each case the value at i, j is the centre of the distance bin with the maximum likelihood between residues i and j. b The change of the absolute error in distance from the initial prediction to the prediction after iterations. A negative value indicates an improvement with the iterations. c The improvement is shown on the structure. The native structure (PDB ID 2L6O) is in blue, the initial model is in orange and the model after iterations is in green. The loop region indicated in red throughout, and the following helix, are closer to the native structure in the prediction after iterations than the initial prediction
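
The two quantities described in panels a and b can be sketched in a few lines. The sketch assumes the predictor outputs, for every residue pair, a list of probabilities over distance bins; the bin edges and function names are hypothetical and chosen only for illustration.

```python
def bin_centres(edges):
    """Midpoint of each consecutive pair of bin edges."""
    return [(lo + hi) / 2.0 for lo, hi in zip(edges, edges[1:])]


def distance_map(probs, edges):
    """Map each residue pair to the centre of its most likely distance bin.

    probs[i][j] is a list of bin probabilities for residues i and j.
    """
    centres = bin_centres(edges)
    return [
        [centres[max(range(len(p)), key=p.__getitem__)] for p in row]
        for row in probs
    ]


def error_change(initial, iterated, native):
    """Change in absolute distance error; negative values are improvements."""
    return [
        [abs(it - nat) - abs(ini - nat) for ini, it, nat in zip(r_ini, r_it, r_nat)]
        for r_ini, r_it, r_nat in zip(initial, iterated, native)
    ]
```

For example, a pair predicted at 9 Å initially and 7 Å after iterations, with a native distance of 6 Å, gives a change of -2 Å, which would appear as an improvement in panel b.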
Fig. 7
Coverage of proteomes by DMPfold models. Left: Pie chart showing the fraction of Pfam-annotated amino acid residues in a number of proteomes for which templates or PDB matches are available (blue). Of the remaining residues, the fractions covered by high-confidence DMPfold models (red) and lower-confidence models (orange) are marked. Right: Pie chart of UniProt entries in several proteomes. Entries are first split into whether or not they are (at least partially) covered by PDB matches or templates. Within each split, we then assess the fraction of entries that either have or do not have high-confidence DMPfold models available. The green fraction indicates entries for which de novo models provide the only structural information currently available. Data for each fraction of each pie chart are summed over several proteomes (full data in Supplementary Tables 2 and 3)
Fig. 8
Overview of the DMPfold pipeline. Initially inter-residue Cβ distances, H-bonds and torsion angles are predicted from DMP inputs. These are used to generate models with CNS, and a single model is used as additional input to refine the distances and H-bonds. After 3 iterations a final set of models is returned
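
The control flow of this pipeline can be sketched as a short loop. This is only an outline under assumptions: `predict`, `build` and `select` are hypothetical callables standing in for the neural network predictors, the CNS model generation step and the choice of a single model, and the sketch treats the first pass as separate from the three refinement iterations, which the caption does not specify. It also passes the selected model back to a single predictor call, whereas the caption states that only the distances and H-bonds are refined.

```python
def iterative_fold(features, predict, build, select, n_iterations=3):
    """Predict constraints, build models, and feed one model back in.

    predict(features, previous_model) -> constraints (previous_model may be None)
    build(constraints) -> list of candidate models
    select(models) -> the single model used as extra input for the next pass
    """
    # Initial pass: no structural feedback is available yet.
    constraints = predict(features, None)
    models = build(constraints)
    for _ in range(n_iterations):
        # Refine the constraints using one model from the previous pass.
        best = select(models)
        constraints = predict(features, best)
        models = build(constraints)
    # The final set of models comes from the last iteration.
    return models
```

Passing the three steps in as functions keeps the loop independent of any particular predictor or modelling engine, so it can be exercised with simple stand-ins.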
Fig. 9
DMPfold model architectures. DMPfold uses three predictors, all of which are deep, fully convolutional residual networks. Each uses a total of 18 residual blocks, comprising convolutional layers with a mixture of standard and dilated 5 × 5 filters. Where numbers are included in parentheses, these are the dimensions of the tensor output by the respective layer. For the iterative versions of the distance and H-bond predictors, the input tensor includes an extra feature channel composed of values taken from structures in the prior iteration (for a total of 502 channels). See the Methods section for full details

References

    1. de Juan D, Pazos F, Valencia A. Emerging methods in protein co-evolution. Nat. Rev. Genet. 2013;14:249–261. doi: 10.1038/nrg3414.
    2. Monastyrskyy B, D’Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. New encouraging developments in contact prediction: Assessment of the CASP11 results. Protein. Struct. Funct. Bioinf. 2015;84:131–144. doi: 10.1002/prot.24943.
    3. Michel M, et al. PconsFold: improved contact predictions improve protein models. Bioinformatics. 2014;30:i482–i488. doi: 10.1093/bioinformatics/btu458.
    4. Bender BJ, et al. Protocols for molecular modeling with Rosetta3 and RosettaScripts. Biochemistry. 2016;55:4748–4763. doi: 10.1021/acs.biochem.6b00444.
    5. Kosciolek T, Jones DT. De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS ONE. 2014;9:e92197. doi: 10.1371/journal.pone.0092197.
