PLoS Comput Biol. 2021 Apr 15;17(4):e1008798.
doi: 10.1371/journal.pcbi.1008798. eCollection 2021 Apr.

Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families

Claudio Bassot et al. PLoS Comput Biol.

Abstract

Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryote-specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning, it is often possible to predict a protein's structure. However, the unique sequence features of repeat proteins have made it challenging to use direct coupling analysis for predicting contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcome this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of 41 families with estimated high accuracy.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Repeat protein classification.
Representation of the repeat classes and subclasses as classified in RepeatsDB 2.0 [14].
Fig 2. The precision of contact predictions.
Positive predictive value (PPV) for GaussDCA (red), PconsC4 (blue), DeepMetaPsicov (green), and trRosetta (orange).
Fig 3. The precision of contact predictions of trRosetta for the three datasets.
Results are shown for the three datasets: the single unit dataset in blue, the double unit dataset in red, and the complete region dataset in green.
Fig 4. GaussDCA, PconsC4, DeepMetaPsicov, and trRosetta contact maps.
Contact maps for predictions obtained with GaussDCA, PconsC4, DeepMetaPsicov and trRosetta. The real contacts from the structure are shown in grey, the correctly predicted contacts in green, and the falsely predicted contacts in red.
Fig 5. The relation between precision and the effective number of sequences in the MSA.
Positive predictive value (PPV) for trRosetta in orange, GaussDCA in red, PconsC4 in blue and DeepMetaPsicov in green, plotted against the Neff value (the effective number of sequences, weighted by the length of the protein). Each dot corresponds to one protein in the datasets, and the line is the running average (n = 50).
Fig 6. Predicted contacts analysis.
a) Examples of inter- and intra-unit contacts. b) PPV for intra-unit contacts (red) and inter-unit contacts (blue) predicted by DeepMetaPsicov. The lines are the respective running averages of the PPV over the ratio of inter-unit contacts to the total number of protein contacts. c) PPV for intra-unit contacts (red) and inter-unit contacts (blue) predicted by trRosetta. The lines are the respective running averages of the PPV over the ratio of inter-unit contacts to the total number of protein contacts.
Fig 7. Protein model quality.
TM-score for the subfamilies. Models from trRosetta are shown in orange, PconsC4 in blue and DeepMetaPsicov in green.
Fig 8. TM-score versus QA methods.
a) TM-score versus Pcons-score for complete region models generated with trRosetta. b) TM-score versus QmeanDisCo score for full region models created from trRosetta contacts.
Fig 9
a) Real TM-score versus Random Forest Predicted TM-score for complete region models generated with trRosetta. b) Pearson correlation coefficient between the TM-score and the QA methods.
Fig 10. Comparison between the contact-based model and homology modelling.
Superposition of the contact-based model (red) and the homology model (blue), with the respective TM-score.
Fig 11. Selected models.
The different protein units are coloured in red and blue. a) SPW; b) SPW, with the “SPW” motif in red; c) Curlin; d) UCH-protein; e) Xin repeat.
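
The quantities named in the captions of Figs 2, 5 and 6 (PPV, the intra-/inter-unit split, and the running average with n = 50) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' pipeline: contacts are assumed to be given as residue index pairs, repeat units as inclusive (start, end) residue ranges, the running average is assumed to be a plain sliding-window mean, and all function names (`ppv`, `unit_of`, `split_by_unit`, `running_average`) are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Contact = Tuple[int, int]  # residue index pair (i, j); representation is an assumption


def ppv(predicted: Sequence[Contact], true_contacts: Set[Contact]) -> float:
    """Positive predictive value: fraction of predicted contacts found in the structure."""
    if not predicted:
        return 0.0
    hits = sum(1 for pair in predicted if pair in true_contacts)
    return hits / len(predicted)


def unit_of(residue: int, unit_bounds: Sequence[Tuple[int, int]]) -> int:
    """Index of the repeat unit containing `residue`, or -1 if it lies outside all units."""
    for k, (start, end) in enumerate(unit_bounds):
        if start <= residue <= end:  # bounds are assumed inclusive
            return k
    return -1


def split_by_unit(
    predicted: Sequence[Contact], unit_bounds: Sequence[Tuple[int, int]]
) -> Tuple[List[Contact], List[Contact]]:
    """Split contacts into intra-unit (same unit) and inter-unit (different units)."""
    intra: List[Contact] = []
    inter: List[Contact] = []
    for i, j in predicted:
        ui, uj = unit_of(i, unit_bounds), unit_of(j, unit_bounds)
        if ui == -1 or uj == -1:
            continue  # skip contacts involving residues outside the annotated units
        (intra if ui == uj else inter).append((i, j))
    return intra, inter


def running_average(values: Sequence[float], window: int = 50) -> List[float]:
    """Sliding-window mean; returns one value per full window (assumed scheme)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    total = sum(values[:window])
    out = [total / window]
    for k in range(window, len(values)):
        total += values[k] - values[k - window]
        out.append(total / window)
    return out
```

For a plot like Fig 5, the per-protein PPV values would first be sorted by Neff and then passed to `running_average`; the exact windowing used in the paper is not stated in the caption.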

References

    1. Heringa J. Detection of internal repeats: how common are they? Curr Opin Struct Biol. 1998;8: 338–345. doi: 10.1016/s0959-440x(98)80068-7
    2. Strand M, Prolla TA, Liskay RM, Petes TD. Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature. 1993;365: 274–276. doi: 10.1038/365274a0
    3. Pâques F, Leung W-Y, Haber JE. Expansions and Contractions in a Tandem Repeat Induced by Double-Strand Break Repair. Molecular and Cellular Biology. 1998. pp. 2045–2054. doi: 10.1128/mcb.18.4.2045
    4. Schaper E, Gascuel O, Anisimova M. Deep conservation of human protein tandem repeats within the eukaryotes. Mol Biol Evol. 2014;31: 1132–1148. doi: 10.1093/molbev/msu062
    5. Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D. A census of protein repeats. J Mol Biol. 1999;293: 151–160. doi: 10.1006/jmbi.1999.3136