Science. 2017 Jan 20;355(6322):294-298.

doi: 10.1126/science.aah4043.

Protein structure determination using metagenome sequence data

Sergey Ovchinnikov¹ ² ³, Hahnbeom Park¹ ², Neha Varghese⁴, Po-Ssu Huang¹ ², Georgios A Pavlopoulos⁴, David E Kim¹ ⁵, Hetunandan Kamisetty⁶, Nikos C Kyrpides⁴ ⁷, David Baker⁸ ² ⁵

Affiliations

¹ Department of Biochemistry, University of Washington, Seattle, WA 98105, USA.
² Institute for Protein Design, University of Washington, Seattle, WA 98105, USA.
³ Molecular and Cellular Biology Program, University of Washington, Seattle, WA 98195, USA.
⁴ Joint Genome Institute, Walnut Creek, CA 94598, USA.
⁵ Howard Hughes Medical Institute, University of Washington, Box 357370, Seattle, WA 98105, USA.
⁶ Facebook Inc., Seattle, WA 98109, USA.
⁷ Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia.
⁸ Department of Biochemistry, University of Washington, Seattle, WA 98105, USA. dabaker@u.washington.edu.

PMID: 28104891
PMCID: PMC5493203
DOI: 10.1126/science.aah4043

Sergey Ovchinnikov et al. Science. 2017.

Abstract

Despite decades of work by structural biologists, there are still ~5200 protein families with unknown structure outside the range of comparative modeling. We show that Rosetta structure prediction guided by residue-residue contacts inferred from evolutionary information can accurately model proteins that belong to large families and that metagenome sequence data more than triple the number of protein families with sufficient sequences for accurate modeling. We then integrate metagenome data, contact-based structure matching, and Rosetta structure calculations to generate models for 614 protein families with currently unknown structures; 206 are membrane proteins and 137 have folds not represented in the Protein Data Bank. This approach provides the representative models for large protein families originally envisioned as the goal of the Protein Structure Initiative at a fraction of the cost.

Figures

**Fig. 1**
Comparison of Rosetta models (left) to subsequently published crystal structures (right). The models accurately recapitulate the structural details of (A) the Cytochrome bd oxidase (TMalign score 0.88), (B) the Lipoprotein signal peptidase II (TMalign score 0.70), (C) the DMT superfamily transporter YddG (TMalign score 0.70), (D) the Fluoride ion transporter dimer (TMalign score 0.69), (E) the CASP11 target T0806, (F) Prolipoprotein diacylglyceryl transferase (TMalign score 0.69), and (G) Fumarate hydratase (TMalign score 0.80 for the monomer (top) and 0.76 for the dimer (bottom)).

**Fig. 2**
Metagenome data greatly increase the fraction of structures that can be accurately modeled. (A) Dependence of coevolution-guided Rosetta structure prediction accuracy on the effective number of sequences, Nf (a function of both sequence number and diversity; see Methods for the definition), in the protein family. For each of 27 proteins of known structure, the multiple sequence alignment was subsampled and residue-residue contacts were predicted using GREMLIN. Rosetta structure prediction calculations were then used to generate ~20,000 models, and a single model was selected based on the Rosetta energy and the fit to the coevolution constraints; the average TMscore of these selected models over all 27 cases is shown on the y axis (dashed line). Hybridization-based refinement of the top 20 models together with the top 10 *map_align*-based models for each case increases the average accuracy (solid line); models with fold-level accuracy (TMscore > 0.5) are obtained for Nf ≥ 16, and models with accuracy typical of comparative modeling are obtained for Nf of 64. (B) Fraction of protein families of unknown structure with Nf of at least 64. Dashed line: including only sequences in the UniRef100 database; solid line: including sequences in the UniRef100 database together with metagenome sequence data from JGI (37). (C) Distribution of Nf values for 5211 PFAM families with currently unknown structure, after the addition of metagenomic sequences; 25% of the protein families have Nf > 64, 34% have Nf > 32, and 45% have Nf > 16.

**Fig. 3**
Representative structure models for selected PFAM families. Membrane proteins are on the top row; new folds on the bottom right. The multidomain models of the iron transporter and RNA helicase and the dimeric model of CobS, an enzyme in vitamin B12 synthesis, are guided by both intra- and inter-chain coevolution restraints.
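
The Nf thresholds quoted in the Fig. 2 caption can be illustrated with a minimal Python sketch. The caption defers the exact definition of Nf to the Methods, so the formula used here is an assumption: sequences are down-weighted by the number of neighbors at or above 80% identity, and the resulting effective count is divided by the square root of the alignment length. The function names (`effective_sequences`, `nf`, `expected_accuracy`) and the toy alignment are hypothetical, and the cutoffs 16 and 64 are taken from the caption.

```python
import math


def effective_sequences(alignment, identity_cutoff=0.8):
    """Effective sequence count: each sequence contributes 1/k, where k is the
    number of sequences (itself included) at or above the identity cutoff.
    ASSUMPTION: this weighting is a common convention, not quoted from the text."""
    total = 0.0
    for a in alignment:
        neighbors = 0
        for b in alignment:
            matches = sum(x == y for x, y in zip(a, b))
            if matches / len(a) >= identity_cutoff:
                neighbors += 1
        total += 1.0 / neighbors
    return total


def nf(alignment, identity_cutoff=0.8):
    """ASSUMED definition: effective sequences divided by sqrt(alignment length)."""
    length = len(alignment[0])
    return effective_sequences(alignment, identity_cutoff) / math.sqrt(length)


def expected_accuracy(nf_value):
    """Map an Nf value to the accuracy regime described in the Fig. 2 caption."""
    if nf_value >= 64:
        return "comparative-modeling-level"
    if nf_value >= 16:
        return "fold-level"
    return "insufficient"


# Toy alignment of equal-length, pre-aligned sequences (hypothetical data).
toy_alignment = ["AAAA", "AAAA", "CCCC"]
toy_nf = nf(toy_alignment)
print(toy_nf, expected_accuracy(toy_nf))
```

The pairwise loop is quadratic in the number of sequences, so it is suitable only for small illustrative alignments; real families with thousands of sequences would need a clustered or vectorized approach.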

Comment in

Big-data approaches to protein structure prediction.
Söding J. Science. 2017 Jan 20;355(6322):248-249. doi: 10.1126/science.aal4512. PMID: 28104854. No abstract available.

Associated data

Dryad/10.5061/dryad.27p4s
