The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families
- PMID: 17355171
- PMCID: PMC1821046
- DOI: 10.1371/journal.pbio.0050016
Abstract
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
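The closing claim of the abstract, that new families keep appearing at a roughly linear rate as sequences are added, can be pictured with an accumulation curve: count the distinct clusters seen after each batch of sequences and check whether the count flattens or keeps rising. The sketch below is illustrative only and is not the authors' pipeline. The function name `accumulation_curve`, the `step` parameter, and the toy cluster labels are all assumptions introduced here, and the input is assumed to be one cluster label per sequence in the order the sequences were added.

```python
from typing import Hashable, List, Sequence, Tuple


def accumulation_curve(
    cluster_ids: Sequence[Hashable], step: int = 1
) -> List[Tuple[int, int]]:
    """Return (sequences seen, distinct clusters seen) after every `step` sequences.

    `cluster_ids` holds one cluster label per sequence, in sampling order.
    The final point is always included, even if it is not a multiple of `step`.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    seen = set()
    curve = []
    total = len(cluster_ids)
    for i, cid in enumerate(cluster_ids, start=1):
        seen.add(cid)
        if i % step == 0 or i == total:
            curve.append((i, len(seen)))
    return curve


# Hypothetical toy labels (not GOS data): one sample saturates, one keeps growing.
saturating = ["A", "B", "A", "C", "A", "B", "C", "A"]
growing = ["A", "B", "A", "C", "D", "C", "E", "F"]

print(accumulation_curve(saturating, step=2))
print(accumulation_curve(growing, step=2))
```

A curve that plateaus suggests most families in the sampled population have been seen. A curve that keeps climbing, as the abstract reports for the combined dataset, suggests that further sequencing will continue to reveal new families.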
Comment in
- Global ocean sampling collection. PLoS Biol. 2007 Mar;5(3):e83. doi: 10.1371/journal.pbio.0050083. PMID: 17355178
- Untapped bounty: sampling the seas to survey microbial biodiversity. PLoS Biol. 2007 Mar;5(3):e85. doi: 10.1371/journal.pbio.0050085. Epub 2007 Mar 13. PMID: 20076663