Proc Natl Acad Sci U S A. 2007 Aug 28;104(35):13913-8.
doi: 10.1073/pnas.0702636104. Epub 2007 Aug 23.

Quantitative assessment of protein function prediction from metagenomics shotgun sequences

E D Harrington et al. Proc Natl Acad Sci U S A.

Abstract

To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
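As a rough check on the figures quoted in the abstract, the percentages can be converted back into approximate ORF counts. This is only a back-of-the-envelope sketch: it assumes the rounded total of 1.4 million ORFs, so the counts are approximate, and the variable names are illustrative.

```python
# Rounded total from the abstract ("1.4 million predicted ORFs").
TOTAL_ORFS = 1_400_000

# ORFs with a specific function (76%) and with any function,
# specific or nonspecific (83%).
specific = TOTAL_ORFS * 76 // 100
any_function = TOTAL_ORFS * 83 // 100

# ORFs that gain only a nonspecific function: the gap between the two.
nonspecific_only = any_function - specific

# ORFs annotated by neighborhood methods alone, as a share of the total.
neighborhood_only = 75_448
neighborhood_share = round(100 * neighborhood_only / TOTAL_ORFS, 1)

print(specific, any_function, nonspecific_only, neighborhood_share)
```

The neighborhood-only share comes out slightly above 5%, consistent with the "5% of the total" stated in the abstract.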

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Flow chart of function prediction procedure. By using homology to genes in the KEGG, COG, and UniRef90 databases, ORFs were divided into four categories based on the level of functional annotation possible: (i) specific functional annotation: ORFs similar to genes with specific functional information; (ii) nonspecific functional annotation: ORFs similar to genes that have been characterized at a general level or low similarity; (iii) no functional annotation but member of an existing family: ORFs with homologs in one of the databases but no functional information (e.g., “conserved hypothetical”); (iv) singletons: ORFs that have no significant similarity to known sequences. ORFs containing domains from the SMART and Pfam A databases were upgraded to having nonspecific annotation where applicable. Finally, genomic neighborhood methods were used to infer functional links between ORFs and upgrade the functional annotation accordingly.
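The upgrade logic in the Fig. 1 caption can be sketched as a small classifier. This is a minimal illustration, not the authors' pipeline: the `Orf` record, the `final_level` function, and the rule that the neighborhood step simply takes the higher of two levels are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Annotation(IntEnum):
    """The four categories from the caption, ordered from least to most informative."""
    SINGLETON = 0      # (iv) no significant similarity to known sequences
    FAMILY_ONLY = 1    # (iii) homologs exist but carry no functional information
    NONSPECIFIC = 2    # (ii) general-level function only
    SPECIFIC = 3       # (i) specific functional information


@dataclass
class Orf:
    orf_id: str
    homology_level: Annotation                    # from KEGG/COG/UniRef90 searches
    has_domain: bool = False                      # SMART or Pfam A domain present
    neighbor_level: Optional[Annotation] = None   # level implied by linked neighbors


def final_level(orf: Orf) -> Annotation:
    level = orf.homology_level
    # A known domain lifts the ORF to at least nonspecific annotation.
    if orf.has_domain and level < Annotation.NONSPECIFIC:
        level = Annotation.NONSPECIFIC
    # Assumption: a neighborhood link can only raise the level, never lower it.
    if orf.neighbor_level is not None:
        level = max(level, orf.neighbor_level)
    return level
```

For instance, `final_level(Orf("orf2", Annotation.FAMILY_ONLY, has_domain=True))` yields `Annotation.NONSPECIFIC`, mirroring the domain-based upgrade described in the caption.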
Fig. 2.
Protein function prediction in genomes and metagenomes. Many proteins can be functionally characterized in both data sets. The degree of functional characterization for four metagenomic data sets is shown on the left and for 124 prokaryotic genomes on the right. The inner pie chart represents the level of functional characterization possible by using the homology-based approach. The middle ring shows the level of functional characterization possible by using neighborhood methods. The outer ring summarizes the combined level of characterization possible. Surprisingly, it implies that most metagenomic ORFs (83% of the data) can be functionally characterized, similar to the level possible in fully sequenced genomes.
Fig. 3.
Prediction of function in previously uncharacterized gene families by using genomic neighborhood. Whereas homology-based approaches quantify the known functions, neighborhood approaches reveal functional novelty, even in conjunction with well known processes. (a) A putative transmembrane protein belonging to an uncharacterized COG (COG1981 shown in red) that consistently cooccurs with members of the well characterized heme biosynthesis pathway (colored blue). The putative membrane-associated protein occurs on 174 distinct contigs in the surface sea water and whale fall data sets that can be grouped into at least 15 unique operon arrangements, strongly suggesting a role in this process. (b) A predicted putative regulator, shown in red, that links fatty acid biosynthesis (upstream, colored green) with fatty acid degradation (downstream, colored blue), a functional link not seen in fully sequenced genomes. The regulator appears on 20 distinct contigs in the sea water, of which there are at least five unique operon arrangements.
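The kind of evidence described in Fig. 3, an uncharacterized family that repeatedly sits next to genes of a known pathway on distinct contigs, can be approximated with a simple adjacency count. This sketch is an assumption-laden simplification: the `Gene` record, the `neighbor_support` function, the same-strand requirement, the 300 bp gap cutoff, and the family labels `HEME_A` and `OTHER` are hypothetical, and the evolutionary-distance component mentioned in the abstract is not modeled.

```python
from collections import Counter
from typing import NamedTuple


class Gene(NamedTuple):
    family: str   # gene family label, e.g. a COG identifier
    start: int    # start coordinate on the contig
    end: int      # end coordinate on the contig
    strand: str   # "+" or "-"


def neighbor_support(contigs, query_family, max_gap=300):
    """For each family, count the distinct contigs on which it lies directly
    next to query_family, on the same strand, with at most max_gap bp between."""
    support = Counter()
    for genes in contigs.values():
        ordered = sorted(genes, key=lambda g: g.start)
        seen = set()  # families adjacent to the query on this contig
        for left, right in zip(ordered, ordered[1:]):
            if left.strand != right.strand:
                continue
            if right.start - left.end > max_gap:
                continue
            if left.family == query_family and right.family != query_family:
                seen.add(right.family)
            elif right.family == query_family and left.family != query_family:
                seen.add(left.family)
        support.update(seen)  # each contig contributes at most once per family
    return support


# Toy data: three hypothetical contigs.
contigs = {
    "c1": [Gene("HEME_A", 1, 900, "+"), Gene("COG1981", 950, 1400, "+")],
    "c2": [Gene("COG1981", 10, 500, "-"), Gene("HEME_A", 520, 1500, "-"),
           Gene("OTHER", 3000, 3500, "-")],
    "c3": [Gene("COG1981", 1, 400, "+"), Gene("OTHER", 450, 900, "-")],
}
support = neighbor_support(contigs, "COG1981")
```

Counting distinct contigs rather than raw gene pairs keeps a single long contig from dominating the tally, which is in the spirit of the caption's emphasis on distinct contigs and unique operon arrangements.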
Fig. 4.
Dependence of functional characterization on family size. Colored bars in this histogram of gene families binned by size represent the proportion of families with specific functional annotation (if >20% of the members were classified as such; green) and no specific annotation (a combination of nonspecific and no functional annotation; red). Gray bars indicate average gene family size in that bin. Only two of 174,124 bins containing singletons are shown for clarity. Most large gene families have a known function, whereas many small families remain uncharacterized.
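The binning described in Fig. 4 can be illustrated as follows. The rule that a family counts as specifically annotated when more than 20% of its members are so classified comes from the caption; the function names, the input layout (one boolean per family member), and the bin edges are assumptions chosen for the example.

```python
from collections import defaultdict


def family_is_specific(member_flags, threshold=0.20):
    """A family is 'specific' if more than `threshold` of its members are."""
    return sum(member_flags) / len(member_flags) > threshold


def summarize_by_size(families, bin_edges):
    """Group families into inclusive (low, high) size bins and report, per bin,
    the number of families, the fraction that are specific, and the mean size."""
    bins = defaultdict(list)
    for flags in families.values():
        size = len(flags)
        for low, high in bin_edges:
            if low <= size <= high:
                bins[(low, high)].append((size, family_is_specific(flags)))
                break
    summary = {}
    for edge, entries in bins.items():
        n = len(entries)
        summary[edge] = {
            "families": n,
            "specific_fraction": sum(is_spec for _, is_spec in entries) / n,
            "mean_size": sum(size for size, _ in entries) / n,
        }
    return summary


# Toy data: True marks a member with specific functional annotation.
families = {
    "famA": [True] * 3 + [False] * 7,   # 30% specific, counts as specific
    "famB": [True] * 2 + [False] * 8,   # exactly 20%, does not exceed the cutoff
    "famC": [False],                    # singleton
    "famD": [False, False],
}
summary = summarize_by_size(families, bin_edges=[(1, 2), (3, 10)])
```

In this toy input the larger bin has half of its families annotated and the small bin has none, the same qualitative pattern the caption reports for the real data.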
