Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008 Apr 10:9:182.
doi: 10.1186/1471-2105-9-182.

Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Affiliations

Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Shibu Yooseph et al. BMC Bioinformatics. .

Abstract

Background: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools.

Results: We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all compute that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA, (http://camera.calit2.net).

Conclusion: The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Flow chart of the incremental clustering method.
Figure 2
Figure 2
Percentage of Unrelated Pairs in Clusters. For all clusters, only Pfam match-containing sequences were considered. For the top curve (labeled All), all clusters with at least two Pfam match-containing sequences were considered where as for the second curve (labeled At least 5), only those clusters with at least five Pfam match-containing sequences were considered. For the later curve, it is seen that 94% of the reported clusters have no unrelated pairs. The bottom two curves show the trends for the "strict" version of unrelatedness.
Figure 3
Figure 3
Number of clusters that domain architectures appear in. For the bottom curve (labeled All), all domain architectures were considered whereas for the top curve a domain architecture is considered as appearing in a cluster only if it has at least five instances in that cluster. In both cases, nearly 61% of domain architectures appear in a single cluster, and over 80% of domain architectures appear in at most 3 clusters.
Figure 4
Figure 4
Log-Log Plot of Cluster Size Distribution. The x-axis is the logarithm of the cluster size C and the y-axis is the logarithm of the number of clusters of size ≥C. Logarithms are in base 10. The blue curve is the observed data, which is consistent with a power law. There is an inflection point around C = 2500 (a value of 3.4 on the x-axis). The two red lines are the least square fit to C ≤ 2500 and C > 2500, respectively. The former line is y = -0.733*x + 5.517, with R2 = 0.995, and the later line is y = -1.686*x + 8.813, with R2 = 0.992.

References

    1. Genome Project Statistic http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
    1. DeLong EF. Microbial community genomics in the ocean. Nat Rev Microbiol. 2005;3:459–469. doi: 10.1038/nrmicro1158. - DOI - PubMed
    1. Eisen JA. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol. 2007;5:e82. doi: 10.1371/journal.pbio.0050082. - DOI - PMC - PubMed
    1. Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM. Comparative metagenomics of microbial communities. Science. 2005;308:554–557. doi: 10.1126/science.1107851. - DOI - PubMed
    1. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340. - DOI - PubMed

Publication types

MeSH terms

Substances

LinkOut - more resources