Comparative Study
BMC Bioinformatics. 2009 Oct 28;10:359.
doi: 10.1186/1471-2105-10-359.

Analysis and comparison of very large metagenomes with fast clustering and functional annotation


Weizhong Li. BMC Bioinformatics. 2009.

Abstract

Background: The remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic datasets (metagenomes) are large collections of sequencing reads from anonymous species within particular environments. Computational analyses for very large metagenomes are extremely time-consuming, and there are often many novel sequences in these metagenomes that are not fully utilized. The number of available metagenomes is rapidly increasing, so fast and efficient metagenome comparison methods are in great demand.

Results: The new metagenomic data analysis method Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and a novel statistical metagenome comparison method that employs a unique graphic interface. RAMMCAP processes extremely large datasets with only moderate computational effort. It identifies raw read clusters and protein clusters that may include novel gene families, and compares metagenomes using clusters or functional annotations calculated by RAMMCAP. In this study, RAMMCAP was applied to the two largest available metagenomic collections, the "Global Ocean Sampling" and the "Metagenomic Profiling of Nine Biomes".

Conclusion: RAMMCAP is a very fast method that can cluster and annotate one million metagenomic reads in only hundreds of CPU hours. It is available from http://tools.camera.calit2.net/camera/rammcap/.


Figures

Figure 1
Metagenomic data analysis pipeline RAMMCAP.
Figure 2
Comparison between Rodriguez-Brito's method and the z test method. The x-axis and y-axis are the occurrence rates P_A and P_B of two samples A and B. The four plots use different combinations of sample sizes N_A and N_B, as indicated in each plot. Red and green lines are calculated with Rodriguez-Brito's method and the z test method, respectively. A difference between A and B that falls outside the area enclosed by a pair of red (or green) lines is statistically significant at the 0.95 confidence level. The figure shows that when P_A and P_B become large enough (for example, >0.001), a very small difference between them is counted as significant.
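The z test comparison described in this caption can be illustrated with a minimal sketch. It assumes the standard pooled two-proportion z test and a two-sided critical value of 1.96 for the 0.95 confidence level; the abstract does not give the exact formula used in RAMMCAP, so the function names and the pooling choice here are illustrative assumptions, not the author's implementation.

```python
import math


def two_proportion_z(count_a, n_a, count_b, n_b):
    """Pooled two-proportion z statistic (assumed form, not taken from the paper).

    count_a / n_a and count_b / n_b are the occurrence rates of one
    cluster or family in samples A and B.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = count_a / n_a
    p_b = count_b / n_b
    # Pooled occurrence rate under the null hypothesis that both rates are equal.
    pooled = (count_a + count_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0  # both samples have rate 0 or rate 1: no detectable difference
    return (p_a - p_b) / se


def is_significant(z, critical=1.96):
    """Two-sided test at the 0.95 confidence level (critical value is an assumption)."""
    return abs(z) > critical


# The same pair of rates (0.0010 vs 0.0011) at two hypothetical sample sizes.
z_small = two_proportion_z(100, 100_000, 110, 100_000)
z_large = two_proportion_z(1_000, 1_000_000, 1_100, 1_000_000)
print(is_significant(z_small), is_significant(z_large))
```

With these made-up counts, the difference is not significant at 100,000 reads per sample but becomes significant at 1,000,000 reads per sample, which is the sample-size sensitivity the caption points out.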
Figure 3
Distribution of clusters and sequences by cluster size. The x-axis is the cluster size X. The y-axis in the left panels (a and c) is the number of clusters of size at least X; the y-axis in the right panels (b and d) is the percentage of all sequences that fall in clusters of size at least X. Clustering analyses were also run separately for the microbiomes and the viromes, giving seven clustering experiments in total: GOS-ORF, BIOME-read, BIOME-ORF, BIOME-read-M, BIOME-read-V, BIOME-ORF-M, and BIOME-ORF-V (where M and V stand for microbiomes and viromes).
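The two quantities plotted in this figure can be computed from a list of cluster sizes, as in the following sketch. The function name and the example sizes are hypothetical and are not taken from RAMMCAP or from the paper's data.

```python
def cumulative_by_size(cluster_sizes, thresholds):
    """For each threshold X, return (X, clusters of size >= X, % of sequences in them).

    cluster_sizes: number of sequences in each cluster.
    thresholds: cluster sizes X at which to evaluate the distribution.
    """
    total_sequences = sum(cluster_sizes)
    rows = []
    for x in thresholds:
        kept = [size for size in cluster_sizes if size >= x]
        percent = 100.0 * sum(kept) / total_sequences if total_sequences else 0.0
        rows.append((x, len(kept), percent))
    return rows


# Hypothetical example: four clusters holding 10 sequences in total.
rows = cumulative_by_size([1, 1, 2, 6], thresholds=[1, 2, 5])
for x, n_clusters, percent in rows:
    print(f"size >= {x}: {n_clusters} clusters, {percent:.1f}% of sequences")
```

The list-comprehension scan is quadratic in the worst case; for millions of clusters, sorting the sizes once and accumulating from the largest would be a better fit.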
Figure 4
Distribution of clusters of Pfam sequences. The x-axis is cluster size. The y-axis in (a) is the number of sequences, and the y-axis in (b) is the number of clusters.
Figure 5
Similarity matrices of metagenomes. Squares along the diagonal represent the number of clusters in which a sample occurs. Grayscale squares below the diagonal represent the occurrence profile coefficients r_AB between two samples, with a darker color indicating greater similarity. Cells above the diagonal show the unique and overlapping clusters, as explained in (c). Hierarchical clustering of samples based on the matrix is shown, with vertical gridlines indicating the value of the coefficient at which two nodes are merged. Matrices are shown for GOS ORF clusters (a) and BIOME ORF clusters (b) with a significant factor f = 2 at the 0.95 confidence level. The BIOME samples are grouped by biome type, such as Coral-M, which stands for coral microbiome samples.
Figure 6
Similarity matrices of metagenomes based on families of two COG classes. Matrices are for GOS on COG class F (a), GOS on class T (b), BIOME on F (c), and BIOME on T (d), each with a significant factor f = 2 at the 0.95 confidence level. Because GOS samples are microbial marine samples, only the microbial (non-viral) water samples from the BIOME data were used. In addition, a representative subset of GOS samples was selected so that the GOS and BIOME figures are similar in size.


