PLoS One. 2014 Jan 2;9(1):e84348.
doi: 10.1371/journal.pone.0084348. eCollection 2014.

Comparison of metatranscriptomic samples based on k-tuple frequencies

Ying Wang et al. PLoS One.

Abstract

Background: The comparison of samples, or beta diversity, is one of the essential problems in ecological studies. Next generation sequencing (NGS) technologies make it possible to obtain large amounts of metagenomic and metatranscriptomic short read sequences across many microbial communities. De novo assembly of the short reads can be especially challenging because the number of genomes and their sequences are generally unknown and the coverage of each genome can be very low, so traditional alignment-based sequence comparison methods cannot be used. Alignment-free approaches based on k-tuple frequencies, on the other hand, have yielded promising results for the comparison of metagenomic samples. However, it is not known whether these approaches can be used for the comparison of metatranscriptome datasets or which dissimilarity measures perform best.
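
As a minimal sketch of what a k-tuple frequency vector is, the Python below counts every length-k word across a set of reads and normalizes the counts. The function names, the handling of ambiguous bases and the toy reads are assumptions made for illustration; the abstract does not describe the authors' counting procedure (for example, whether reverse complements are pooled).

```python
from collections import Counter
from itertools import product


def ktuple_counts(reads, k):
    """Count every length-k word (k-tuple) over a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def ktuple_frequencies(reads, k):
    """Map each of the 4**k possible words to its relative frequency."""
    counts = ktuple_counts(reads, k)
    total = sum(counts.values())
    words = ("".join(letters) for letters in product("ACGT", repeat=k))
    return {word: (counts[word] / total if total else 0.0) for word in words}


# Toy reads, invented for illustration only.
freqs = ktuple_frequencies(["ACGTAC", "GTACNA"], k=2)
```

Two samples can then be compared through their frequency vectors alone, without assembling or aligning the reads.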

Results: We applied several beta diversity measures based on k-tuple frequencies to real metatranscriptomic datasets from the 454 pyrosequencing and Illumina sequencing platforms to evaluate their effectiveness for clustering metatranscriptomic samples. The measures comprised three d2-type dissimilarity measures, one dissimilarity measure from CVTree, one relative entropy based measure S2 and three classical Lp-norm distances. The measure d2(S) achieved superior performance in clustering metatranscriptomic samples into different groups under different sequencing depths for both 454 and Illumina datasets, in recovering environmental gradients affecting microbial samples, and in classifying coexisting metagenomic and metatranscriptomic datasets, while remaining robust to sequencing errors. We also investigated the effects of tuple size and of the order of the background Markov model. A software pipeline implementing all the steps of the analysis is available at http://code.google.com/p/d2-tools/.
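
The abstract does not restate the formula for d2(S). The sketch below follows the definition commonly given in the alignment-free literature: k-tuple counts are centered by their expectation under a background model, and the centered counts are combined in a self-normalized inner product. It assumes a 0-th order (i.i.d.) background whose base probabilities are estimated from each sample's own reads. All function names and the toy reads are hypothetical, and this is not the d2-tools implementation.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

ALPHABET = "ACGT"


def centered_counts(reads, k):
    """k-tuple counts minus their expectation under a 0-th order model.

    Assumption: base probabilities are estimated from the reads themselves.
    """
    counts = Counter()
    bases = Counter()
    for read in reads:
        bases.update(ch for ch in read if ch in ALPHABET)
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if all(ch in ALPHABET for ch in word):
                counts[word] += 1
    n_words = sum(counts.values())
    n_bases = sum(bases.values())
    if n_bases == 0:
        raise ValueError("no A/C/G/T bases in input")
    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        expected = n_words * prod(bases[ch] / n_bases for ch in word)
        centered[word] = counts[word] - expected
    return centered


def d2s_dissimilarity(reads_x, reads_y, k):
    """d2(S)-style dissimilarity; 0 for identical samples, at most 1."""
    x = centered_counts(reads_x, k)
    y = centered_counts(reads_y, k)
    cross = norm_x = norm_y = 0.0
    for word in x:
        xw, yw = x[word], y[word]
        scale = sqrt(xw * xw + yw * yw)
        if scale == 0.0:
            continue  # this word carries no signal in either sample
        cross += xw * yw / scale
        norm_x += xw * xw / scale
        norm_y += yw * yw / scale
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("a sample has no deviation from its background model")
    return 0.5 * (1.0 - cross / (sqrt(norm_x) * sqrt(norm_y)))


# Toy samples, invented for illustration only.
sample_a = ["ACGTACGTAC", "TTGACCA"]
sample_b = ["GGGGCCCCGG", "GCGCGC"]
d_ab = d2s_dissimilarity(sample_a, sample_b, k=2)
d_aa = d2s_dissimilarity(sample_a, sample_a, k=2)
```

A higher-order Markov background would change only how the expected count of each word is computed; the combination step stays the same.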

Conclusions: The k-tuple based sequence signature measures can effectively reveal major groups and gradient variation among metatranscriptomic samples from NGS reads. The d2(S) dissimilarity measure performs well in all application scenarios and its performance is robust with respect to tuple size and order of the Markov model.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Geographical distribution of 11 communities in our study.
Our study uses 92 samples from 12 marine communities. ‘SWGE’ (Dataset 10 in Table 1) was collected from different locations during two research cruises in the Equatorial North Atlantic Ocean and the South Pacific Subtropical Gyre. The locations of the other 11 communities are marked on the map using the DatasetID from Table 1; Datasets 1, 2, 3, 9 and 12 were collected from nearby locations.
Figure 2. The reference tree of the four communities in Experiment 1 (without branch length information).
The four communities are located at four distinct geographical locations with clear clustering characteristics. The Georgia data comprise two control, two PUT (putrescine) experimental and two SPD (spermidine) experimental datasets.
Figure 3. Clustering results of the four distinctive communities in Experiment 1 based on d2(S)|M0 and k = 6.
d2(S)|M0 denotes the dissimilarity measure d2(S) under a 0-th order Markov chain model. All the basic clusters for the four communities are correct. Within the Georgia community, the SPD and PUT sub-classes are clustered correctly; only the two control samples are not.
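
The captions do not say which algorithm turns the pairwise dissimilarities into a tree. As one common choice, the sketch below assumes average-linkage clustering (UPGMA) and returns the topology as nested tuples without branch lengths. The function name and the toy matrix are hypothetical.

```python
def upgma(labels, dist):
    """Average-linkage clustering of a symmetric dissimilarity matrix.

    Returns the tree topology as nested tuples (no branch lengths).
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    pair_dist = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(pair_dist, key=pair_dist.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = pair_dist[(min(a, c), max(a, c))]
            d_bc = pair_dist[(min(b, c), max(b, c))]
            # Size-weighted average keeps every original sample equally weighted.
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        pair_dist = {p: v for p, v in pair_dist.items() if a not in p and b not in p}
        pair_dist.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Toy dissimilarities, invented for illustration only.
toy = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
tree = upgma(["A", "B", "C", "D"], toy)
```
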
Figure 4. Average symmetric difference scores for the four distinct communities under different sampling rates in Experiment 1.
(A) shows the symmetric difference scores for the complete data as a function of tuple size k for different dissimilarity measures. (B), (C) and (D) show the average symmetric difference scores as a function of tuple size k for different dissimilarity measures after 100 random samplings at 10%, 1% and 0.1% sampling rates, respectively. The lower the score, the closer the clustering result is to the reference tree. d2(S) shows the best performance under most conditions.
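
The symmetric difference score in Figure 4 compares a clustering tree with the reference tree. The sketch below is a simplified rooted-tree version that counts the clades (leaf sets of internal nodes) present in exactly one of the two trees. Tree-comparison tools often work on unrooted bipartitions instead, and the source does not say which variant was used, so the representation and function names here are assumptions.

```python
def clades(tree):
    """Leaf sets of all internal nodes of a rooted tree given as nested tuples."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the full leaf set is shared by every tree
    return found


def symmetric_difference(tree_a, tree_b):
    """Number of clades found in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Toy trees, invented for illustration only.
reference = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
score = symmetric_difference(reference, inferred)
```

A score of 0 means both trees contain exactly the same groupings, which matches the caption's reading that lower is better.
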
Figure 5. Clustering results of four communities under 0.1% sampling rate based on d2(S)|M0 and k = 6 in Experiment 1.
d2(S)|M0 denotes the dissimilarity measure d2(S) under a 0-th order Markov chain model. d2(S) can still cluster the four basic communities correctly, but cannot distinguish the subgroups in the Georgia community well.
Figure 6. Clustering results of the 92 datasets from 12 communities based on d2(S)|M0 and k = 6 in Experiment 1.
d2(S)|M0 denotes the dissimilarity measure d2(S) under a 0-th order Markov chain model. d2(S) correctly clusters most of the basic communities and correctly subgroups the control and amended samples, validating the effectiveness of d2(S).
Figure 7. The PCoA ordinates of the NPSG data are primarily driven by the collection depth in Experiment 2.
(A) shows the two-dimensional PCoA plot of the samples based on the dissimilarity measure d2(S) under a 0-th order Markov model with k = 10 (the setting with the highest SRCC); (B) shows the clustering tree for d2(S) under a 0-th order Markov model with k = 10.
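
Principal coordinates analysis (PCoA), as used in Figure 7, embeds samples in a few dimensions so that distances in the plot approximate the input dissimilarities. The sketch below is a dependency-free version of the classical procedure: double-center the squared dissimilarities, then extract the leading eigenvectors by power iteration with deflation. The function name, the starting vector and the iteration count are assumptions. Axes with a non-positive eigenvalue, which can occur for non-Euclidean dissimilarities, are collapsed to zero.

```python
import math


def pcoa(dist, n_axes=2, iterations=500):
    """Classical PCoA coordinates from a symmetric dissimilarity matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Gower double-centering: b = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    axes = []
    for _axis in range(n_axes):
        v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
        for _step in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        scale = math.sqrt(max(eigenvalue, 0.0))
        axes.append([x * scale for x in v])
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return [tuple(axis[i] for axis in axes) for i in range(n)]


# Toy example: four corners of a 4 x 2 rectangle, invented for illustration.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)]
toy_dist = [[math.dist(p, q) for q in points] for p in points]
coords = pcoa(toy_dist)
```

For Euclidean input such as this toy matrix, the recovered coordinates reproduce the original pairwise distances up to rotation and reflection.
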
Figure 8. Clustering results of the metagenomic and metatranscriptomic datasets from the NPSG community in Experiment 3.
(A) and (B) are the clustering results using the complete data under d2(S) with k = 7 and d2* with k = 6 under a 2nd order Markov chain model; (C) and (D) are the clustering results based on average dissimilarity using d2(S) with k = 7 under a 0-th order Markov chain model and Hao with k = 6, based on 100 rounds of 1% sampling from the original data; (E) and (F) are the clustering results based on average dissimilarity using d2(S) with k = 4 and k = 6 under a 0-th order Markov chain model, based on 100 rounds of 0.1% sampling from the original data.
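
Panels (C) to (F) of Figure 8 average a dissimilarity over repeated random subsamples of the reads. The sketch below shows one way to do that. It assumes fixed-size sampling without replacement and takes the dissimilarity as a caller-supplied function. The function names, the seed handling and the toy GC-content measure are hypothetical, and the source does not describe the authors' exact sampling scheme.

```python
import random


def subsample(reads, rate, rng):
    """Draw round(rate * len(reads)) reads without replacement (at least one)."""
    size = max(1, round(rate * len(reads)))
    return rng.sample(reads, size)


def average_dissimilarity(reads_x, reads_y, dissimilarity, rate, repeats=100, seed=0):
    """Mean of `dissimilarity` over `repeats` independent subsamples of both inputs."""
    rng = random.Random(seed)  # fixed seed so the average is reproducible
    total = 0.0
    for _ in range(repeats):
        total += dissimilarity(subsample(reads_x, rate, rng),
                               subsample(reads_y, rate, rng))
    return total / repeats


def gc_gap(reads_x, reads_y):
    """Toy stand-in dissimilarity: absolute difference in GC content."""
    def gc(reads):
        text = "".join(reads)
        return sum(ch in "GC" for ch in text) / len(text)
    return abs(gc(reads_x) - gc(reads_y))


# Toy reads, invented for illustration only.
gc_rich = ["GCGC"] * 1000
at_rich = ["ATAT"] * 1000
avg = average_dissimilarity(gc_rich, at_rich, gc_gap, rate=0.01)
```

Any pairwise measure with the same two-argument signature can be passed in place of the toy function.
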
Figure 9. Clustering results of the Western English Channel based on d2(S)|M2 and k = 8 in Experiment 3.
The datasets contain the metagenomic (MG) and metatranscriptomic (MT) samples collected at different times. d2(S)|M2 denotes the dissimilarity measure d2(S) under a 2nd order Markov chain model. As shown in the parentheses after each dataset, the MT and MG clusterings show clear diurnal and seasonal patterns, respectively.
Figure 10. The reference tree of the mouse datasets in Experiment 4.
The seven samples are clustered according to their tissue types. In a sample ID such as ‘NOD504ColQN’, the digits (‘504’) are the mouse ID, ‘Cec’ means cecum, ‘Col’ means colon, and ‘QN’ means the Qiagen-based protocol.
Figure 11. Clustering results of the mouse datasets based on d2(S)|M0 and k = 4 in Experiment 4.
d2(S)|M0 denotes the dissimilarity measure d2(S) under a 0-th order Markov chain model. The clusters for the four cecum samples are correct. Of the three colon samples, two are clustered correctly, while the third is merged last.
Figure 12. Average symmetric difference scores for the mouse datasets under different sampling rates in Experiment 4.
(A) shows the symmetric difference scores as a function of tuple size k for different dissimilarity measures based on the complete data. (B), (C) and (D) show the average symmetric difference scores as a function of tuple size k for different dissimilarity measures based on 100 random samplings at 1%, 0.1% and 0.01% sampling rates, respectively. The lower the score, the closer the clustering result is to the reference tree. d2(S) shows the best performance under most conditions.
