Sci Data. 2016 Sep 27;3:160081.
doi: 10.1038/sdata.2016.81.

Next generation sequencing data of a defined microbial mock community

Esther Singer et al. Sci Data. 2016.

Abstract

Generating sequence data of a defined community composed of organisms with complete reference genomes is indispensable for benchmarking new genome sequence analysis methods, including assembly and binning tools. Moreover, the validation of new sequencing library protocols and platforms, which assesses critical components such as sequencing errors and biases, relies on such datasets. Here we report next-generation metagenomic sequence data of a defined mock community (Mock Bacteria ARchaea Community; MBARC-26), composed of 23 bacterial and 3 archaeal strains with finished genomes. These strains span 10 phyla and 14 classes, cover a range of GC contents, genome sizes, and repeat content, and encompass a diverse abundance profile. Short-read Illumina and long-read PacBio SMRT sequences of this mock community are described. These data represent a valuable resource for the scientific community, enabling extensive benchmarking and comparative evaluation of bioinformatics tools without the need to simulate data. As such, these data can aid in improving the current sequence data analysis toolkit and spur interest in the development of new tools.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Figure 1. Characteristics of MBARC-26 community.
Community members display diversity in phylogenetic distribution and relatedness (a), genome size (b), GC content (c), and repeat content normalized by genome size (d). Shades of the same color in (a) denote the same phylum association: green, Proteobacteria; blue, Actinobacteria; purple, Firmicutes; yellow, Euryarchaeota.
Figure 2. MBARC-26 community composition and relative abundance distribution, based on Illumina and PacBio read mapping and mean DNA molarity.
Mock community members are grouped and arranged in order of % mapped sequences (Illumina). The observed discrepancy between molarity and % mapped PacBio and Illumina sequences in T. composti is likely due to contamination, as T. composti was previously found to occur as a laboratory contaminant in various shotgun metagenome datasets (unpublished data). The smaller discrepancies are expected due to spread in DNA quantification and platform biases. Colors denote phylum association as defined in Fig. 1.
Figure 3. Quantitative comparison of MBARC-26 Illumina and PacBio shotgun sequence datasets.
(a) Community representation according to % mapped sequences for each mock community member in the PacBio (x-axis) and Illumina (y-axis) shotgun sequence datasets. (b) Percent chromosome coverage and fold coverage of each mock community genome by sequencing platform using unassembled sequences. Colors denote phylum association as defined in Fig. 1.
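
The two quantities named in these captions, % mapped sequences and fold coverage, can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes that % mapped is computed relative to the total number of mapped reads and that fold coverage is mapped bases divided by genome size. The class, the function names, and the numbers in the example are hypothetical placeholders, not values from the dataset.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Per-genome mapping summary (hypothetical structure for illustration)."""
    name: str
    genome_size_bp: int   # length of the finished reference genome
    mapped_reads: int     # reads assigned to this genome
    mapped_bases: int     # total bases in those reads


def percent_mapped(stats):
    """Return each genome's share of all mapped reads, in percent.

    Assumption: the denominator is the sum of mapped reads across genomes,
    not the total number of sequenced reads.
    """
    total = sum(s.mapped_reads for s in stats)
    if total == 0:
        raise ValueError("no mapped reads")
    return {s.name: 100.0 * s.mapped_reads / total for s in stats}


def fold_coverage(s):
    """Return mean fold coverage: mapped bases divided by genome size."""
    return s.mapped_bases / s.genome_size_bp


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
example = [
    GenomeStats("genome_A", genome_size_bp=1_500_000,
                mapped_reads=300, mapped_bases=4_500_000),
    GenomeStats("genome_B", genome_size_bp=2_000_000,
                mapped_reads=100, mapped_bases=1_000_000),
]

for name, pct in percent_mapped(example).items():
    print(f"{name}: {pct:.1f}% of mapped reads")
for s in example:
    print(f"{s.name}: {fold_coverage(s):.2f}x fold coverage")
```

Percent chromosome coverage (the fraction of reference positions covered by at least one read) is not computed here, because it needs per-position depth rather than summary counts.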

Dataset use reported in

  • doi: 10.1186/s12864-015-2063-6
  • doi: 10.1038/ismej.2015.249
  • doi: 10.1093/bioinformatics/btw144

References

Data Citations

    1. 2016. NCBI Sequence Read Archive. SRX1836716
    2. 2016. NCBI Sequence Read Archive. SRX1836715
