Estimating the total genome length of a metagenomic sample using k-mers

Kui Hua^{1,2}, Xuegong Zhang^{3,4,5}

Affiliations

¹ MOE Key Laboratory of Bioinformatics Division and Center for Synthetic & System Biology, BNRIST, Beijing, 100084, China.
² Department of Automation, Tsinghua University, Beijing, 100084, China.
³ MOE Key Laboratory of Bioinformatics Division and Center for Synthetic & System Biology, BNRIST, Beijing, 100084, China. zhangxg@tsinghua.edu.cn.
⁴ Department of Automation, Tsinghua University, Beijing, 100084, China. zhangxg@tsinghua.edu.cn.
⁵ School of Life Sciences, Tsinghua University, Beijing, 100084, China. zhangxg@tsinghua.edu.cn.

PMID: 30967110
PMCID: PMC6456951
DOI: 10.1186/s12864-019-5467-x

Kui Hua et al. BMC Genomics. 2019 Apr 4;20(Suppl 2):183. doi: 10.1186/s12864-019-5467-x.

Abstract

Background: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. A basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient.

Results: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in the sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to bridge the gap between the count of distinct k-mers and the total genome length. We demonstrated the effectiveness of the proposed method on carefully designed simulated data covering multiple situations that arise in real metagenomic data. Results on real data indicate that the amount of uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.

Conclusions: We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.

Keywords: Distinct k-mers; Genome length; Metagenomics; Sequencing coverage.
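
The Results paragraph refers to the sequencing coverage distribution of observed k-mers and to estimating the number of distinct k-mers, including those not yet sequenced. The sketch below shows one way such a coverage histogram could be built from reads. The abstract does not describe the authors' estimator, so the extrapolation step uses the classical Chao1 lower bound purely as a stand-in. The function names, the toy reads, and the single-strand counting (no reverse complements) are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map coverage c -> number of distinct k-mers observed exactly c times."""
    return Counter(counts.values())


def chao1_distinct(hist):
    """Chao1 lower-bound estimate of the distinct k-mer count.

    A classical stand-in estimator, NOT the method proposed in the paper.
    """
    observed = sum(hist.values())
    f1, f2 = hist.get(1, 0), hist.get(2, 0)
    if f2 == 0:
        # Bias-corrected form, used when no k-mer is seen exactly twice.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 * f1 / (2 * f2)


# Toy reads (hypothetical), counted with k = 4 on one strand only.
reads = ["ACGTAC", "CGTACG", "GTACGT", "TTTTGA"]
counts = kmer_counts(reads, k=4)
hist = coverage_histogram(counts)
estimate = chao1_distinct(hist)
print(len(counts), dict(hist), estimate)
```

Here seven distinct 4-mers are observed, three of them only once, so the stand-in estimator places the distinct count somewhat above the observed one. Real data would additionally need handling of sequencing errors and reverse complements, which this sketch omits.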

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Overview of the proposed method. (a) An illustration of understanding a DNA sequence as a collection of k-mers. In this simple case, sequence length L = 12, k = 6 for the k-mer counting, *TKC* = L − k + 1 = 7, *DKC* = 5, *KRI* = *TKC*/*DKC* = 1.2. (b) Relationships between the metagenome, the metagenomic sample, and the set of distinct genomes in the metagenome. (c) Workflow of the proposed method.
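
The definitions in this caption can be sketched in a few lines of Python. TKC and DKC are read here as the total and distinct k-mer counts of a sequence, with KRI = TKC/DKC as the caption states. The helper names, the toy sequence, and the `n_sequences` parameter are hypothetical. The length conversion is simply the algebra implied by TKC = L − k + 1, not a formula quoted from the paper.

```python
def kmer_stats(sequence, k):
    """Return (TKC, DKC, KRI) for one sequence, counting one strand only."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    tkc = len(kmers)        # total k-mer count: L - k + 1
    dkc = len(set(kmers))   # distinct k-mer count
    return tkc, dkc, tkc / dkc


def length_from_dkc(dkc, kri, k, n_sequences=1):
    """Invert the definitions: TKC = KRI * DKC, and each sequence of
    length L contributes L - k + 1 k-mers (n_sequences is an assumption)."""
    return dkc * kri + n_sequences * (k - 1)


seq = "ACGTACGTACGT"  # hypothetical toy sequence with L = 12
tkc, dkc, kri = kmer_stats(seq, k=6)
recovered = length_from_dkc(dkc, kri, k=6)
print(tkc, dkc, kri, recovered)
```

For this toy sequence the repeated motif leaves 4 distinct 6-mers out of 7, and multiplying the distinct count by the KRI and adding back k − 1 recovers the original length of 12.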

**Fig. 2**
Different microbial communities are simulated to test the performance of the proposed method. (a) Results for microbial communities with 10 species. The three histograms on the left show the abundance distributions of the different simulated communities. The middle panel shows the estimates of the distinct k-mer count. Each bar represents an estimate based on one synthetic metagenomic sample, and the error bar shows the 95% bootstrap confidence interval of the estimate. The black dashed line is the true distinct k-mer count. The right panel shows how the relative error changes as the initial coverage increases (k = 20). (b) The same as (a) except that the number of species is 50. (Note that some of the samples with 10 species are not shown in the bar plot; see Additional file 1: Figure S1 for all samples with 10 species.)
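
The caption mentions 95% bootstrap confidence intervals and relative error without giving details. As a rough illustration of the mechanics only, the sketch below computes a percentile bootstrap interval by resampling reads with replacement and recounting distinct k-mers. Resampling at the read level, using the plain observed count as the statistic, and all function names are assumptions here, and the authors' procedure may differ.

```python
import random


def distinct_kmer_count(reads, k):
    """Number of distinct k-mers observed in a list of reads."""
    return len({read[i:i + k] for read in reads
                for i in range(len(read) - k + 1)})


def relative_error(estimate, truth):
    """Absolute deviation from the true value, as a fraction of it."""
    return abs(estimate - truth) / truth


def bootstrap_ci(reads, k, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the observed distinct k-mer count."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(reads)
    stats = sorted(
        distinct_kmer_count([reads[rng.randrange(n)] for _ in range(n)], k)
        for _ in range(n_boot)
    )
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy reads (hypothetical).
reads = ["ACGTACGT", "CGTACGTA", "TTGACCAT", "GACCATGA", "ACGTTTGA"]
lower, upper = bootstrap_ci(reads, k=4)
full = distinct_kmer_count(reads, k=4)
print(lower, upper, full)
```

A resample can never contain more distinct k-mers than the full read set, so this naive interval sits at or below the observed count. In practice the statistic being resampled would be the estimator of the unseen-inclusive count rather than the raw observed count.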

**Fig. 3**
(a) Performance on metagenomic data with sequencing errors. (b) True and estimated k-mer redundancy index (KRI) in different metagenomic communities. About 60% of the species are randomly chosen as the known species to estimate the KRI of all species. (c) Results for different choices of k. A simulated metagenomic sample with 50 species and a highly complex abundance distribution was used. (d) Results on HMP Tongue Dorsum datasets.

**Fig. 4**
Results on T2D metagenomic datasets. (a) Observed and estimated k-mer count. (b) Histogram and density of the observed distinct k-mer count. (c) Histogram and density of the predicted distinct k-mer count.

Cited by

Enhancing Clinical Utility: Utilization of International Standards and Guidelines for Metagenomic Sequencing in Infectious Disease Diagnosis.
Kan CM, Tsang HF, Pei XM, Ng SSM, Yim AK, Yu AC, Wong SCC. Int J Mol Sci. 2024 Mar 15;25(6):3333. doi: 10.3390/ijms25063333. PMID: 38542307. Free PMC article. Review.
