The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer

Guan-Da Huang¹, Xue-Mei Liu¹, Tian-Lai Huang¹, Li-C Xia²

Affiliations

¹ School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China.
² Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA.

PMID: 31508512
PMCID: PMC6723412
DOI: 10.1016/j.synbio.2019.08.001

Synth Syst Biotechnol. 2019 Aug 31;4(3):150-156. doi: 10.1016/j.synbio.2019.08.001. eCollection 2019 Sep.

Erratum in

Erratum regarding previously published articles. [No authors listed]. Synth Syst Biotechnol. 2020 Oct 14;5(4):332. doi: 10.1016/j.synbio.2020.10.004. eCollection 2020 Dec. PMID: 33102828.

Abstract

Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase in sequencing depth, hundreds of thousands of contigs are routinely assembled in metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT with k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, which subsample metagenome contigs by their representative regions and summarize the regional $D_{2}^{S}$ and $D_{2}^{*}$ metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of $T_{sum}^{S}$ and $T_{sum}^{*}$ increases with sequencing coverage and reaches a maximum above 80% at k = 6, at a 5% Type-I error rate and a coverage ratio above 0.2x. The statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ was evaluated with realistic simulations of the HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
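The regional $D_{2}^{S}$ and $D_{2}^{*}$ metrics mentioned above can be illustrated with a minimal Python sketch. It uses the commonly cited definitions from the alignment-free literature (centered k-mer counts under an i.i.d. background), which the abstract itself does not spell out, so the exact normalization is an assumption. The names `kmer_counts`, `d2_stats`, and `base_prob` are illustrative and are not taken from the paper.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_stats(x, y, k, base_prob):
    """Return (D2S, D2*) for sequences x and y.

    Assumes an i.i.d. background with per-base probabilities base_prob,
    so the expected count of word w in x is n_x * p_w.
    """
    nx, ny = len(x) - k + 1, len(y) - k + 1
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    d2s = 0.0
    d2star = 0.0
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        pw = prod(base_prob[b] for b in w)  # word probability under i.i.d.
        xt = cx[w] - nx * pw                # centered count in x
        yt = cy[w] - ny * pw                # centered count in y
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            d2s += xt * yt / denom
        # Assumed normalization for unequal lengths: sqrt(n_x * n_y) * p_w
        d2star += xt * yt / (sqrt(nx * ny) * pw)
    return d2s, d2star
```

For example, `d2_stats("ACGTACGT", "ACGTTGCA", 2, {b: 0.25 for b in "ACGT"})` returns the pair of scores for k = 2 under uniform base frequencies. Enumerating all 4^k words keeps the sketch short; for larger k one would iterate only over observed words and handle the unobserved ones in closed form.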

Keywords: Alignment-free sequence comparison; Horizontal gene transfer; Statistical power; k-mer.

Figures

**Fig. 1**
The $T_{sum}$ resampling scheme. On Seq $X$ and Seq $Y$, $F_{1}$ to $F_{N}$ are the subsampled fragments and G is the gap between them. $X_{i}^{S}$ is the maximum of $D_{2}^{S}$ between the $i$th subsampled fragment of $X$ and all subsampled fragments of $Y$. Likewise, $Y_{i}^{S}$ is the maximum of $D_{2}^{S}$ between the $i$th subsampled fragment of $Y$ and all subsampled fragments of $X$.
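The scheme in this caption can be sketched in Python as follows. The per-fragment maxima follow the caption. The final step, a plain sum of all maxima, is an assumption, since the caption does not state how the maxima are combined. The fixed fragment length and gap, and the names `d2s`, `subsample`, and `t_sum_s`, are likewise illustrative.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def d2s(x, y, k, base_prob):
    """D2S between x and y under an assumed i.i.d. background."""
    nx, ny = len(x) - k + 1, len(y) - k + 1
    cx = Counter(x[i:i + k] for i in range(nx))
    cy = Counter(y[i:i + k] for i in range(ny))
    total = 0.0
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        pw = prod(base_prob[b] for b in w)
        xt, yt = cx[w] - nx * pw, cy[w] - ny * pw  # centered counts
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            total += xt * yt / denom
    return total


def subsample(seq, frag_len, gap):
    """Take fragments of frag_len bases, skipping gap bases between them."""
    step = frag_len + gap
    return [seq[i:i + frag_len]
            for i in range(0, len(seq) - frag_len + 1, step)]


def t_sum_s(x, y, k, frag_len, gap, base_prob):
    """Aggregate regional D2S values by their per-fragment maxima."""
    fx = subsample(x, frag_len, gap)
    fy = subsample(y, frag_len, gap)
    # X_i^S: best match of the i-th fragment of x against all fragments of y
    x_max = [max(d2s(a, b, k, base_prob) for b in fy) for a in fx]
    # Y_i^S: best match of the i-th fragment of y against all fragments of x
    y_max = [max(d2s(b, a, k, base_prob) for a in fx) for b in fy]
    # Assumption: the maxima are combined by a plain sum
    return sum(x_max) + sum(y_max)
```

Swapping `d2s` for a $D_{2}^{*}$ function would give the corresponding sketch of $T_{sum}^{*}$. The cost is quadratic in the number of fragments, which is the price of comparing every fragment of one sequence against every fragment of the other.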

**Fig. 2**
Simulating the foreground sequence pairs using the Horizontal Gene Transfer (HGT) procedure with motif length L = 5.
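As a rough illustration of what such a simulation might look like, the sketch below draws two i.i.d. background sequences and writes copies of one shared motif into both. This is only one plausible reading of the caption: the actual HGT procedure, the number of motif copies, and the placement rule are not given here, so `n_copies`, the overwrite-in-place rule, and all function names are assumptions. The two base-frequency tables use the probabilities quoted in the Fig. 3 to Fig. 5 captions.

```python
import random

# Base frequencies quoted in the figure captions
E_IID = {"A": 1 / 4, "C": 1 / 4, "G": 1 / 4, "T": 1 / 4}
GC_RICH = {"A": 1 / 6, "C": 1 / 3, "G": 1 / 3, "T": 1 / 6}


def random_seq(length, base_prob, rng):
    """Draw an i.i.d. sequence with the given base probabilities."""
    bases = list(base_prob)
    weights = [base_prob[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def insert_motif(seq, motif, n_copies, rng):
    """Overwrite n_copies random positions with motif (length preserved).

    Later copies may overlap earlier ones; the last copy is always intact.
    """
    chars = list(seq)
    for _ in range(n_copies):
        start = rng.randrange(len(seq) - len(motif) + 1)
        chars[start:start + len(motif)] = motif
    return "".join(chars)


def foreground_pair(length, motif_len, n_copies, base_prob, seed=0):
    """Hypothetical foreground pair: two sequences sharing one motif."""
    rng = random.Random(seed)
    motif = random_seq(motif_len, base_prob, rng)
    x = insert_motif(random_seq(length, base_prob, rng), motif, n_copies, rng)
    y = insert_motif(random_seq(length, base_prob, rng), motif, n_copies, rng)
    return x, y, motif
```

A call such as `foreground_pair(400, 5, 10, GC_RICH)` would produce a pair of length-400 sequences sharing a motif of length L = 5. Background (null) pairs can be drawn with `random_seq` alone.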

**Fig. 3**
Statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ when the coverage rate R = 25%, 50%, 75%. (Two statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, are used in this figure. Full length is the length of the whole sequence, R is the subsampling coverage rate, k is the k-mer length, and L is the motif length. The base probabilities are P(A) = P(C) = P(G) = P(T) = 1/4 in the e.i.i.d model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in the *n.i.i.d* (GC-rich) model.)

**Fig. 4**
Statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ when k = 4, 5, 6, 7, 8. (Two statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, are used in this figure. E is the entire genome segment length, R is the subsampling coverage rate, k is the k-mer length, and L is the motif length. The base probabilities are P(A) = P(C) = P(G) = P(T) = 1/4 in the e.i.i.d model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in the *n.i.i.d* (GC-rich) model.)

**Fig. 5**
Statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ when E = 400, 600, 800, 1000. (Two statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, are used in this figure. E is the entire genome segment length, R is the subsampling coverage rate, k is the k-mer length, and L is the motif length. The base probabilities are P(A) = P(C) = P(G) = P(T) = 1/4 in the e.i.i.d model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in the *n.i.i.d* (GC-rich) model.)
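Power figures of this kind are typically obtained empirically: the rejection threshold is set from scores on background (null) pairs at the chosen Type-I error rate, and power is the fraction of foreground scores beyond it. The sketch below shows that generic procedure. It assumes that larger scores indicate shared sequence and uses a simple empirical quantile; neither detail is confirmed by the captions, and `empirical_power` is an illustrative name.

```python
def empirical_power(null_scores, alt_scores, alpha=0.05):
    """Estimate power at Type-I error rate alpha from simulated scores.

    null_scores: statistic values on background pairs (no transfer)
    alt_scores:  statistic values on foreground pairs (with transfer)
    Assumes larger values are evidence for the alternative.
    """
    ordered = sorted(null_scores)
    # Index of the empirical (1 - alpha) quantile, clamped to the last item
    idx = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    threshold = ordered[idx]
    rejected = sum(score > threshold for score in alt_scores)
    return rejected / len(alt_scores)
```

In use, `null_scores` and `alt_scores` would each hold the statistic computed on many simulated sequence pairs, one list per model and parameter setting (k, R, E, L).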
