Synth Syst Biotechnol. 2019 Aug 31;4(3):150-156.
doi: 10.1016/j.synbio.2019.08.001. eCollection 2019 Sep.

The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer

Guan-Da Huang et al. Synth Syst Biotechnol.

Erratum in

  • Erratum regarding previously published articles.
    [No authors listed] Synth Syst Biotechnol. 2020 Oct 14;5(4):332. doi: 10.1016/j.synbio.2020.10.004. eCollection 2020 Dec. PMID: 33102828.

Abstract

Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase in sequencing depth, hundreds of thousands of contigs are routinely assembled in metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT with k-mer statistics is therefore an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, which subsample metagenome contigs by their representative regions and summarize the regional $D_2^{S}$ and $D_2^{*}$ metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of $T_{sum}^{S}$ and $T_{sum}^{*}$ increases with sequencing coverage and reaches a maximum of more than 80% at k = 6, at a 5% Type-I error rate and a coverage ratio above 0.2x. The statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
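The power figures quoted in the abstract (for example, more than 80% at a 5% Type-I error rate) can be read as a Monte Carlo estimate: fix a rejection threshold from scores of background pairs, then count how often foreground (HGT) pairs exceed it. The sketch below illustrates that calculation only. The function name `empirical_power` and the Gaussian stand-in scores are assumptions for illustration, not the authors' code or data.

```python
import random


def empirical_power(null_scores, alt_scores, alpha=0.05):
    """Estimate power at Type-I error rate alpha from simulated scores.

    null_scores: statistic values for background pairs (no transfer).
    alt_scores:  statistic values for foreground pairs (with transfer).
    Returns (power, threshold).
    """
    ranked = sorted(null_scores)
    # Empirical (1 - alpha) quantile of the null; only scores strictly
    # above it are rejected, so at most alpha of the null is rejected.
    cutoff_index = min(len(ranked) - 1, int((1 - alpha) * len(ranked)))
    threshold = ranked[cutoff_index]
    rejected = sum(score > threshold for score in alt_scores)
    return rejected / len(alt_scores), threshold


# Stand-in scores (hypothetical): in practice these would be values of the
# aggregative statistic on simulated background and foreground pairs.
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
alt_scores = [rng.gauss(3.0, 1.0) for _ in range(10_000)]
power, threshold = empirical_power(null_scores, alt_scores)
print(f"threshold={threshold:.3f} power={power:.3f}")
```

Using the strict inequality keeps the empirical Type-I error at or below `alpha` even when null scores tie at the threshold.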

Keywords: Alignment-free sequence comparison; Horizontal gene transfer; Statistical power; k-mer.

Figures

Fig. 1
The $T_{sum}$ resampling scheme. (On Seq X and Seq Y, F1 to FN are the subsampled fragments and G is the gap. $X_i^{S}$ is the maximum of $D_2^{S}$ between the i-th subsampled fragment of X and all subsampled fragments of Y; likewise, $Y_i^{S}$ is the maximum of $D_2^{S}$ between the i-th subsampled fragment of Y and all subsampled fragments of X.)
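The resampling scheme in Fig. 1 can be sketched in a few lines. The `d2s` function below follows the usual definition of the D2S statistic from the alignment-free literature under an i.i.d. background (centered k-mer counts, each product normalized by the root of the summed squares). Two parts are assumptions made for illustration and are not confirmed by this page: fragments are taken at evenly spaced offsets (`fragments`), and the per-fragment maxima are combined by simple summation (`t_sum_s`).

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2s(x, y, k, base_prob):
    """D2S between x and y under an i.i.d. background with base_prob."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nx, ny = len(x) - k + 1, len(y) - k + 1
    total = 0.0
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        p_word = prod(base_prob[b] for b in word)
        xc = cx[word] - nx * p_word  # centered count in x
        yc = cy[word] - ny * p_word  # centered count in y
        norm = sqrt(xc * xc + yc * yc)
        if norm > 0:
            total += xc * yc / norm
    return total


def fragments(seq, n_frag, frag_len):
    """Evenly spaced fragments F1..FN separated by gaps (assumed layout)."""
    if n_frag == 1:
        return [seq[:frag_len]]
    step = (len(seq) - frag_len) // (n_frag - 1)
    return [seq[i * step:i * step + frag_len] for i in range(n_frag)]


def t_sum_s(x, y, k, base_prob, n_frag, frag_len):
    """Sum of per-fragment maxima X_i^S and Y_i^S (assumed combination rule)."""
    fx, fy = fragments(x, n_frag, frag_len), fragments(y, n_frag, frag_len)
    scores = [[d2s(a, b, k, base_prob) for b in fy] for a in fx]
    x_max = [max(row) for row in scores]        # X_i^S for each fragment of X
    y_max = [max(col) for col in zip(*scores)]  # Y_i^S for each fragment of Y
    return sum(x_max) + sum(y_max)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
seq_x = "ACGTTGCAAGCTTAGCCGATATCGGCTAACGTAGCTAGGATCC"
seq_y = "TTGACCGATATCGGCATGCAAGTCCGATATCGGTTACGGATCA"
print(t_sum_s(seq_x, seq_y, k=2, base_prob=uniform, n_frag=3, frag_len=12))
```

Because D2S is symmetric in its two arguments, one score matrix suffices: row maxima give the values for fragments of X and column maxima give the values for fragments of Y.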
Fig. 2
Simulation of foreground sequence pairs using the horizontal gene transfer (HGT) procedure with motif length L = 5.
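The caption does not spell out the simulation procedure, so the following is only one plausible reading, written as a minimal sketch: draw two independent i.i.d. background sequences, then overwrite a few random positions in both with copies of the same motif of length L. The base probabilities are the ones given in the captions of Figs. 3 to 5. The function names and the `n_copies` parameter are hypothetical.

```python
import random

# Base probabilities as stated in the figure captions.
UNIFORM = {"A": 1 / 4, "C": 1 / 4, "G": 1 / 4, "T": 1 / 4}
GC_RICH = {"A": 1 / 6, "C": 1 / 3, "G": 1 / 3, "T": 1 / 6}


def random_sequence(length, base_prob, rng):
    """Draw an i.i.d. sequence with the given base probabilities."""
    bases = list(base_prob)
    weights = [base_prob[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def plant_motif(seq, motif, n_copies, rng):
    """Overwrite n_copies random positions of seq with motif (length kept)."""
    chars = list(seq)
    for _ in range(n_copies):
        start = rng.randrange(len(seq) - len(motif) + 1)
        chars[start:start + len(motif)] = motif
    return "".join(chars)


def simulate_pair(length, base_prob, motif_len, n_copies, rng, transfer=True):
    """Return (x, y, motif); motif is None for a background pair."""
    x = random_sequence(length, base_prob, rng)
    y = random_sequence(length, base_prob, rng)
    if not transfer:
        return x, y, None
    motif = random_sequence(motif_len, base_prob, rng)
    # Assumed transfer model: both sequences receive copies of one motif.
    return (plant_motif(x, motif, n_copies, rng),
            plant_motif(y, motif, n_copies, rng),
            motif)


rng = random.Random(1)
fg_x, fg_y, motif = simulate_pair(400, GC_RICH, motif_len=5, n_copies=3, rng=rng)
bg_x, bg_y, _ = simulate_pair(400, GC_RICH, motif_len=5, n_copies=3, rng=rng,
                              transfer=False)
print(motif, motif in fg_x, motif in fg_y)
```

Planted copies may overlap one another, but the last copy written into each sequence is always intact, so every foreground pair shares at least one exact occurrence of the motif.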
Fig. 3
Statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ at coverage rates R = 25%, 50%, and 75%. (Full length is the length of the whole sequence, R is the subsampling coverage rate, k is the k-mer length, and L is the motif length. The base probabilities are P(A) = P(C) = P(G) = P(T) = 1/4 in the e.i.i.d. model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in the n.i.i.d. (GC-rich) model.)
Fig. 4
Statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ at k = 4, 5, 6, 7, and 8. (E is the entire genome segment length, R is the subsampling coverage rate, k is the k-mer length, and L is the motif length. The base probabilities are P(A) = P(C) = P(G) = P(T) = 1/4 in the e.i.i.d. model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in the n.i.i.d. (GC-rich) model.)
Fig. 5
Statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ at E = 400, 600, 800, and 1000. (E is the entire genome segment length, R is the subsampling coverage rate, k is the k-mer length, and L is the motif length. The base probabilities are P(A) = P(C) = P(G) = P(T) = 1/4 in the e.i.i.d. model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in the n.i.i.d. (GC-rich) model.)
