Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Jul 19;20(4):1222-1237.
doi: 10.1093/bib/bbx161.

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Brian B Luczak et al. Brief Bioinform. .

Abstract

Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability: The source code of the benchmarking tool is available as Supplementary Materials.

Keywords: DNA sequence comparison; alignment-free k-mer statistics; k-mer histograms; paired statistics.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Alignment-free k-mer statistics grouped by statistical families. These families are based on a classification by Cha [17]. This figure provides a visual representation of several alignment-free k-mer statistics from their respective families. Each member of a statistical family shares a common functional element such as a histogram dot-product (Inner Product), minimum/maximum (Intersection), Markov model (Markov family) or overarching radical (Minkowski); although a variety of statistical families exist such as the χ2 family, many recently developed methods do not fall into any specific category, e.g. N2, DMk and EMD.
Figure 2
Figure 2
Identity score distribution for the 135 p27 sequences in the sensitivity and specificity experiment. The data set consists of 26 similar sequences (70%) and 109 dissimilar sequences (<70%). There are no sequence pairs within 5% of the cutoff line (70%), meaning that classification will be slightly easier for each statistic because the separation between similar and dissimilar sequences is clear.
Figure 3
Figure 3
Sensitivity values on the p27 data, each statistic in the left-most third identified all positive sequences correctly (26 of 26). Few statistics such as LD and Markov are not ideal for filtering out sequences that have identity scores < 70%.
Figure 4
Figure 4
Top five statistics by linear correlation: 0, 60, 70, 80 and 90% threshold. All microbiome data points, i.e. sequence pairs, with identity score greater than the threshold value were used in these experiments. We then computed the ordinary linear correlation coefficient for each statistic. The best statistic pairs were chosen based on their r2 values. Note that either EMD or LD appears in all top performing statistics.
Figure 5
Figure 5
Time to compute similarities for 4735 microbiome sequence pairs. The Minkowski family overall is the most time efficient, but others are comparable; any statistics that use conditional probability such as the Markov family take considerably longer.

References

    1. Needleman SB, Wunsch CD.. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443–53.10.1016/0022-2836(70)90057-4 - DOI - PubMed
    1. Zhang Z, Schwartz S, Wagner L, et al. . A greedy algorithm for aligning DNA sequences. J Comput Biol 2000;7(1–2):203–14. - PubMed
    1. Yano M, Mori H, Akiyama Y, et al. . CLAST: CUDA implemented large-scale alignment search tool. BMC Bioinformatics 2014;15:406.10.1186/s12859-014-0406-y - DOI - PMC - PubMed
    1. Altschul SF, Gish W, Miller W, et al. . Basic alignment search tool. J Mol Biol 1990;215(3):403–10.10.1016/S0022-2836(05)80360-2 - DOI - PubMed
    1. Sims GE, Jun SR, Wu GA, et al. . Alignment-free genome comparison with feature frequency profiles (ffp) and optimal resolutions. Proc Natl Acad Sci USA 2009;106(8):2677–82.10.1073/pnas.0813249106 - DOI - PMC - PubMed

Publication types

MeSH terms