A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Brian B Luczak, Benjamin T James, Hani Z Girgis

PMID: 29220512
PMCID: PMC6781583
DOI: 10.1093/bib/bbx161

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Brian B Luczak et al. Brief Bioinform. 2019.

. 2019 Jul 19;20(4):1222-1237.

doi: 10.1093/bib/bbx161.

Authors

Brian B Luczak, Benjamin T James, Hani Z Girgis

PMID: 29220512
PMCID: PMC6781583
DOI: 10.1093/bib/bbx161

Abstract

Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability: The source code of the benchmarking tool is available as Supplementary Materials.

Keywords: DNA sequence comparison; alignment-free k-mer statistics; k-mer histograms; paired statistics.

PubMed Disclaimer

Figures

**Figure 1**
Alignment-free k-mer statistics grouped by statistical families. These families are based on a classification by Cha [17]. This figure provides a visual representation of several alignment-free k-mer statistics from their respective families. Each member of a statistical family shares a common functional element such as a histogram dot-product (Inner Product), minimum/maximum (Intersection), Markov model (Markov family) or overarching radical (Minkowski); although a variety of statistical families exist such as the $χ^{2}$ family, many recently developed methods do not fall into any specific category, e.g. N2, DMk and EMD.

**Figure 2**
Identity score distribution for the 135 p27 sequences in the sensitivity and specificity experiment. The data set consists of 26 similar sequences ( $\geq$ 70%) and 109 dissimilar sequences (<70%). There are no sequence pairs within 5% of the cutoff line (70%), meaning that classification will be slightly easier for each statistic because the separation between similar and dissimilar sequences is clear.

**Figure 3**
Sensitivity values on the p27 data, each statistic in the left-most third identified all positive sequences correctly (26 of 26). Few statistics such as LD and Markov are not ideal for filtering out sequences that have identity scores < 70%.

**Figure 4**
Top five statistics by linear correlation: 0, 60, 70, 80 and 90% threshold. All microbiome data points, i.e. sequence pairs, with identity score greater than the threshold value were used in these experiments. We then computed the ordinary linear correlation coefficient for each statistic. The best statistic pairs were chosen based on their r² values. Note that either EMD or LD appears in all top performing statistics.

**Figure 5**
Time to compute similarities for 4735 microbiome sequence pairs. The Minkowski family overall is the most time efficient, but others are comparable; any statistics that use conditional probability such as the Markov family take considerably longer.

See this image and copyright information in PMC

References

1. Needleman SB, Wunsch CD.. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443–53.10.1016/0022-2836(70)90057-4 - DOI - PubMed
1. Zhang Z, Schwartz S, Wagner L, et al. . A greedy algorithm for aligning DNA sequences. J Comput Biol 2000;7(1–2):203–14. - PubMed
1. Yano M, Mori H, Akiyama Y, et al. . CLAST: CUDA implemented large-scale alignment search tool. BMC Bioinformatics 2014;15:406.10.1186/s12859-014-0406-y - DOI - PMC - PubMed
1. Altschul SF, Gish W, Miller W, et al. . Basic alignment search tool. J Mol Biol 1990;215(3):403–10.10.1016/S0022-2836(05)80360-2 - DOI - PubMed
1. Sims GE, Jun SR, Wu GA, et al. . Alignment-free genome comparison with feature frequency profiles (ffp) and optimal resolutions. Proc Natl Acad Sci USA 2009;106(8):2677–82.10.1073/pnas.0813249106 - DOI - PMC - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations
Research Materials
- NCI CPTC Antibody Characterization Program
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Authors

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Research Materials

Miscellaneous