Non-parametric and semi-parametric support estimation using SEquential RESampling random walks on biomolecular sequences

Wei Wang¹, Jack Smith¹, Hussein A Hejase², Kevin J Liu¹

Affiliations

¹ Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA.
² Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA.

PMID: 32322294
PMCID: PMC7164268
DOI: 10.1186/s13015-020-00167-0

Wei Wang et al. Algorithms Mol Biol. 2020 Apr 16;15:7. doi: 10.1186/s13015-020-00167-0. eCollection 2020.

Abstract

Non-parametric and semi-parametric resampling procedures are widely used to perform support estimation in computational biology and bioinformatics. Among the most widely used methods in this class is the standard bootstrap method, which consists of random sampling with replacement. While not requiring assumptions about any particular parametric model for resampling purposes, the bootstrap and related techniques assume that sites are independent and identically distributed (i.i.d.). The i.i.d. assumption can be an over-simplification for many problems in computational biology and bioinformatics. In particular, sequential dependence within biomolecular sequences is often an essential biological feature due to biochemical function, evolutionary processes such as recombination, and other factors. To relax the simplifying i.i.d. assumption, we propose a new non-parametric/semi-parametric sequential resampling technique that generalizes "Heads-or-Tails" mirrored inputs, a simple but clever technique due to Landan and Graur. The generalized procedure takes the form of random walks along either aligned or unaligned biomolecular sequences. We refer to our new method as the SERES (or "SEquential RESampling") method. To demonstrate the performance of the new technique, we apply SERES to estimate support for the multiple sequence alignment problem. Using simulated and empirical data, we show that SERES-based support estimation yields performance comparable to, and typically better than, that of state-of-the-art methods.
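
To make the i.i.d. assumption concrete, the standard bootstrap on aligned sequences can be sketched as drawing alignment columns (sites) uniformly with replacement. This is a minimal illustration and not the authors' code; the function name `bootstrap_columns` and the toy alignment are hypothetical.

```python
import random

def bootstrap_columns(alignment, rng=None):
    """Resample alignment columns (sites) uniformly with replacement."""
    rng = rng or random.Random()
    n_sites = len(alignment[0])
    # Each site is drawn independently of the others, so neighboring
    # columns are separated and sequential dependence among sites is lost.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(row[c] for c in cols) for row in alignment]

toy = ["ACGT-A", "ACGTTA", "AC-TTA"]  # hypothetical aligned sequences
replicate = bootstrap_columns(toy, random.Random(0))
```

Because every column is drawn independently, the replicate keeps each site's content but discards the order of sites, which is the simplification a sequential resampling scheme is meant to relax.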

Keywords: Bootstrap; Multiple sequence alignment; Non-parametric; Random walk; Resampling; Semi-parametric; Statistical support.

Conflict of interest statement

Competing interests: The authors declare that they have no competing interests.

Figures

**Fig. 1**
Illustrated example of a SERES resampling random walk on unaligned sequences. Detailed pseudocode is provided in Additional file 1: Additional methods section: Algorithm 2. (a) The resampling procedure begins with the estimation of a consensus alignment on the input set of unaligned sequences. (b) A set of conservative anchors is then obtained using the consensus alignment, and anchor boundaries define a set of barriers (including two trivial barriers: one at the start of the sequences and one at the end of the sequences). (c) The SERES random walk is conducted on the set of barriers. The walk begins at a random barrier and proceeds in a random direction to the neighboring barrier. The walk reverses with certainty when the trivial start/end barriers are encountered; furthermore, the walk direction can randomly reverse with probability $\gamma$. As the walk proceeds from barrier to barrier, unaligned sequences are sampled between neighboring barrier pairs. (d) The resampling procedure terminates when the resampled sequences meet a specified sequence length threshold.
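
The walk described in this caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' Algorithm 2: the function name `seres_walk`, the per-sequence barrier coordinates, the choice to reverse segments traversed right-to-left, and the stopping rule (every resampled sequence reaches `min_length`) are all assumptions made for the example. Anchor estimation from a consensus alignment is not shown.

```python
import random

def seres_walk(seqs, barriers, gamma, min_length, rng=None):
    """Sketch of a SERES-style random walk over barriers.

    seqs:       list of unaligned sequences (strings)
    barriers:   barriers[k][i] is the position of barrier k in sequence i;
                the first and last entries are the trivial start/end barriers
    gamma:      probability of a random direction reversal at each step
    min_length: stop once every resampled sequence is at least this long
    """
    rng = rng or random.Random()
    if len(barriers) < 2 or any(len(s) == 0 for s in seqs):
        raise ValueError("need at least two barriers and non-empty sequences")
    last = len(barriers) - 1
    pos = rng.randint(0, last)       # random starting barrier
    step = rng.choice((-1, 1))       # random starting direction
    pieces = [[] for _ in seqs]
    lengths = [0] * len(seqs)
    while min(lengths) < min_length:
        if rng.random() < gamma:
            step = -step             # random reversal with probability gamma
        if pos + step < 0 or pos + step > last:
            step = -step             # certain reversal at a trivial barrier
        nxt = pos + step
        lo, hi = min(pos, nxt), max(pos, nxt)
        for i, seq in enumerate(seqs):
            seg = seq[barriers[lo][i]:barriers[hi][i]]
            if step < 0:
                seg = seg[::-1]      # assumption: backward steps read right-to-left
            pieces[i].append(seg)
            lengths[i] += len(seg)
        pos = nxt
    return ["".join(p) for p in pieces]

# Hypothetical toy input: two unaligned sequences and four barriers.
seqs = ["ACGTACGTAC", "ACGTTCGAC"]
barriers = [(0, 0), (4, 4), (7, 6), (10, 9)]
resampled = seres_walk(seqs, barriers, gamma=0.1, min_length=20,
                       rng=random.Random(1))
```

Because whole segments between neighboring barriers are copied, adjacent characters stay adjacent in the resampled sequences; the output may overshoot `min_length` by up to one segment.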

**Fig. 2**
SERES + GUIDANCE2 performance using different choices for anchor length. Results are shown for five 10-taxon medium-gap-length model conditions (named 10.A through 10.E in order of generally increasing sequence divergence). We evaluated the performance of SERES + GUIDANCE2 where anchor length in bp was either 3, 5, 10, 30, or 50. We calculated each method’s precision-recall (PR) and receiver operating characteristic (ROC) curves. Performance is evaluated based upon aggregate area under curve (AUC) across all replicates for a model condition ($n = 20$).

**Fig. 3**
SERES + GUIDANCE2 performance using different choices for the number of anchors. We evaluated the performance of SERES + GUIDANCE2 where the number of anchors used was either 3, 5, 20, 50, or 100. Otherwise, figure layout and description are identical to Fig. 2.
