Indel seeds for homology search
- PMID: 16873491
- DOI: 10.1093/bioinformatics/btl263
Abstract
We are interested in detecting homologous genomic DNA sequences with the goal of locating approximate inverted, interspersed, and tandem repeats. Standard search techniques start by detecting small matching parts, called seeds, between a query sequence and database sequences. Contiguous seed models have existed for many years. Recently, spaced seeds were shown to be more sensitive than contiguous seeds without increasing the random hit rate. To determine the superiority of one seed model over another, a model of homologous sequence alignment must be chosen. Previous studies evaluating spaced and contiguous seeds have assumed that matches and mismatches occur within these alignments, but not insertions and deletions (indels). This is perhaps appropriate when searching for protein coding sequences (<5% of the human genome), but is inappropriate when looking for repeats in the majority of genomic sequence, where indels are common. In this paper, we assume a model of homologous sequence alignment that includes indels, and we describe a new seed model, called indel seeds, which explicitly allows indels. We present a waiting time formula for computing the sensitivity of an indel seed and show that indel seeds significantly outperform contiguous and spaced seeds when homologies include indels. We discuss the practical aspects of using indel seeds, and finally we present results from a search for inverted repeats in the dog genome using both indel and spaced seeds.
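The difference between the three seed models can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the notation is an assumption of this example: `1` marks a position that must match, `0` marks a don't-care position (match or mismatch), and `x` is a hypothetical symbol for a position that may also absorb a single-character gap in either sequence. The paper's exact definition of an indel seed may differ. The function names `seed_hits` and `indel_seed_matches` are likewise illustrative.

```python
def seed_hits(seed, a, b):
    """Offsets i where a contiguous or spaced seed hits the ungapped pair a, b.

    '1' requires a[i+k] == b[i+k]; '0' is a don't-care position.
    Both sequences are compared on the same diagonal (no indels allowed).
    """
    span = len(seed)
    n = min(len(a), len(b))
    hits = []
    for i in range(n - span + 1):
        if all(s == "0" or a[i + k] == b[i + k] for k, s in enumerate(seed)):
            hits.append(i)
    return hits


def indel_seed_matches(seed, a, i, b, j):
    """True if the seed matches starting at a[i] and b[j].

    Assumed semantics of 'x' (hypothetical notation for this sketch):
    consume one character from both sequences without comparing them,
    or consume one character from only one sequence (a single-base indel).
    """
    if not seed:
        return True
    symbol, rest = seed[0], seed[1:]
    if symbol == "x":
        both = i < len(a) and j < len(b) and indel_seed_matches(rest, a, i + 1, b, j + 1)
        gap_in_b = i < len(a) and indel_seed_matches(rest, a, i + 1, b, j)
        gap_in_a = j < len(b) and indel_seed_matches(rest, a, i, b, j + 1)
        return both or gap_in_b or gap_in_a
    if i >= len(a) or j >= len(b):
        return False
    if symbol == "1" and a[i] != b[j]:
        return False
    return indel_seed_matches(rest, a, i + 1, b, j + 1)


# A single mismatch at position 3: the spaced seed can straddle it.
print(seed_hits("1111", "ACGTACGTAC", "ACGAACGTAC"))
print(seed_hits("11011", "ACGTACGTAC", "ACGAACGTAC"))

# A single inserted base: the spaced seed finds nothing, the indel seed hits.
print(seed_hits("1110111", "ACGTTACG", "ACGTACG"))
print(indel_seed_matches("111x111", "ACGTTACG", 0, "ACGTACG", 0))
```

In the second pair of sequences the extra `T` shifts the remainder of the alignment onto a different diagonal, which is why a seed restricted to matches and mismatches misses it while the gap-tolerant position recovers the hit.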