WindowMasker: window-based masker for sequenced genomes
- PMID: 16287941
- DOI: 10.1093/bioinformatics/bti774
Abstract
Motivation: Matches to repetitive sequences are usually undesirable in the output of DNA database searches. Such matches can be avoided if the repetitive sequences are masked in the database. RepeatMasker/Maskeraid (RM), currently the most widely used software for DNA sequence masking, is slow and requires a library of repetitive template sequences, such as a manually curated RepBase library, that may not exist for newly sequenced genomes.
Results: We have developed a software tool called WindowMasker (WM) that identifies and masks highly repetitive DNA sequences in a genome, using only the sequence of the genome itself. WM is orders of magnitude faster than RM because WM uses a few linear-time scans of the genome sequence, rather than local alignment methods that compare each library sequence with each piece of the genome. We validate WM by comparing BLAST outputs from large sets of queries applied to two versions of the same genome, one masked by WM, and the other masked by RM. Even for genomes such as the human genome, where a good RepBase library is available, searching the database as masked with WM yields more matches that are apparently non-repetitive and fewer matches to repetitive sequences. We show that these results hold for transcribed regions as well. WM also performs well on genomes for which much of the sequence was in draft form at the time of the analysis.
Availability: WM is included in the NCBI C++ toolkit. The source code for the entire toolkit is available at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools++/CURRENT/. Once the toolkit source is unpacked, instructions for building the WindowMasker application in a UNIX environment can be found in the file src/app/winmasker/README.build.
Supplementary information: Supplementary data are available at ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
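Illustrative sketch: the abstract says only that WM relies on a few linear-time scans of the genome and needs no repeat library; it does not give the scoring details. The toy Python below is one way such a library-free, two-scan masker could look, assuming a simple k-mer counting scheme. The function names (`count_kmers`, `mask_repeats`) and the parameters `k`, `window`, and `threshold` are hypothetical choices for illustration and are not WindowMasker's actual interface, scoring function, or defaults.

```python
from collections import Counter


def count_kmers(seq, k):
    """First linear scan: count every overlapping k-mer in the sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mask_repeats(seq, k=4, window=3, threshold=3.0):
    """Second linear scan: soft-mask windows whose mean k-mer count is high.

    A window spans `window` consecutive k-mers (window + k - 1 bases).
    Masked bases are returned in lowercase, unmasked bases in uppercase.
    All parameter values here are illustrative assumptions.
    """
    counts = count_kmers(seq, k)
    upper = seq.upper()
    n_kmers = len(upper) - k + 1
    if n_kmers < window:
        return seq  # too short to score even one window

    # Score of each position = genome-wide count of the k-mer starting there.
    scores = [counts[upper[i:i + k]] for i in range(n_kmers)]

    masked = [False] * len(seq)
    running = sum(scores[:window])  # running sum keeps the scan linear
    for start in range(n_kmers - window + 1):
        if start > 0:
            running += scores[start + window - 1] - scores[start - 1]
        if running / window >= threshold:
            for j in range(start, start + window + k - 1):
                masked[j] = True

    return "".join(c.lower() if m else c.upper() for c, m in zip(seq, masked))


if __name__ == "__main__":
    genome = "GCTAGTCA" + "AC" * 12 + "TGGATCCG"
    print(mask_repeats(genome))
```

In this sketch the tandem AC run is over-represented among the k-mers and is therefore lowercased, while the unique flanking sequence is left in uppercase. Both passes touch each position a constant number of times, which is the property the abstract credits for WM's speed relative to alignment against a repeat library.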