HomologMiner: looking for homologous genomic groups in whole genomes
- PMID: 17308341
- DOI: 10.1093/bioinformatics/btm048
Abstract
Motivation: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a major evolutionary mechanism for acquiring new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene encoding a protein domain.
Results: We developed a program called HomologMiner that identifies homologous groups in genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). The groups obtained include gene families (e.g. the olfactory receptor gene family and zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide a preliminary analysis of the output on the three genomes mentioned above, and show several applications, including identifying the boundaries of tandem gene clusters and novel interspersed repeat families.
Availability: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.
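The abstract requires that input genomes already be marked for low-complexity repeats and annotated interspersed repeats, but it does not state the marking format. As a minimal sketch, assuming the common soft-masking convention (repeats in lowercase, unknown bases as N), the Python function below lists the intervals that would remain eligible for a homology search. The function name `unmasked_intervals` and the `min_len` parameter are hypothetical and are not part of HomologMiner.

```python
from typing import Iterator, Tuple


def unmasked_intervals(seq: str, min_len: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals of uppercase, non-N bases.

    Assumption: repeats are soft-masked (lowercase) and gaps are 'N'.
    This is an illustrative preprocessing step, not the authors' code.
    """
    start = None
    for i, base in enumerate(seq):
        keep = base.isupper() and base != "N"
        if keep and start is None:
            start = i  # open a new unmasked run
        elif not keep and start is not None:
            if i - start >= min_len:
                yield (start, i)  # close the run at the first masked base
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(seq) - start >= min_len:
        yield (start, len(seq))


example = "ACGTacgtacgtGGCCNNTTAA"
print(list(unmasked_intervals(example)))
# prints [(0, 4), (12, 16), (18, 22)]
```

Raising `min_len` discards short unmasked fragments between repeats, which may be useful when such fragments are too short to represent a biological unit.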