A novel hierarchical clustering algorithm for gene sequences

Dan Wei et al. BMC Bioinformatics. 2012 Jul 23;13:174. doi: 10.1186/1471-2105-13-174.

Comparative Study

Abstract

Background: Clustering DNA sequences into functional groups is an important problem in bioinformatics. We propose a new alignment-free algorithm, mBKM, based on a new distance measure, DMk, for clustering gene sequences. The method transforms DNA sequences into feature vectors that capture the occurrence, location, and order relation of k-tuples in each sequence. A hierarchical procedure is then applied to cluster the sequences based on these feature vectors.
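The general idea of turning sequences into k-tuple feature vectors and clustering them hierarchically can be illustrated with a minimal sketch. This is not the authors' DMk measure or the mBKM algorithm, whose details the abstract does not give: the sketch uses only k-tuple occurrence frequencies (no location or order information), plain Euclidean distance, and naive average-linkage agglomerative clustering. The function names `ktuple_vector`, `euclidean`, and `average_linkage`, and the toy sequences, are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def ktuple_vector(seq, k=3):
    """Relative frequency of every k-tuple over ACGT, in lexicographic order.

    Assumption: only occurrence counts are used; the location and order
    information described in the abstract is NOT modeled here.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    keys = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[key] for key in keys) or 1  # avoid division by zero
    return [counts[key] / total for key in keys]


def euclidean(u, v):
    """Euclidean distance, a stand-in for the paper's DMk measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(vectors, n_clusters):
    """Naive agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(euclidean(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    return clusters


# Toy sequences (hypothetical, not from the paper's data sets).
seqs = ["ACGTACGTACGT", "ACGTACGAACGT", "TTTTGGGGTTTT", "TTTGGGGGTTTT"]
vectors = [ktuple_vector(s, k=2) for s in seqs]
clusters = average_linkage(vectors, n_clusters=2)
print(clusters)
```

On these toy inputs the two ACGT-rich sequences end up in one cluster and the two T/G-rich sequences in the other. A top-down (divisive) procedure could be substituted for the agglomerative loop without changing the feature-vector step.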

Results: The proposed distance measure and clustering method are evaluated by clustering functionally related genes and by phylogenetic analysis. The method is also compared with BlastClust, CD-HIT-EST, and several other methods. The experimental results show that it is effective in classifying DNA sequences with similar biological characteristics and in discovering the underlying relationships among the sequences.

Conclusions: We introduce a novel clustering algorithm based on a new sequence similarity measure. It is effective in classifying DNA sequences with similar biological characteristics and in discovering the relationships among the sequences.


Figures

Figure 1. The phylogenetic trees for 10 species using the full DNA sequences of β-globin.
Figure 2. The distribution of F-measure as a function of the number of clusters based on the k-tuple distance (the real numbers of clusters in DS1, DS2, DS3 and DS4 are 8, 6, 6, and 6, respectively).
Figure 3. The distribution of F-measure as a function of the number of clusters based on DMk (the real numbers of clusters in DS1, DS2, DS3 and DS4 are 8, 6, 6, and 6, respectively).
Figure 4. The phylogenetic trees for 10 species using the full DNA sequences of β-globin.
Figure 5. The phylogenetic trees for 60 H1N1 viruses.
Figure 6. The time comparison of three methods.
Figure 7. The relationship between the runtime and different numbers of sequences and length of sequences.
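
Figures 2 and 3 report an F-measure as a function of the number of clusters. The abstract does not say which definition is used, so the sketch below assumes one common clustering F-measure: for each true class, take the best harmonic mean of precision and recall over all clusters, then average these values weighted by class size. The function name `clustering_f_measure` and the toy labels are hypothetical.

```python
def clustering_f_measure(true_labels, pred_labels):
    """Class-size-weighted best-match F-measure (an assumed definition)."""
    n = len(true_labels)
    total = 0.0
    for c in set(true_labels):
        class_size = sum(1 for t in true_labels if t == c)
        best = 0.0
        for k in set(pred_labels):
            cluster_size = sum(1 for p in pred_labels if p == k)
            overlap = sum(
                1 for t, p in zip(true_labels, pred_labels) if t == c and p == k
            )
            if overlap == 0:
                continue  # this cluster contains no members of class c
            precision = overlap / cluster_size
            recall = overlap / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += class_size / n * best
    return total


# Toy labels (hypothetical): a perfect clustering and a single merged cluster.
perfect = clustering_f_measure([0, 0, 1, 1], ["a", "a", "b", "b"])
merged = clustering_f_measure([0, 0, 1, 1], ["a", "a", "a", "a"])
print(perfect, merged)
```

Under this definition a clustering that matches the true classes exactly scores 1.0, and the score drops when the number of clusters is too small or too large, which is the kind of curve the two figures describe.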

