Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Jun 21;279(1):83-9.
doi: 10.1016/j.jtbi.2011.03.029. Epub 2011 Apr 2.

A novel clustering method via nucleotide-based Fourier power spectrum analysis

Affiliations

A novel clustering method via nucleotide-based Fourier power spectrum analysis

Bo Zhao et al. J Theor Biol. .

Abstract

A novel clustering method is proposed to classify genes or genomes. This method uses a natural representation of genomic data by binary indicator sequences of each nucleotide (adenine (A), cytosine (C), guanine (G), and thymine (T)). Afterwards, the discrete Fourier transform is applied to these indicator sequences to calculate spectra of the nucleotides. Mathematical moments are calculated for each of these spectra to create a multidimensional vector in a Euclidean space for each gene or genome sequence. Thus, each gene or genome sequence is realized as a geometric point in the Euclidean space. Finally, pairwise Euclidean distances between these points (i.e. genome sequences) are calculated to cluster the gene or genome sequences. This method is applied to three sets of data. The first is 34 strains of coronavirus genomic data, the second is 118 of the known strains of Human rhinovirus (HRV), and the third is 30 bacteria genomes. The distance matrices are computed based on the three sets, showing the distances from each point to the others. We used the complete linkage clustering algorithm to build phylogenetic trees to indicate how the distances among these sequence correspond to the evolutionary relationship among these sequences. This genome representation provides a powerful and efficient method to classify genomes and is much faster than the widely acknowledged multiple sequence alignment method.

PubMed Disclaimer

Figures

Fig. 1
Fig. 1
These phylogenetic trees show the 30 coronavirus genomes as well as the four outgroup genomes. Our method has split them into the correct groups: outgroup and Groups 1-5.
Fig. 2
Fig. 2
These phylogenetic trees show the 116 human rhinovirus genomes as well as the three outgroup genomes. Our method has split them into the correct groups (HRV-A, HRV-B, HRV-C, and HEV-C (outgroup)) in complete linkage clustering, but there are seven misplaced HRV-A genomes in the tree created by average linkage clustering.
Fig. 3
Fig. 3
These phylogenetic trees show the 30 bacteria genomes. Our method has split them into the correct families: Enterobacteriaceae, Staphylococcaceae, Rhodobacteriaceae, Burkholderiaceae, Bacilleceae, Spirochaetaceae, Clostridiaceae, and Desulfovibrionaceae.

References

    1. Brown N.P., Larkin M.A., Blackshields G. Clustal w and clustal x version 2.0. Bioinformatics. 2007;23(21):2947–2948. - PubMed
    1. Dawyndt P., De Meyer H., De Baets B. The complete linkage clustering algorithm revisited. Soft Computing. 2005;9(May):385–392.
    1. Edgar R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004;32(5):1792–1797. - PMC - PubMed
    1. Katoh K., Asimenos G., Toh H. Multiple alignment of dna sequences with MAFFT. Methods Molecular Biology. 2009;537:39–64. - PubMed
    1. Kryukov K., Saitou N. MISHIMA – a new method for high speed multiple alignment of nucleotide sequences of bacterial genome scale data. BMC Bioinformatics. 2010;11(142) - PMC - PubMed

LinkOut - more resources