PLoS One. 2011 Mar 2;6(3):e17293.
doi: 10.1371/journal.pone.0017293.

A novel method of characterizing genetic sequences: genome space with biological distance and applications

Mo Deng et al. PLoS One. 2011.

Erratum in

  • PLoS One. 2011;6(3). doi: 10.1371/annotation/22351496-73dc-4205-9d9a-95a821ae74ca

Abstract

Background: Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyze the phylogeny of a whole genome. This motivates us to create a new approach to characterizing genetic sequences.

Methodology: To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance, which makes global comparison of genomes with the same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondria. The results show that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists' analyses.
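As a rough illustration of the idea, the sketch below builds a 12-dimensional vector per sequence (count, mean position and normalized second central moment of the positions of each nucleotide) and compares two sequences by Euclidean distance. The abstract does not spell out the vector definition or the metric, so both choices, along with the names `natural_vector` and `distance`, are assumptions made for this example. This truncated form is a simplification and is not claimed to preserve the one-to-one correspondence described above.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return [counts | mean positions | normalized 2nd central moments] for A, C, G, T.

    Assumed formulation (not taken from the abstract): positions are 1-based and
    the second moment of nucleotide k is sum((p - mean_k)^2) / (n_k * n).
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # A nucleotide that never occurs contributes zeros.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mean_k = sum(positions) / n_k
        moment_k = sum((p - mean_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mean_k)
        moments.append(moment_k)
    return counts + means + moments


def distance(seq_a, seq_b):
    """Euclidean distance between the natural vectors of two sequences."""
    return math.dist(natural_vector(seq_a), natural_vector(seq_b))
```

No alignment step is involved: each sequence is scanned once, so sequences of different lengths can be compared directly.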

Conclusions: Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than with multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas multiple alignment methods require realignment whenever new sequences are added. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.
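The point about not reconstructing the genome space can be sketched as follows. Here a plain dictionary stands in for the database, the vectors are toy values, and the function name `add_genome` and the Euclidean metric are assumptions made for this example, not details from the paper.

```python
import math


def add_genome(space, name, vector):
    """Add one vector to a stored genome space.

    `space` maps genome names to already-computed vectors. Only the distances
    from the new vector to the existing entries are computed; the stored
    vectors are left untouched.
    """
    distances = {other: math.dist(vector, stored) for other, stored in space.items()}
    space[name] = tuple(vector)
    return distances


# Toy usage with hypothetical two-dimensional vectors.
space = {"genome_a": (0.0, 0.0), "genome_b": (3.0, 4.0)}
new_distances = add_genome(space, "genome_c", (0.0, 4.0))
print(new_distances)
```

Adding a genome to a space of m stored vectors therefore costs one vector computation plus m distance evaluations, with no change to the existing entries.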

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Genome analysis.
We apply our method to analyze 59 influenza viruses based on their whole genomes. The natural vector and hierarchical clustering methods are used to reconstruct the phylogenetic tree from the whole-genome nucleotide sequences of selected influenza viruses. The selected viruses are representative of all available relevant sequences in GenBank. The sequences have both high and low divergence to avoid biasing the distribution of branch lengths. Strains are representative of the major gene lineages from different hosts. The robustness of individual nodes of the tree is assessed using a bootstrap resampling analysis with 1000 replicates, shown in Supporting Information S1. The figure clearly shows that the new influenza A (H1N1) viruses originate from the North American triple-reassortant swine virus and Eurasian classical swine virus lineages. We note that A/swine/Nakhon pathom/NIAH586-1/2005(H3N2), A/duck/Nanchang/4-165/2000(H4N6) and the American avian virus A/blue-winged teal/Ohio/1864/2006(H3N8) are not clustered with A (H1N1) genomes from their respective geographical regions. This result is caused by structural differences between these genomes and the traditional A (H1N1) subtypes. In addition, the distance matrix of these genomes obtained from natural vectors shows that A/duck/Nanchang/4-165/2000(H4N6) is closest to A/duck/NY/185502/2002(H5N2). Meanwhile, A/blue-winged teal/Ohio/1864/2006(H3N8) is closest to A/chicken/Korea/ES/03(H5N1) and A/egret/Hong Kong/757.2/2003(H5N1), which suggests that A/blue-winged teal/Ohio/1864/2006(H3N8) is evolutionarily related to the H5N1 avian virus outbreak in Asian countries from 2003 to 2006. As for A/swine/Nakhon pathom/NIAH586-1/2005(H3N2), it is closest to A/swine/Tianjin/01/2004(H1N1) and then to A/swine/Ontario/55383/04(H1N2) and A/swine/OH/511445/2007(H1N1).
This H3N2 virus is most closely related to Eurasian swine viruses even though it is clustered within the American swine clade (the large distance matrix is not shown and is available upon request).
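The two operations used in this caption, looking up the closest genome in the distance matrix and building a tree by hierarchical clustering, can be sketched on precomputed vectors as below. The caption does not state which linkage rule was used, so average linkage is an assumption, as are the Euclidean metric and the function names.

```python
import math
from itertools import combinations


def nearest_neighbor(name, vectors):
    """Return the name of the genome whose vector is closest to `name`."""
    others = (other for other in vectors if other != name)
    return min(others, key=lambda other: math.dist(vectors[name], vectors[other]))


def _average_distance(members_a, members_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(math.dist(vectors[x], vectors[y]) for x in members_a for y in members_b)
    return total / (len(members_a) * len(members_b))


def average_linkage_tree(vectors):
    """Agglomerative clustering; returns the tree as nested tuples of names."""
    # Each key is a tree node (a name or a nested tuple); each value lists its leaves.
    clusters = {name: [name] for name in vectors}
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: _average_distance(clusters[pair[0]], clusters[pair[1]], vectors),
        )
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))
```

As the caption notes for the H3N2 swine virus, a genome's nearest neighbor in the distance matrix does not have to sit in the same clade of the clustered tree, because each merge depends on distances between whole clusters rather than a single pair.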
Figure 2. The natural vector method is used to cluster HRV genomes at the whole-genome level.
Details of the HRV data are described in Supporting Information S1. This figure shows the relationships between all known HRV serotypes, based on full genome sequences. The HEV-B and HEV-C sequences are used as outgroups. The five clusters listed around the circular tree (HRV-C, HRV-B, HRV-A, HEV-B and HEV-C) are clearly separated using MEGA software. This clustering result is the same as the result of Palmenberg et al. shown in figure S6a of their paper. Our method needs only 18 seconds to obtain this clustering result, whereas the multiple alignment method takes more than 19 hours on the same dataset.
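To put the two timings on the same scale, the reported figures can be converted to seconds. Because the alignment time is given only as "more than 19 hours", the ratio is a lower bound:

```latex
\[
\frac{t_{\text{alignment}}}{t_{\text{natural vector}}}
  > \frac{19 \times 3600\ \text{s}}{18\ \text{s}}
  = \frac{68\,400}{18}
  = 3800
\]
```

On this dataset the natural vector clustering is thus at least about 3800 times faster than the multiple alignment run.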
Figure 3. Genome analysis of 31 mammalian mitochondrial genomes.
We applied our method to analyze 31 mammalian mitochondrial genomes. Our clustering analysis shows that all 31 genomes are correctly grouped into 7 known clusters: Erinaceomorpha (cluster 1), Primates (cluster 2), Carnivora (cluster 3), Perissodactyla (cluster 4), Cetacea and Artiodactyla (cluster 5), Lagomorpha (cluster 6) and Rodentia (cluster 7). Data are provided in Table 1. For the primate and carnivore subgroups, the clades differ slightly from those obtained using mitochondrial DNA coding sequences. In this experiment, we use whole genome sequences containing all tRNA, rRNA and polypeptide-encoding genes and the D-loop, rather than mtDNA coding sequences, which may lead to slightly different results. In fact, the distance matrix obtained from natural vectors shows that human is closest to c.chimpanzee and p.chimpanzee, with distances of 994123.7 and 1346597.8 respectively, while giant panda is closest to black bear, with a distance of 2468063, although they are not clustered together.
Figure 4. Phylogenetic trees reconstructed using the maximum likelihood (ML) alignment method with the Jukes-Cantor model, and the neighbor-joining (NJ) method with the Kimura 2-parameter model and with the Jukes-Cantor model.
The swine flu viruses are clearly not clustered correctly by the ML method (figure 4(a)). The NJ method yields completely different phylogenetic trees under the Kimura and Jukes-Cantor models. The Kimura model fails to identify the origin of the A (H1N1) virus, since the A (H1N1) genomes are all very far from the other genomes (figure 4(b)), while the Jukes-Cantor model fails to cluster the swine flu viruses correctly (figure 4(c)). The data are described in Supporting Information S1.

References

    1. Amano K, Nakamura H. Self-organizing clustering: a novel non-hierarchical method for clustering large amount of DNA sequences. Genome Inform. 2003;14:575–576.
    2. Emrich SJ, Kalyanaraman A, Aluru S. Algorithms for large-scale clustering and assembly of biological sequence data. In: Aluru S, editor. Handbook of Computational Molecular Biology. 2006. pp. 13.1–13.30.
    3. FitzGerald PC, Shlyakhtenko A, Mir A, Vinson C. Clustering of DNA sequences in human promoters. Genome Res. 2004;14:1562–1574.
    4. Waterman SM. Introduction to computational biology: maps, sequences and genomes. Boca Raton: Chapman & Hall/CRC Press; 1995. 431
    5. Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, et al. Informatics for unveiling hidden genome signatures. Genome Res. 2003;13:693–702.