BMC Evol Biol. 2005 Nov 9:5:63.
doi: 10.1186/1471-2148-5-63.

Exploration of phylogenetic data using a global sequence analysis method

Charles Chapus et al. BMC Evol Biol. 2005.

Abstract

Background: Molecular phylogenetic methods are based on alignments of nucleic acid or peptide sequences. The tremendous increase in molecular data permits phylogenetic analyses of very long sequences and of many species, but also requires methods to help manage large datasets.

Results: Here we explore the phylogenetic signal present in molecular data by genomic signatures, defined as the set of frequencies of short oligonucleotides present in DNA sequences. The signature violates many of the standard assumptions of traditional phylogenetic analyses, in particular the explicit statements of homology inherent in character matrices. It nevertheless permits the analysis of very long sequences, even those that are unalignable, and is therefore most useful in cases where alignment is questionable. We compare the results obtained by traditional phylogenetic methods to those inferred by the signature method for two genes: RAG1, which is easily alignable, and 18S rRNA, where alignments are often ambiguous for some regions. We also apply this method to a multigene data set of 33 genes for 9 bacteria and one archaeal species, as well as to the whole genomes of a set of 16 gamma-proteobacteria. In addition to delivering phylogenetic results comparable to traditional methods, the comparison of signatures for the sequences involved in the bacterial example identified putative candidates for horizontal gene transfers.

Conclusion: The signature method is therefore a fast tool for exploring phylogenetic data, serving not only as a pretreatment for discovering new sequence relationships, but also as a means of identifying cases of sequence evolution that could confound traditional phylogenetic analysis.
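
To make the definition of a genomic signature in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. It counts overlapping words of length k on a single strand and compares two signatures with the three metrics named in the figure captions (Euclidean, χ2, City Block). The function names, the single-strand counting, the skipping of words with ambiguous bases, and the symmetric form of the χ2 distance are all assumptions made for illustration; the article may define these details differently.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def signature(seq, k=6):
    """Frequencies of all 4**k words of length k, in a fixed lexicographic order.

    Assumption: overlapping words are counted on the given strand only, and
    words containing characters outside ACGT are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words)
    if total == 0:
        raise ValueError("sequence contains no unambiguous word of length k")
    return [counts[w] / total for w in words]


def euclidean(p, q):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def city_block(p, q):
    """City Block (Manhattan) distance between two signatures."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chi2(p, q):
    """Symmetric chi-square distance (an assumed form, see the lead-in)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


if __name__ == "__main__":
    # Toy sequences, far shorter than the 5 kb sequences used for Figure 1.
    s1 = signature("ACGTACGTTAGCCGATAGCTAGCTA", k=2)
    s2 = signature("TTGACCGATCGATCGGATCGATTAG", k=2)
    print(euclidean(s1, s2), city_block(s1, s2), chi2(s1, s2))
```

A matrix of such pairwise distances can then be passed to any distance-based tree builder such as neighbor joining, which is how the signature trees in the figures are described as being inferred.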

Figures

Figure 1
Signature distance as a function of sequence identity. Distances obtained from 5 kb sequences (6-letter words, Euclidean metric). Each point represents the mean of 100 sequence comparisons. The standard deviation of each point is shown.
Figure 2
Dynamics of signature distance matrices. Distance matrices were obtained from the RAG1 vertebrate study (see below). There are two types of criteria: metric (for example Vaf, stress) and topological (arboricity, rate of well-designed quadruples, rate of elementary quadruples). Vaf (variance accounted for): quadratic difference divided by the variance of distance. Rate of well-designed quadruples: quadruples having the same topology according to the two distance matrices. For the rate of elementary quadruples and arboricity, see [26]. The y-axis shows the criterion values obtained with the distance method. For the stress, this value is also indicated by a dotted line.
Figure 3
Robinson-Foulds distance analysis of trees. The distances were computed from trees of the RAG1 study (see below). For each word length between 1 and 10, a signature tree was computed and compared to the NJ, ML and random trees. For comparison of random trees and signature trees, 100 random trees were built. In this latter case, the dT is approximately 86 (the maximum value possible with this number of species). As a reference, dT between the NJ and ML trees is plotted as a dashed line. The dT of the n-/(n+1)-letter word trees was computed for the Euclidean and χ2 metrics.
Figure 4
Phylogeny of vertebrate species. Three methods were applied to the RAG1 gene from 46 species. Distance method: alignment with ClustalW (Kimura 2-parameter distance), reconstruction by the NJ algorithm. MP: same alignment; PAUP* was used with default parameters. Signature method: 6-letter words, χ2 metric; the tree is inferred by the NJ method. The bootstrap coefficients for the distance and signature methods are indicated.
Figure 5
Phylogenetic tree of plants obtained by comparison of 18S rRNA signatures (6-letter words, χ2 metric). The bootstrap coefficients (500 sets) of the principal groups are indicated. The species class names are indexed by a code: A – Angiosperm, C – Conifer, G – Gnetale, Cyca – Cycad, F – Fern, M – Moss, L – Lycophyte, Lw – Liverwort, Hw – Hornwort (see annex for the code/species correspondence).
Figure 6
Hierarchical classification of 393 6-letter word signatures. The signatures of a given species have the same color code. For each species group, the name of the species is indicated at left. The EF-Tu gene, which also forms a stable group, is highlighted. Finally, arrows point out the horizontal transfer (HT) candidates that are discussed in this article.
Figure 7
Detailed view of the hierarchical classification of 393 6-letter word signatures. A detail focusing on the group with E. coli, S. Typhimurium and V. cholerae is shown. The symbols on the left of the names indicate the genes analyzed.
Figure 8
Consensus trees for ten species. The four methods shown are the signature (6-letter words – χ2 metric) method, distance method, MP and ML. For each method except ML, the bootstrap coefficients (100 sets) are indicated.
Figure 9
Dissimilarity distances between the consensus tree and the sets of genes retained. The dT distances have been computed for the distance, ML, MP and signature (6-letter words, χ2 metric) methods.
Figure 10
Trees of γ-proteobacteria. In each panel, each color corresponds to a taxonomic group.
A- Tree obtained from the MP method for the 16S rRNA sequences.
B- Tree obtained from non-corrected signatures (6-letter word signatures and City Block metric).
C- Tree obtained from signatures corrected by a zero-order Markov model (6-letter word signatures and City Block metric).
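
The tree comparisons reported as dT in Figures 3 and 9 rely on the Robinson-Foulds distance. As a hedged illustration only, the sketch below computes it for trees written as nested Python tuples of leaf names, by counting the non-trivial bipartitions found in one tree but not the other. The tuple representation and the helper names (`leaves`, `splits`, `robinson_foulds`, `max_rf`) are hypothetical choices for this example and are not taken from the article.

```python
def leaves(tree):
    """Set of leaf names in a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by internal edges."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Keep a split only if both sides hold at least two leaves.
        if len(clade) >= 2 and len(other) >= 2:
            found.add(frozenset([clade, other]))
        return clade

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaves(t1) != leaves(t2):
        raise ValueError("trees must share the same leaf set")
    return len(splits(t1) ^ splits(t2))


def max_rf(n_leaves):
    """Largest possible distance between two fully resolved unrooted trees."""
    return 2 * (n_leaves - 3)


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(robinson_foulds(t1, t2))
    print(max_rf(46))
```

For fully resolved unrooted trees on n leaves, each tree has n - 3 non-trivial bipartitions, so the distance cannot exceed 2(n - 3). With the 46 species of the RAG1 study this gives 86, which agrees with the maximum value quoted in the Figure 3 caption.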

References

    1. Lecointre G, Le Guyader H. Classification phylogénétique du vivant. Paris: Belin; 2001. p. 544.
    2. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981;17:368–376.
    3. Li WH. Molecular Evolution. Sinauer; 1997. p. 487.
    4. Higgins DG, Thompson JD, Gibson TJ. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 1996;266:383–402.
    5. Brocchieri L. Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol. 2001;59:27–40.
