BMC Bioinformatics. 2011 May 19:12:174.
doi: 10.1186/1471-2105-12-174.

Rebooting the human mitochondrial phylogeny: an automated and scalable methodology with expert knowledge

Roberto Blanco et al. BMC Bioinformatics. 2011.

Abstract

Background: Mitochondrial DNA is an ideal source of information for evolutionary and phylogenetic studies owing to its extraordinary properties and abundance. Many insights can be gained from these studies, including but not limited to screening genetic variation to identify potentially deleterious mutations. However, such advances require efficient solutions to very difficult computational problems, a need that is hampered by the very abundance of data that gives the analysis its strength.

Results: We develop a systematic, automated methodology to overcome these difficulties, building from readily available, public sequence databases to high-quality alignments and phylogenetic trees. Within each stage in an autonomous workflow, outputs are carefully evaluated and outlier detection rules defined to integrate expert knowledge and automated curation, hence avoiding the manual bottleneck found in past approaches to the problem. Using these techniques, we have performed exhaustive updates to the human mitochondrial phylogeny, illustrating the power and computational scalability of our approach, and we have conducted some initial analyses on the resulting phylogenies.

Conclusions: The problem at hand demands careful definition of inputs and adequate algorithmic treatment for its solutions to be realistic and useful. Formal rules can address the former requirement by refining inputs directly and through their combination as outputs; these outputs also help to ascertain the performance of the chosen algorithms. Rules can exploit known or inferred properties of datasets to simplify inputs through partitioning, thereby cutting computational costs and enabling work on rapidly growing, otherwise intractable datasets. Although expert guidance may be necessary to assist the learning process, low-risk results can be fully automated and have proved convenient and valuable.

Figures

Figure 1
Individual sequence features. These descriptors operate directly on sequences. Due to their simplicity, they comprise the first tests to be applied to any prospective members of a dataset. (a) The sequence length histogram locates unusually short or long sequences, commonly classifying correct genomes as belonging to strict or flexible sets, and also detecting outliers which cannot be straightforwardly ascribed to either group. Blue dots mark accepted strict sequences; red dots, outlier strict sequences; and green dots, flexible and not strict sequences. (b) The ambiguity covering histogram serves as an aid for determining acceptable ambiguity thresholds and approximates a simple measure of aggregated quality. (The green dot shows the base covering of fully defined sequences, with zero ambiguous positions.)
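The two per-sequence descriptors in this caption can be sketched in a few lines. This is a minimal illustration only: the function names, the length ranges, and the ambiguity cap are hypothetical parameters chosen by the caller, not values or rules taken from the paper.

```python
from collections import Counter


def count_ambiguous(seq):
    """Count positions that are not a fully defined base (A, C, G, or T)."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


def classify(seq, strict_range, flexible_range, max_ambiguous):
    """Label a sequence 'strict', 'flexible', or 'outlier'.

    All thresholds are caller-supplied assumptions, not the paper's cut-offs.
    """
    length = len(seq)
    lo_s, hi_s = strict_range
    lo_f, hi_f = flexible_range
    if lo_s <= length <= hi_s and count_ambiguous(seq) <= max_ambiguous:
        return "strict"
    if lo_f <= length <= hi_f:
        return "flexible"
    return "outlier"


def length_histogram(seqs):
    """Map each observed sequence length to the number of sequences with it."""
    return Counter(len(s) for s in seqs)
```

In practice the ranges would be read off the length histogram itself, which is the role the caption assigns to panel (a).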
Figure 2
Parsimony edit distances. This figure plots the edit distance histogram of the strict database. The intraspecies markers are: blue dots for H. sapiens distances, red dots for H. neanderthalensis, and dark yellow dots for H. sp. altai (the latter two are clearly limited by the available sequences). Interspecies markers are: purple dots for H. sapiens-H. neanderthalensis, green dots for H. sapiens-H. sp. altai, and orange dots for H. neanderthalensis-H. sp. altai. The separation between all three species is clearly visible.
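A distance histogram of this kind, split into intraspecies and interspecies pairs, might be computed as follows. The sketch assumes sequences are already aligned to equal length and substitutes a plain Hamming distance for the paper's parsimony edit distance, whose exact definition is not given here; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_histograms(records):
    """Build one distance histogram per species pair.

    records: list of (species_label, aligned_sequence) tuples.
    Returns a dict keyed by the sorted species pair; a key with the same
    label twice holds intraspecies distances.
    """
    histograms = {}
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        key = tuple(sorted((sp1, sp2)))
        histograms.setdefault(key, Counter())[hamming(s1, s2)] += 1
    return histograms
```

Well-separated species would then show up as interspecies histograms that barely overlap the intraspecies ones.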
Figure 3
Updated human mitochondrial phylogeny. The left phylogram shows the base binary tree with its associated branch lengths (note the long Neanderthal and Altai clades next to L0). The right cladogram presents the aggregate bootstrap consensus and locates the main haplogroups as defined by MITOMAP's simplified lineages. Both trees are rooted according to the directionality of said classification.
Figure 4
Genetic conservation measures. Conservation statistics are useful to evaluate single sequences and extract knowledge from the combined pool of available data. Incomplete sequences have to be discarded at least partially for results to reflect true polymorphic variations exclusively; the following plots do not include non-strict data. (a) The sequential conservation profile of the alignment indicates regions and positions of special interest. (b) When this profile is transformed into the conservation frequency histogram, some global trends become apparent. Blue dots are used for α ≥ 0.99, green dots for 0.95 ≤ α < 0.99, and red dots for α < 0.95. Note that under these thresholds, the great majority of mutations affects conserved positions: P(α < 0.95) = 0.570%, P(α < 0.99) = 2.905%. In view of these extreme levels of conservation, it may be interesting to tune α to adjust the significance of "high" conservation levels to the raw amounts of closely related data.
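One way to picture the profile and its binning is sketched below. It assumes that α for a column is the frequency of that column's most common symbol, which is a common convention but is not confirmed by the caption; the function names are hypothetical, and the default thresholds are the 0.99 and 0.95 values quoted above.

```python
from collections import Counter


def conservation_profile(alignment):
    """Per-column conservation of an alignment (list of equal-length strings).

    Assumption: conservation is the frequency of the most common symbol.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    n = len(alignment)
    return [Counter(col).most_common(1)[0][1] / n for col in zip(*alignment)]


def bin_conservation(profile, high=0.99, mid=0.95):
    """Count columns with alpha >= high, mid <= alpha < high, and alpha < mid."""
    bins = {"high": 0, "mid": 0, "low": 0}
    for alpha in profile:
        if alpha >= high:
            bins["high"] += 1
        elif alpha >= mid:
            bins["mid"] += 1
        else:
            bins["low"] += 1
    return bins
```

Dividing each bin count by the number of columns would give proportions comparable in form to the P(α < 0.95) and P(α < 0.99) figures in the caption.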
Figure 5
Mutational statistics for the reference strict phylogeny. Quantitative and statistical analysis of tree properties not only offers useful information about the evolutionary processes under study, but is also a very powerful means of detecting discordances at all stages of the reconstruction. Strict trees are particularly amenable to this treatment. (a) The histogram of point mutations per branch highlights typical patterns of evolution and identifies unusual and possibly error-prone generation points. Red dots represent outliers; and blue dots mark standard ranges and legitimate exceptions within the outlier range, with the green dot signaling the frequency of empty clades. (b) The histogram of individual mutation frequencies (i.e., the number of generation points for each mutation or group of related mutations) aims at the identification of especially important and recurrent mutations and the subsequent study of their patterns of generation.
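Flagging branches with unusual mutation counts could look like the sketch below. The caption does not say which outlier rule the authors use, so this example swaps in Tukey's interquartile-range fences as a stand-in; the function names and the fence factor `k` are assumptions.

```python
import statistics


def outlier_branches(mutations_per_branch, k=1.5):
    """Return branch ids whose mutation count falls outside Tukey's fences.

    mutations_per_branch: dict mapping branch id -> number of point mutations.
    The IQR rule is an assumed stand-in, not the paper's own outlier rule.
    """
    counts = sorted(mutations_per_branch.values())
    q1, _, q3 = statistics.quantiles(counts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sorted(
        branch
        for branch, count in mutations_per_branch.items()
        if count < low or count > high
    )


def empty_branch_fraction(mutations_per_branch):
    """Fraction of branches that carry no mutations at all."""
    total = len(mutations_per_branch)
    empty = sum(1 for c in mutations_per_branch.values() if c == 0)
    return empty / total if total else 0.0
```

Flagged branches are candidates for review rather than automatic rejection, since the caption notes that legitimate exceptions can fall within the outlier range.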
Figure 6
System workflow architecture. Individual algorithms and testing procedures are integrated into a workflow that directs and automates their interactions. Storage stages are interleaved with transformation stages, which are either algorithmic (marked as arrow-shaped triangles, they explicitly advance the resolution of the problem associated with their input) or restrictive (marked with a diamond shape, these tests refine algorithmic inputs, diverting simple flows through feedback loops to previous storage stages). Note that many transformation stages are actually concurrent scatter-gather processes (e.g., gene alignments and bootstrap replicates).
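The two kinds of transformation stage described here can be illustrated with a toy sketch. Everything in it is hypothetical: the function names, the use of a thread pool, and the example test are illustrative assumptions, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def restrictive_stage(items, test):
    """Split items into those that pass a test and those diverted back.

    The rejected list stands in for the feedback loop to an earlier
    storage stage.
    """
    accepted = [item for item in items if test(item)]
    rejected = [item for item in items if not test(item)]
    return accepted, rejected


def scatter_gather(items, transform, max_workers=4):
    """Apply an algorithmic stage to every item concurrently.

    Results are gathered in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform, items))
```

For example, `restrictive_stage(seqs, lambda s: "N" not in s)` would hold back sequences containing ambiguous bases, and `scatter_gather(accepted, align_gene)` would then run a caller-supplied, hypothetical `align_gene` function on each accepted input in parallel.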
