BMC Evol Biol. 2010 Jul 13;10:210.

doi: 10.1186/1471-2148-10-210.

BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments

Alexis Criscuolo¹, Simonetta Gribaldo

Affiliation

¹ Institut Pasteur, Unité de Biologie Moléculaire du Gène chez Extrêmophiles, Département de Microbiologie, 25 rue du Dr Roux, 75015 Paris, France. alexis.criscuolo@pasteur.fr

PMID: 20626897
PMCID: PMC3017758
DOI: 10.1186/1471-2148-10-210

Alexis Criscuolo et al. BMC Evol Biol. 2010.

Abstract

Background: The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. Removing ambiguously aligned regions, as well as other sources of bias such as highly variable (saturated) characters, has been shown to improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow precise manual curation of each dataset, there is a real need for efficient bioinformatic tools dedicated to this alignment character trimming step.

Results: We present new software, named BMGE (Block Mapping and Gathering with Entropy), designed to select the regions of a multiple sequence alignment that are suited to phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. These entropy-like scores are weighted with BLOSUM or PAM similarity matrices in order to distinguish between biologically expected and unexpected variability at each aligned character. Sets of contiguous characters with a score above a given threshold are considered unsuited to phylogenetic inference and are removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially for alignments that include distantly related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity.

Conclusions: BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.
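The trimming idea summarized in the Results can be illustrated with a minimal sketch. It is not BMGE's actual score: it uses plain normalised Shannon entropy, does not apply the BLOSUM/PAM weighting mentioned in the abstract, and drops columns one at a time rather than as contiguous blocks. The function names, the 0.5 threshold, and the treatment of all-gap columns are assumptions made for illustration only.

```python
import math
from collections import Counter


def column_entropy(column, alphabet_size):
    """Shannon entropy of one alignment column, normalised to [0, 1].

    Gap characters ('-') are ignored. Assumption: an all-gap column is
    treated as maximally uninformative and scored 1.0.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 1.0
    n = len(residues)
    counts = Counter(residues)
    h = -sum((k / n) * math.log(k / n) for k in counts.values())
    return h / math.log(alphabet_size)


def trim_alignment(seqs, threshold=0.5, alphabet_size=4):
    """Keep the columns whose normalised entropy does not exceed the threshold.

    Returns the trimmed sequences and the indices of the retained columns.
    """
    columns = list(zip(*seqs))
    keep = [
        i
        for i, col in enumerate(columns)
        if column_entropy(col, alphabet_size) <= threshold
    ]
    trimmed = ["".join(s[i] for i in keep) for s in seqs]
    return trimmed, keep


if __name__ == "__main__":
    # Toy DNA alignment: the fourth column is fully variable, the rest are constant.
    alignment = ["ACGTA", "ACGAA", "ACGCA", "ACGGA"]
    trimmed, kept_columns = trim_alignment(alignment)
    print(kept_columns)
    print(trimmed)
```

In this toy alignment only the fully variable fourth column exceeds the threshold. A matrix-weighted score of the kind the abstract describes would additionally penalise substitutions between dissimilar residues more than substitutions between similar ones.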

Figures

**Figure 1**
**ROC graphs plotting true (y-axis) and false (x-axis) positive rates for seven character trimming methods**. The best methods (i.e., those that minimize the number of both false negative and false positive characters) are those whose cloud of points is concentrated around the upper left point (0,1). Under each ROC graph, the average L₁ distance between each point and the (0,1) point is given. For each level of divergence, the best (i.e. lowest) distance is written in boldface characters. Average L₁ distances that are not significantly different from this best value (as assessed by a sign test) are underscored.

**Figure 2**
Distributions of the BioNJ bootstrap-based confidence values on the true branches of model trees estimated from initial (non-trimmed) multiple sequence alignments and from alignments returned by seven character trimming methods. Average confidence values are written under each corresponding histogram. For each level of divergence, the best (i.e. highest) average confidence value is written in boldface characters. Average confidence values associated with distributions that are not significantly different from this best distribution (as assessed by a χ² test) are underscored.

**Figure 3**
Distributions of the aLRT confidence values on the true branches of model trees estimated from initial (non-trimmed) multiple sequence alignments and from alignments returned by seven character trimming methods. See Figure 2.

**Figure 4**
**ROC curves constructed by thresholding BioNJ bootstrap-based confidence values**. The best tree branch classifiers (i.e., those that assign higher confidence values to true branches than to false branches) are those that maximize the area under the ROC curve (AUC). For each simulation case, the AUC is given under the corresponding ROC space representation. For each column, the best (i.e. highest) AUC is written in boldface characters. AUCs that are not significantly different from this best value (as assessed by a Z test) are underscored.

**Figure 5**
**ROC curves constructed by thresholding aLRT-based confidence values**. See Figure 4.

**Figure 6**
**Phylogenetic trees obtained from a non-trimmed character supermatrix (left) and from the concatenation of the multiple sequence alignments trimmed by BMGE with BLOSUM95 (right)**. These ML trees were inferred by PhyML with the model mtREV+Γ₈+I. Note that the left topology was also inferred from character supermatrices built by concatenating multiple sequence alignments trimmed by BMGE (BLOSUM30 and BLOSUM62), Gblocks (relaxed), trimAl (strictplus and automated1), and Noisy. Bootstrap-based and aLRT-based confidence values at nodes (1), (1'), (2) and (3) are given in Table 6.

**Figure 7**
**Phylogenetic tree used to simulate the non-stationary evolution of a DNA sequence**. This tree and the different branch lengths are closely related to [24].

**Figure 8**
**Frequency of recovery of the model tree as a function of the proportion of regions with heterogeneous composition inside an alignment of four DNA sequences**.

**Figure 9**
**Phylogenetic trees inferred before (left) and after (right) stationary-based character trimming**. Both trees are inferred by minimizing the Minimum Evolution criterion with GTR and LogDet distance estimations. ME bootstrap-based confidence values at branches are all 100%.
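The ROC summary used in the caption of Figure 1 (one point per trimmed alignment, scored by its L₁ distance to the ideal point (0,1)) can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, and a "positive" is assumed here to be a character that should be retained, which the captions do not spell out.

```python
def roc_point(truth, kept):
    """Return (false positive rate, true positive rate) for one trimming result.

    truth[i] is True when column i should be retained (assumed definition of
    a positive); kept[i] is True when the trimming method retained column i.
    """
    tp = sum(1 for t, k in zip(truth, kept) if t and k)
    fn = sum(1 for t, k in zip(truth, kept) if t and not k)
    fp = sum(1 for t, k in zip(truth, kept) if not t and k)
    tn = sum(1 for t, k in zip(truth, kept) if not t and not k)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr


def l1_distance_to_ideal(fpr, tpr):
    """L1 (Manhattan) distance between a ROC point and the ideal point (0, 1)."""
    return abs(fpr - 0.0) + abs(1.0 - tpr)


if __name__ == "__main__":
    # Hypothetical five-column alignment: three good columns, two bad ones.
    truth = [True, True, True, False, False]
    kept = [True, True, False, True, False]
    fpr, tpr = roc_point(truth, kept)
    print(fpr, tpr, l1_distance_to_ideal(fpr, tpr))
```

A method that keeps every good column and discards every bad one lands exactly on (0,1) and has distance 0. Averaging this distance over many simulated alignments gives a single figure per method of the kind reported under each ROC graph.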

References

1. Lake JA. The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991;8:378–385.
2. Morrison DA, Ellis JT. Effects of nucleotide sequence alignment on phylogeny estimation: a case study on 18S rDNAs of apicomplexa. Mol Biol Evol. 1997;14:428–441.
3. Ogden TH, Rosenberg MS. Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006;55:314–328. doi: 10.1080/10635150500541730.
4. Wang L-S, Leebens-Mack J, Wall PK, Beckmann K, dePamphilis CW, Warnow T. The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE/ACM Trans Comput Biol Bioinf. 2009; in press.
5. Edgar RC, Batzoglou S. Multiple sequence alignment. Curr Opin Struct Biol. 2006;16:368–373. doi: 10.1016/j.sbi.2006.04.004.
