Bioinformatics. 2011 Feb 1;27(3):317-25.
doi: 10.1093/bioinformatics/btq651. Epub 2010 Dec 1.

maxAlike: maximum likelihood-based sequence reconstruction with application to improved primer design for unknown sequences

Peter Menzel et al. Bioinformatics.

Abstract

Motivation: Reconstructing a genomic sequence for a particular species is becoming increasingly important given the rapid development of high-throughput sequencing technologies and their limitations. Applications include compensating for missing data in unsequenced genomic regions, designing oligonucleotide primers for target genes in species that lack sequence information, and preparing customized queries for homology searches.

Results: We introduce the maxAlike algorithm, which reconstructs a genomic sequence for a specific taxon from sequence homologs in other species. The input is a multiple sequence alignment and a phylogenetic tree that also contains the target species. For this target species, the algorithm computes nucleotide probabilities at each sequence position. Consensus sequences are then reconstructed at a chosen confidence level. For 37 of 44 target species in a test dataset, reconstruction accuracy increases significantly compared with both the consensus sequence from the alignment and the sequence of the nearest phylogenetic neighbor. When only nucleotides above a confidence limit are considered, maxAlike is significantly better (by up to 10%) in all 44 species. The improved sequence reconstruction also raises the quality of PCR primer design for as yet unsequenced genes: the difference between the expected and actual Tm of the primer-template duplex is reduced by ~26% compared with other reconstruction approaches. We also show that the prediction accuracy is robust to common distortions of the input trees: for 77% of trees derived from random genomic loci in a test dataset, it drops by only 1% on average across all species.
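
To make the idea of per-position nucleotide probabilities for a target species concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Jukes-Cantor (JC69) substitution model with uniform base frequencies, which the abstract does not specify, and it handles a single alignment column. The tree encoding (a leaf is a species name, an internal node is a list of `(child, branch_length)` pairs) and the names `jc69`, `partial` and `target_probabilities` are hypothetical, chosen only for this illustration.

```python
import math

BASES = "ACGT"

def jc69(t):
    """Jukes-Cantor transition probabilities for branch length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}

def partial(node, column):
    """Felsenstein pruning: likelihood of the leaves below `node` for each state at `node`."""
    if isinstance(node, str):
        # Leaf: species absent from the column, 'N' and '-' are treated as missing data.
        obs = column.get(node, "N")
        return {b: 1.0 if obs in (b, "N", "-") else 0.0 for b in BASES}
    like = {b: 1.0 for b in BASES}
    for child, t in node:
        p = jc69(t)
        below = partial(child, column)
        for a in BASES:
            like[a] *= sum(p[a][b] * below[b] for b in BASES)
    return like

def target_probabilities(tree, column, target):
    """P(target base = x | observed column), by fixing x at the target leaf and normalising."""
    joint = {}
    for x in BASES:
        col = dict(column)
        col[target] = x
        root = partial(tree, col)
        joint[x] = sum(0.25 * root[a] for a in BASES)  # uniform root prior
    total = sum(joint.values())
    return {x: joint[x] / total for x in BASES}

# Hypothetical toy input: species names and branch lengths are illustrative only.
tree = [([("human", 0.01), ("chimp", 0.01)], 0.02), ("target", 0.03), ("mouse", 0.30)]
column = {"human": "A", "chimp": "A", "mouse": "G"}
probs = target_probabilities(tree, column, "target")
print(probs)
```

Applying `target_probabilities` to every column of an alignment would yield one probability vector per position, which is the kind of position-specific matrix that a consensus sequence can then be read from.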

Availability: maxAlike is available for download and as a web server at: http://rth.dk/resources/maxAlike.

Figures

Fig. 1. The steps of the maxAlike algorithm. From the input, consisting of a multiple alignment and a phylogenetic tree, the algorithm computes PSSMs and reconstructed sequences for the target species. The output can readily be applied to primer design and homology search.

Fig. 2. Dataset MZ44-2: median MATCH scores for maxAlike (ML) and nucleotide frequency (Freq) PSSMs for each species compared with the average distance to its phylogenetically closest neighbor.

Fig. 3. Dataset MZ44-1: recovery rates in percent for sequences reconstructed by maxAlike (ML), frequency-based consensus (Freq) and nearest neighbor (NN). Each point is one species plotted as its average distance to the phylogenetically nearest neighbor. (a) threshold 0.5. (b) no threshold.

Fig. 4. Dataset MZ44-1: (a) Average change of total recovery rates across all species for different sets of input trees: gene tree (F); reference species tree (S); (1–10) bins with trees estimated from other genomic loci; increasing bin number corresponds to higher topological distance to reference tree. (b) Change in the Tm difference due to increased number of mismatches in the primer sequence.

Fig. 5. Average change of total recovery rates across all species for different sets of input trees: gene tree (F); reference species tree (S); bins with trees having distorted branch lengths using the specified relative normal errors. (a) MZ44-1. (b) MZ44-2.

Fig. 6. Dataset MZ44-2: average differences of the expected and actual melting temperature Tm of the primer–template duplex for primers derived from maxAlike (threshold 0.5) and Freq (threshold 0.5) reconstructed sequences and nearest neighbor (NN) sequence for each species, sorted by average distance to its phylogenetically nearest neighbor.
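
The thresholded consensus and the recovery rates named in the captions above can be illustrated with a short sketch. It rests on two assumptions that this page does not confirm: a position is called when its most probable base reaches the threshold (otherwise it is written as `N`), and the recovery rate is the percentage of called positions that match the true sequence. The function names `consensus` and `recovery_rate` and the example matrix are hypothetical.

```python
def consensus(pssm, threshold=0.5):
    """Most probable base per column, or 'N' when its probability is below the threshold."""
    out = []
    for probs in pssm:
        base = max(probs, key=probs.get)
        out.append(base if probs[base] >= threshold else "N")
    return "".join(out)

def recovery_rate(reconstructed, reference):
    """Percent of called (non-'N') positions that equal the reference (assumed definition)."""
    if len(reconstructed) != len(reference):
        raise ValueError("sequences must have equal length")
    called = [(r, t) for r, t in zip(reconstructed, reference) if r != "N"]
    if not called:
        return 0.0
    return 100.0 * sum(r == t for r, t in called) / len(called)

# Illustrative probabilities for three alignment columns (made-up values).
pssm = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.40, "C": 0.30, "G": 0.20, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
]
with_threshold = consensus(pssm, threshold=0.5)  # uncertain middle column becomes 'N'
no_threshold = consensus(pssm, threshold=0.0)    # every column is called
print(with_threshold, recovery_rate(with_threshold, "ATG"))
print(no_threshold, recovery_rate(no_threshold, "ATG"))
```

Under these assumptions, raising the threshold trades coverage for accuracy: fewer positions are called, but the called ones are more likely to be correct.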
