BMC Bioinformatics. 2009 Jul 16;10:219.

doi: 10.1186/1471-2105-10-219.

OrthoSelect: a protocol for selecting orthologous groups in phylogenomics

Fabian Schreiber¹, Kerstin Pick, Dirk Erpenbeck, Gert Wörheide, Burkhard Morgenstern

PMID: 19607672
PMCID: PMC2719630
DOI: 10.1186/1471-2105-10-219

Fabian Schreiber et al. BMC Bioinformatics. 2009.

Affiliation

¹ Abteilung Bioinformatik, Institut für Mikrobiologie und Genetik, Georg-August-Universität Göttingen, Göttingen, Germany. fab.schreiber@gmail.com

Abstract

Background: Phylogenetic studies using expressed sequence tags (ESTs) are becoming a standard approach to answering evolutionary questions. Such studies are usually based on large sets of newly generated, unannotated, and error-prone EST sequences from different species. A first crucial step in EST-based phylogeny reconstruction is to identify groups of orthologous sequences. From these data sets, appropriate target genes are selected, and redundant sequences are eliminated to obtain suitable sequence sets as input data for tree-reconstruction software. Generating such data sets manually can be very time-consuming. Thus, software tools are needed that carry out these steps automatically.

Results: We developed a flexible and user-friendly software pipeline, running on desktop machines or computer clusters, that constructs data sets for phylogenomic analyses. It automatically searches assembled EST sequences against databases of orthologous groups (OGs), assigns ESTs to these predefined OGs, translates the sequences into proteins, eliminates redundant sequences assigned to the same OG, and creates multiple sequence alignments of the identified orthologous sequences. In an optional last step, the alignment can be processed further by excluding potentially homoplastic sites and selecting sufficiently conserved parts. Our software pipeline can be used as it is, but it can also be adapted by integrating additional external programs. This makes the pipeline useful for non-bioinformaticians as well as for bioinformatics experts. The software pipeline is designed especially for ESTs, but it can also handle protein sequences.

Conclusion: OrthoSelect is a tool that produces orthologous gene alignments from assembled ESTs. Our tests show that OrthoSelect detects orthologs in EST libraries with high accuracy. In the absence of a gold standard for orthology prediction, we compared predictions by OrthoSelect to a manually created and published phylogenomic data set. Our tool was not only able to rebuild the data set with a specificity of 98%, but it also detected four percent more orthologous sequences. Furthermore, the results OrthoSelect produces are in absolute agreement with the results of other programs, but our tool offers a significant speedup and additional functionality, e.g. handling of ESTs, computing sequence alignments, and refining them. To our knowledge, there is currently no fully automated and freely available tool for this purpose. Thus, OrthoSelect is a valuable tool for researchers in the field of phylogenomics who deal with large quantities of EST sequences. OrthoSelect is written in Perl and runs on Linux/Mac OS X. The tool can be downloaded at http://gobics.de/fabian/orthoselect.php.

Figures

**Figure 1**
**Workflow of OrthoSelect**. The main workflow of the software pipeline to detect orthologous sequences in phylogenomic studies. The input consists of EST libraries and an ortholog database (either KOG or OrthoMCL) as multi-FASTA files. The analysis comprises four parts. (1) Orthology detection, which can be performed on a single computer or a computer cluster, blasts each EST against the ortholog database and selects the closest ortholog group as the best hit. The EST is then translated and stored, together with its nucleotide sequence, in the corresponding OG. (2) Target genes can be selected. (3) The sequence most likely to be an ortholog is selected by eliminating potential paralogs. (4) Informative alignment columns are selected to increase the phylogenetic signal.

**Figure 2**
**Workflow of orthology assignment**. Workflow of our software pipeline. The two databases colored in green are to be supplied by the user. The ortholog database is converted into a BLAST database and clustered into ortholog groups. Each contig from the assembled EST library is assigned to the OG returned by a BLASTO search against the ortholog database.
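The assignment step described in this caption can be illustrated with a minimal sketch. It assumes the search results are already parsed into tuples and applies a plain best-hit rule, which is simpler than what BLASTO itself does. The function name `assign_contigs_to_ogs`, the tuple layout, and the `max_evalue` cutoff are hypothetical and are not taken from OrthoSelect.

```python
from collections import defaultdict


def assign_contigs_to_ogs(hits, subject_to_og, max_evalue=1e-10):
    """Assign each contig to the OG of its highest-scoring database hit.

    hits: iterable of (contig_id, subject_id, evalue, bitscore) tuples
          (assumed layout, e.g. parsed from tabular search output).
    subject_to_og: dict mapping database sequence IDs to OG identifiers.
    max_evalue: hypothetical cutoff; weaker hits are ignored.
    """
    best = {}  # contig_id -> (og_id, bitscore) of the best hit seen so far
    for contig, subject, evalue, bitscore in hits:
        if evalue > max_evalue or subject not in subject_to_og:
            continue
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject_to_og[subject], bitscore)

    # Group contigs by the OG they were assigned to.
    groups = defaultdict(list)
    for contig, (og, _score) in best.items():
        groups[og].append(contig)
    return dict(groups)


if __name__ == "__main__":
    example_hits = [
        ("contig1", "s1", 1e-50, 200.0),
        ("contig1", "s2", 1e-20, 90.0),
        ("contig2", "s2", 1e-30, 120.0),
        ("contig3", "s3", 1e-2, 30.0),  # too weak, stays unassigned
    ]
    og_map = {"s1": "KOG0001", "s2": "KOG0002", "s3": "KOG0003"}
    print(assign_contigs_to_ogs(example_hits, og_map))
```

Contigs without any hit below the cutoff are simply left out of the result, so downstream steps only see sequences that were placed in some OG.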

**Figure 3**
**Eliminating redundant sequences**. The figure shows how OrthoSelect eliminates redundant sequences. In this example, an OG contains three sequences each from organisms A and B and two sequences from organism C. All sequences are aligned pairwise to compute a distance matrix (left side). For each organism, the sequence that most often has the smallest distance to the sequences of another organism is selected (right side); see section for details.
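One possible reading of this selection rule is sketched below. For every organism, each other organism casts one vote for the sequence that lies closest to any of its own sequences, and the sequence with the most votes is kept. The function name `select_representatives`, the dictionary-based distance lookup, and the tie-breaking (first sequence in input order) are assumptions for illustration; the caption does not specify these details.

```python
def select_representatives(seqs_by_org, dist):
    """Keep one sequence per organism from a single OG.

    seqs_by_org: dict mapping organism name -> list of sequence IDs.
    dist: dict mapping frozenset({id_a, id_b}) -> pairwise distance
          (assumed representation of the distance matrix).
    """
    def d(a, b):
        return dist[frozenset((a, b))]

    chosen = {}
    for org, seqs in seqs_by_org.items():
        votes = {s: 0 for s in seqs}
        for other, other_seqs in seqs_by_org.items():
            if other == org:
                continue
            # The sequence of `org` that is closest to any sequence of `other`
            # receives one vote from that organism.
            closest = min(seqs, key=lambda s: min(d(s, o) for o in other_seqs))
            votes[closest] += 1
        # Ties are broken by input order (an assumption, not from the paper).
        chosen[org] = max(seqs, key=lambda s: votes[s])
    return chosen


if __name__ == "__main__":
    groups = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]}
    pairs = {
        ("a1", "b1"): 0.10, ("a1", "b2"): 0.50,
        ("a2", "b1"): 0.40, ("a2", "b2"): 0.60,
        ("a1", "c1"): 0.20, ("a2", "c1"): 0.30,
        ("b1", "c1"): 0.15, ("b2", "c1"): 0.70,
    }
    distances = {frozenset(k): v for k, v in pairs.items()}
    print(select_representatives(groups, distances))
```

Only between-organism distances are needed here, so the lookup table can omit pairs of sequences from the same organism.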

**Figure 4**
**Rebuilding the multiple sequence alignment**. The figure illustrates how OrthoSelect refines the multiple sequence alignments (MSA) created so far. Based on the MSA, a hidden Markov model (HMM) is built. Additionally, all EST libraries are translated using ESTScan with different matrices (ranging from *Arabidopsis thaliana* to *Homo sapiens*). The program *hmmsearch* from the HMMER package then uses the HMM to search all translated sequences and selects the best hit from each taxon above a given threshold. From these hits, the new MSA is then computed.
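The final filtering step of this refinement, keeping the best hit per taxon above a threshold, might look like the following sketch. It assumes the *hmmsearch* results have already been parsed into `(taxon, seq_id, score)` tuples and that the threshold applies to the score; both the tuple layout and the name `best_hit_per_taxon` are hypothetical.

```python
def best_hit_per_taxon(hits, min_score):
    """Return the highest-scoring sequence per taxon among hits >= min_score.

    hits: iterable of (taxon, seq_id, score) tuples (assumed layout).
    min_score: threshold below which hits are discarded (assumed to be a
               score cutoff; the source only says "a given threshold").
    """
    best = {}  # taxon -> (seq_id, score)
    for taxon, seq_id, score in hits:
        if score < min_score:
            continue
        if taxon not in best or score > best[taxon][1]:
            best[taxon] = (seq_id, score)
    return {taxon: seq_id for taxon, (seq_id, _score) in best.items()}


if __name__ == "__main__":
    parsed_hits = [
        ("taxonA", "A_frame1", 85.2),
        ("taxonA", "A_frame2", 140.7),
        ("taxonB", "B_frame1", 12.0),  # below threshold, taxon is dropped
    ]
    print(best_hit_per_taxon(parsed_hits, min_score=50.0))
```

Taxa whose hits all fall below the threshold do not appear in the result, so they would contribute no sequence to the rebuilt alignment.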

**Figure 5**
**Overview of functionality of OrthoSelect compared to other tools**. The figure illustrates the differences in functionality between OrthoSelect and other tools for orthology prediction. Both approaches build clusters of orthologous sequences. In addition, OrthoSelect can handle EST sequences and translate them correctly, and it processes these clusters further to select only one sequence per taxon, compute sequence alignments, and refine them. In contrast to the other tools, OrthoSelect outputs orthologous gene alignments that can be used directly in the subsequent phylogenetic analysis.

Cited by

Integrating multi-origin expression data improves the resolution of deep phylogeny of ray-finned fish (Actinopterygii).
Zou M, Guo B, Tao W, Arratia G, He S. Sci Rep. 2012;2:665. doi: 10.1038/srep00665. Epub 2012 Sep 18. PMID: 22993690.
Basal jawed vertebrate phylogenomics using transcriptomic data from Solexa sequencing.
Chen M, Zou M, Yang L, He S. PLoS One. 2012;7(4):e36256. doi: 10.1371/journal.pone.0036256. Epub 2012 Apr 27. PMID: 22558409.
A novel codon-based de Bruijn graph algorithm for gene construction from unassembled transcriptomes.
Peng G, Ji P, Zhao F. Genome Biol. 2016 Nov 17;17(1):232. doi: 10.1186/s13059-016-1094-x. PMID: 27855707.
Orthograph: a versatile tool for mapping coding nucleotide sequences to clusters of orthologous genes.
Petersen M, Meusemann K, Donath A, Dowling D, Liu S, Peters RS, Podsiadlowski L, Vasilikopoulos A, Zhou X, Misof B, Niehuis O. BMC Bioinformatics. 2017 Feb 16;18(1):111. doi: 10.1186/s12859-017-1529-8. PMID: 28209129.
Insect phylogenomics: exploring the source of incongruence using new transcriptomic data.
Simon S, Narechania A, Desalle R, Hadrys H. Genome Biol Evol. 2012;4(12):1295-309. doi: 10.1093/gbe/evs104. PMID: 23175716.
