Comparative Study

BMC Bioinformatics. 2002 May 16:3:14.

doi: 10.1186/1471-2105-3-14.

RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs

Christian M Zmasek¹, Sean R Eddy

PMID: 12028595
PMCID: PMC116988
DOI: 10.1186/1471-2105-3-14

Christian M Zmasek et al. BMC Bioinformatics. 2002.

Affiliation

¹ Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, St Louis, MO 63110, USA. zmasek@genetics.wustl.edu

Abstract

Background: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication). The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics") is widely recognized, but existing approaches are either manual or not explicitly based on phylogenetic trees.

Results: Here we present RIO (Resampled Inference of Orthologs), a procedure for automated phylogenomics using explicit phylogenetic inference. RIO analyses are performed over bootstrap resampled phylogenetic trees to estimate the reliability of orthology assignments. We also introduce supplementary concepts that are helpful for functional inference. RIO has been implemented as a Perl pipeline connecting several C and Java programs. It is available at http://www.genetics.wustl.edu/eddy/forester/. A web server is at http://www.rio.wustl.edu/. RIO was tested on the Arabidopsis thaliana and Caenorhabditis elegans proteomes.

Conclusion: The RIO procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. We also describe how some orthologies can be misleading for functional inference.
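
The Results paragraph relies on bootstrap resampling. As a minimal sketch of what one replicate looks like at the alignment level, assuming the usual phylogenetic bootstrap (alignment columns drawn with replacement), the Python below resamples a toy alignment. The sequence names, the sequences, and the function `bootstrap_alignment` are illustrative assumptions and are not taken from the RIO pipeline.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns sampled with replacement."""
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }


# Toy alignment (hypothetical data): three aligned sequences of equal length.
toy_alignment = {"seq1": "MKVLA", "seq2": "MRVLG", "seq3": "MKILA"}

# 100 replicates, each seeded separately so the run is reproducible.
replicates = [
    bootstrap_alignment(toy_alignment, random.Random(seed))
    for seed in range(100)
]
print(len(replicates), "replicates; first:", replicates[0])
```

In a resampling analysis of this kind, a tree would then be inferred from each replicate and the orthology inference repeated on every tree.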

Figures

**Figure 1**
**Over-annotation due to database bias or gene loss under equal rates of evolution** Species harboring the sequences are indicated. Two cases are depicted. In A, the query sequence belongs to the "Y" subfamily, which can be correctly inferred by both sequence-similarity-based and phylogenetic-tree-based methods (the query is most similar to "Y" of rat and mouse). In short, in situation A, orthology and "most similar" do (partially) overlap. In B, the query is actually a member of a third subfamily "X", but this can only be inferred by considering the evolutionary history of the sequence family. Sequence-similarity-based methods would misleadingly indicate that this query belongs to "Y", since it is most similar to "Y" in rat, mouse and wheat. In short, in situation B, orthology and "most similar" do not correspond. Observe that if members of "X" had already been present in the database (no gene loss and complete sampling), the query in B could have been correctly assigned to the "X" subfamily (under equal rates of evolution).

**Figure 2**
**Over-annotation due to unequal rates of evolution** Sequence-similarity-based methods would indicate that the query is a member of the "Z" subfamily. Phylogenetic-tree-based methods correctly identify it as a member of subfamily "Y".

**Figure 3**
**The reasons for introducing super-orthologs** Examples of how inferring the biological role of a query sequence by simply transferring functional annotation from an orthologous sequence might lead to inaccuracies. These potential pitfalls led us to introduce the concept of super-orthologs (Definition 1).
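
Definition 1 is not reproduced in this excerpt. The sketch below assumes the following reading: two sequences are orthologous if their last common ancestor in the gene tree is a speciation, and super-orthologous if every internal node on the path connecting them is a speciation. The tuple-based tree encoding, the `"S"`/`"D"` labels, the function names, and the toy trees are all illustrative assumptions.

```python
def node_path(node, target):
    """Nodes from `node` down to the leaf named `target`, or None if absent."""
    if isinstance(node, str):  # a leaf is just its name
        return [node] if node == target else None
    for child in node[1:]:  # internal node: (event, left, right)
        sub = node_path(child, target)
        if sub is not None:
            return [node] + sub
    return None


def connecting_events(tree, a, b):
    """Event labels of internal nodes on the path a..b, LCA first."""
    pa, pb = node_path(tree, a), node_path(tree, b)
    if pa is None or pb is None:
        raise ValueError("both sequences must be leaves of the tree")
    shared = 0
    while shared < min(len(pa), len(pb)) and pa[shared] is pb[shared]:
        shared += 1
    # pa[shared - 1] is the last common ancestor of a and b.
    internal = pa[shared - 1:-1] + pb[shared:-1]
    return [n[0] for n in internal]


def is_ortholog(tree, a, b):
    events = connecting_events(tree, a, b)
    return bool(events) and events[0] == "S"


def is_super_ortholog(tree, a, b):
    events = connecting_events(tree, a, b)
    return bool(events) and all(e == "S" for e in events)


# "x" is separated from "q" by a speciation at the root, but a duplication
# ("q" vs "p") lies on the connecting path.
tree_with_dup = ("S", ("D", "q", "p"), "x")
tree_no_dup = ("S", ("S", "q", "y"), "x")

print(is_ortholog(tree_with_dup, "q", "x"))        # True
print(is_super_ortholog(tree_with_dup, "q", "x"))  # False
print(is_super_ortholog(tree_no_dup, "q", "x"))    # True
```

In the first toy tree, "x" is an ortholog of both "q" and "p", so its annotation alone cannot say which of the two duplicated copies kept the ancestral role. Under the assumed definition, the super-ortholog test excludes exactly this case.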

**Figure 4**
**An example of ultra-paralogous sequences**

**Figure 5**
**An illustration of subtree-neighbors** The dotted subtrees could either be just one external node or a subtree of arbitrary size and topology. Species information is of no consequence for the concept of subtree-neighbors. The subtree-neighbors depicted here are for the default of k = 2.
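
A minimal sketch of the subtree-neighbor idea, assuming the k-subtree-neighbors of a query are all sequences descended from the query's k-th ancestral node, excluding the query itself. Consistent with the caption, no species information is used. The nested-tuple tree encoding, the function names, and the toy tree are illustrative assumptions.

```python
def leaves(node):
    """Set of leaf names below `node` (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def path_to(node, query):
    """Nodes from `node` down to the leaf `query`, or None if absent."""
    if isinstance(node, str):
        return [node] if node == query else None
    for child in node:
        sub = path_to(child, query)
        if sub is not None:
            return [node] + sub
    return None


def subtree_neighbors(tree, query, k=2):
    """Leaves under the k-th ancestor of `query`, without `query` itself."""
    path = path_to(tree, query)
    if path is None:
        raise ValueError("query is not a leaf of the tree")
    # path[-1] is the query; step k nodes up, stopping at the root.
    ancestor = path[max(0, len(path) - 1 - k)]
    return leaves(ancestor) - {query}


toy_tree = ((("q", "a"), "b"), ("c", "d"))
print(sorted(subtree_neighbors(toy_tree, "q")))        # ['a', 'b']
print(sorted(subtree_neighbors(toy_tree, "q", k=1)))   # ['a']
```

If k exceeds the depth of the query, this sketch falls back to the root, which is a design choice of the example rather than something stated in the caption.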

**Figure 10**
**A phylogenetic tree for O-methyltransferases produced by RIO** This tree is based on the Pfam alignment Methyltransf_2 (PF00891). It has been constructed in the same manner as the tree in Figure 8. (TOBAC: *Nicotiana tabacum*, ARATH: *Arabidopsis thaliana*, MAIZE: *Zea mays*, HORVU: *Hordeum vulgare*, WHEAT: *Triticum aestivum*, PEA: *Pisum sativum*, RHOSH: *Rhodobacter sphaeroides*, RHOCA: *Rhodobacter capsulatus*, BOVIN: *Bos taurus*, CHICK: *Gallus gallus*, RAT: *Rattus norvegicus*, MYCTU: *Mycobacterium tuberculosis*.) The *A. thaliana* query sequence F16P17_38 is labeled with Q. The bootstrap orthology values for potential orthologs are indicated in brackets (the brightness of the green color is proportional to this value). The apparent trifurcation at the root is caused by a branch length of 0.0 (the bacterial hydroxyneurosporene methyltransferases subtree and the plant O-methyltransferases subtree are connected by a speciation event). Inferred gene duplications are indicated by circles. According to this tree, F16P17_38 has orthologs only in bacteria.

**Figure 11**
**RIO output for the *A. thaliana* protein F16P17_38 analyzed against the Pfam Methyltransf_2 domain alignment (PF00891)** For an explanation of the output see Figure 7. The output is sorted by orthology values. According to this RIO analysis the orthologs of F16P17_38 are bacterial hydroxyneurosporene methyltransferases. These contrast with the subtree-neighbors of F16P17_38 which are all plant O-methyltransferases.

**Figure 6**
**A simple example of the RIO procedure** Four bootstrap resampled gene trees are shown. Letters represent sequence names/"functions". "A" (nematode and wheat) are true orthologs of the human query sequence, whereas "B" (rat) is a true paralog of the query (i.e. the first tree happens to be the real one). In 3 out of 4 trees nematode "A" appears orthologous to the query, and in 3 out of 4 trees wheat "A" appears orthologous to the query. Rat "B" never appears to be orthologous. For an example of actual RIO output see Figure 7.
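
The counting step described in this caption can be sketched as follows. The four topologies are hypothetical stand-ins chosen to reproduce the counts stated in the caption (the figure itself is not reproduced here), and the speciation/duplication labels are assumed to be given in advance. The tuple encoding and function names are illustrative, not RIO's.

```python
def leaves(node):
    """Set of leaf names below `node`; internal nodes are (event, left, right)."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def orthologs(node, query):
    """Leaves whose last common ancestor with `query` is a speciation ("S")."""
    if isinstance(node, str):
        return set()
    event, left, right = node
    if query in leaves(left):
        inside, outside = left, right
    elif query in leaves(right):
        inside, outside = right, left
    else:
        return set()
    found = orthologs(inside, query)
    if event == "S":
        found |= leaves(outside)
    return found


def orthology_support(trees, query):
    """Fraction of trees in which each sequence is orthologous to `query`."""
    names = set().union(*(leaves(t) for t in trees)) - {query}
    return {
        name: sum(name in orthologs(t, query) for t in trees) / len(trees)
        for name in names
    }


# Hypothetical bootstrap trees; the first one plays the role of the true tree.
trees = [
    ("D", ("S", ("S", "human_A", "nematode_A"), "wheat_A"), "rat_B"),
    ("D", ("S", "human_A", "nematode_A"), ("S", "wheat_A", "rat_B")),
    ("D", ("S", "human_A", "wheat_A"), ("S", "nematode_A", "rat_B")),
    ("S", "wheat_A", ("D", "rat_B", ("S", "human_A", "nematode_A"))),
]

support = orthology_support(trees, "human_A")
for name in sorted(support, key=support.get, reverse=True):
    print(f"{name}: {support[name]:.0%}")
```

With these toy trees, nematode "A" and wheat "A" each receive 75% support and rat "B" receives 0%, matching the 3-out-of-4 and never counts in the caption.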

**Figure 7**
**RIO output for the *A. thaliana* protein F12M16_14 analyzed against the Pfam ldh domain alignment (PF00056)** The "Sequence" column identifies sequences in the Pfam alignment either by their SWISS-PROT "ID" or their TrEMBL "AC" [36] with added species information (the numbers after the dash are the Pfam domain boundaries added by HMMER). "Description" is the "DE" information either from SWISS-PROT or TrEMBL. The number of observed orthologies ("o"), subtree-neighborings ("n"), and super-orthologies ("s") to the query in the 100 bootstrapped trees are indicated (in %) for the sequences in the Pfam alignment. Furthermore, the evolutionary distances (average number of amino acid replacements per residue calculated by maximum likelihood based on the BLOSUM 62 matrix) between the query and the sequences in the Pfam alignment are shown. For space reasons some lines of the output are not shown ("...") (the complete output is available at http://www.genetics.wustl.edu/eddy/forester/rio_analyses/RIO_paper/AT_LDH_MDH/). The output is sorted by orthology values. According to this RIO analysis the query sequence is likely to be orthologous and a subtree-neighbor to the plant sequences MDHM_BRANA and Q9SPB8_SOYBN. In addition, the query is likely to be super-orthologous to MDHM_BRANA. The bacterial sequences MDH_ECOLI and MDH_SALTY are also possibly orthologs but not subtree-neighbors. Hence, F12M16_14 is very likely to be a malate dehydrogenase and possibly mitochondrial.

**Figure 8**
**A phylogenetic tree for zinc-binding dehydrogenases produced by RIO** This tree is based on the Pfam alignment adh_zinc (PF00107) and is a subtree of a larger tree. It has been calculated by the neighbor joining method using maximum likelihood pairwise distances [34] based on the BLOSUM 62 matrix [25]. Gene duplications are indicated by circles (inferred by our SDI algorithm [13]). The tree was rooted by minimizing the sum of duplications. The tree image was produced by ATV [33]. Species are represented by their SWISS-PROT abbreviations (ARATH: *Arabidopsis thaliana*, TOBAC: *Nicotiana tabacum*, MAIZE: *Zea mays*, MYCTU: *Mycobacterium tuberculosis*, BACSU: *Bacillus subtilis*, LEIMA: *Leishmania major*, HELPY: *Helicobacter pylori*, SYNY3: *Synechocystis* sp. strain PCC 6803, YEAST: *Saccharomyces cerevisiae*, KLULA: *Kluyveromyces lactis*, KLUMA: *Kluyveromyces marxianus*, CANAL: *Candida albicans*, EMENI: *Emericella nidulans*, SCHPO: *Schizosaccharomyces pombe*, CAEEL: *Caenorhabditis elegans*, BACST: *Bacillus stearothermophilus*). The *A. thaliana* query sequence F28P22_13 is labeled with Q. The bootstrap orthology values for potential orthologs are indicated in brackets. According to this tree, F28P22_13 has no orthologs.
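
To illustrate how duplications can be inferred by comparing a gene tree with a species tree, the sketch below applies the standard last-common-ancestor mapping rule: a gene-tree node is called a duplication when it maps to the same species-tree node as one of its children. This naive version is not the SDI implementation cited in the caption. The species tree, the gene names, and the function names are illustrative assumptions.

```python
# Hypothetical species tree as a child -> parent map (root has parent None).
SPECIES_PARENT = {
    "human": "mammals",
    "rat": "mammals",
    "mammals": "animals",
    "nematode": "animals",
    "animals": "eukaryotes",
    "wheat": "eukaryotes",
    "eukaryotes": None,
}


def ancestors(species):
    """The species itself followed by its ancestors up to the root."""
    chain = []
    while species is not None:
        chain.append(species)
        species = SPECIES_PARENT[species]
    return chain


def species_lca(a, b):
    """Last common ancestor of two species-tree nodes."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("species are not in the same tree")


def label_events(node, species_of):
    """Return (mapped species node, gene tree labeled with "S"/"D")."""
    if isinstance(node, str):
        return species_of[node], node
    left, right = node
    map_left, labeled_left = label_events(left, species_of)
    map_right, labeled_right = label_events(right, species_of)
    mapped = species_lca(map_left, map_right)
    # Duplication if the node maps to the same place as one of its children.
    event = "D" if mapped in (map_left, map_right) else "S"
    return mapped, (event, labeled_left, labeled_right)


species_of = {
    "human_A": "human",
    "nematode_A": "nematode",
    "wheat_A": "wheat",
    "rat_B": "rat",
}
gene_tree = ((("human_A", "nematode_A"), "wheat_A"), "rat_B")

_, labeled = label_events(gene_tree, species_of)
print(labeled)
```

Rooting by minimizing the sum of duplications, as mentioned in the caption, could be layered on top by labeling each candidate rooting and keeping the one with the fewest "D" nodes. That step is omitted here.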

**Figure 9**
**RIO output for the *A. thaliana* protein F28P22_13 analyzed against the Pfam adh_zinc domain alignment (PF00107)** For an explanation of the output see Figure 7. For space reasons some lines of the output are not shown ("...") (the complete output is available at http://www.genetics.wustl.edu/eddy/forester/rio_analyses/RIO_paper/F28P22_13/). The output is sorted by orthology values. According to this RIO analysis the query sequence is likely to have no orthologs in this alignment. In contrast, the query probably has subtree-neighbors which are cinnamyl-alcohol dehydrogenases (EC 1.1.1.195), NADP-dependent alcohol dehydrogenases (EC 1.1.1.2), as well as other zinc-containing alcohol dehydrogenases.

References

1. Dayhoff MO. The origin and evolution of protein superfamilies. Fed Proc. 1976;35:2132–2138.
2. Ingram VM. Gene evolution and the haemoglobins. Nature. 1961;189:704–708.
3. Haldane JBS. The causes of evolution. New York and London: Harper & Brothers Publishers; 1932.
4. Ohno S. Evolution by gene duplication. New York: Springer-Verlag; 1970.
5. Galperin MY, Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1998;1:55–67.

Grants and funding

HG01363/HG/NHGRI NIH HHS/United States
