Mol Biol Evol. 2014 Nov;31(11):3081-92.
doi: 10.1093/molbev/msu245. Epub 2014 Aug 25.

Orthology inference in nonmodel organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics

Ya Yang et al. Mol Biol Evol. 2014 Nov.

Abstract

Orthology inference is central to phylogenomic analyses. Phylogenomic data sets commonly include transcriptomes and low-coverage genomes that are incomplete and contain errors and isoforms. These properties can severely violate the underlying assumptions of orthology inference with existing heuristics. We present a procedure that uses phylogenies for both homology and orthology assignment. The procedure first uses similarity scores to infer putative homologs, which are then aligned, used to construct phylogenies, and pruned of spurious branches caused by deep paralogs, misassembly, frameshifts, or recombination. These final homologs are then used to identify orthologs. We explore four alternative tree-based orthology inference approaches, of which two are new. These accommodate gene and genome duplications as well as gene tree discordance. We demonstrate these methods in three published data sets, covering the grape family, Hymenoptera, and millipedes, with divergence times ranging from approximately 100 to over 400 Ma. The procedure significantly increased the completeness and accuracy of the inferred homologs and orthologs. We also found that data sets that are more recently diverged and/or include more high-coverage genomes had more complete sets of orthologs. To explicitly evaluate sources of conflicting phylogenetic signals, we applied serial jackknife analyses of gene regions, keeping each locus intact. The methods described here can scale to over 100 taxa. They have been implemented in Python with independent scripts for each step, making it easy to modify them or incorporate them into existing pipelines. All scripts are available from https://bitbucket.org/yangya/phylogenomic_dataset_construction.

Keywords: Diplopoda; RNA-seq; Vitaceae; phylotranscriptomics.
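
The serial jackknife over loci mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' script: the function names `jackknife_loci` and `concatenate`, the `keep_fraction` parameter, and the toy alignments are all assumptions made for this example. The point it illustrates is that whole loci are sampled without replacement, so every retained locus stays intact in the resulting supermatrix.

```python
import random


def jackknife_loci(loci, keep_fraction, rng):
    """Return a random subset of whole loci, sampled without replacement.

    `keep_fraction` is a hypothetical parameter: the share of loci retained
    in one replicate. At least one locus is always kept.
    """
    n_keep = max(1, round(len(loci) * keep_fraction))
    return rng.sample(loci, n_keep)


def concatenate(loci, taxa, gap="-"):
    """Concatenate aligned loci into a supermatrix.

    Each locus is a dict mapping taxon name to an aligned sequence.
    Taxa missing from a locus are padded with gap characters.
    """
    parts = {taxon: [] for taxon in taxa}
    for locus in loci:
        length = len(next(iter(locus.values())))
        for taxon in taxa:
            parts[taxon].append(locus.get(taxon, gap * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy data (hypothetical): four aligned loci with uneven taxon sampling.
loci = [
    {"A": "ATGC", "B": "ATGA", "C": "ATGT"},
    {"A": "GG", "B": "GA"},
    {"B": "TTT", "C": "TTA"},
    {"A": "CCCCC", "B": "CCCCA", "C": "CCCCT"},
]
taxa = ["A", "B", "C"]

rng = random.Random(0)  # fixed seed so the replicate is reproducible
replicate = jackknife_loci(loci, 0.5, rng)
supermatrix = concatenate(replicate, taxa)
```

Each replicate's supermatrix would then be passed to a tree-inference program, and support would be summarized across replicates; that step is outside the scope of this sketch.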

Figures

Fig. 1.
Flow chart of homology and orthology inferences.
Fig. 2.
Ortholog taxon occupancy ranked from high to low. Orthologs with fewer than eight taxa (six for MIL–RT) are not shown.
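
A ranking of this kind can be computed along the following lines. This is a sketch under stated assumptions: the function name `rank_by_occupancy`, the input layout (ortholog name mapped to a list of taxon names), and the toy data are hypothetical and are not taken from the authors' scripts. The `min_taxa` argument plays the role of the cutoff in the caption (eight, or six for MIL–RT).

```python
def rank_by_occupancy(orthologs, min_taxa):
    """Rank orthologs by the number of distinct taxa they contain.

    Orthologs with fewer than `min_taxa` distinct taxa are dropped.
    Ties are broken by ortholog name so the order is deterministic.
    """
    counts = {name: len(set(taxa)) for name, taxa in orthologs.items()}
    kept = [(name, n) for name, n in counts.items() if n >= min_taxa]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical): ortholog name -> taxa present in that ortholog.
orthologs = {
    "og1": ["A", "B", "C"],
    "og2": ["A", "B"],
    "og3": ["A"],
    "og4": ["A", "B", "C", "D"],
}

ranked = rank_by_occupancy(orthologs, min_taxa=2)
# ranked == [("og4", 4), ("og1", 3), ("og2", 2)]
```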
Fig. 3.
Maximum-likelihood analysis of the HYM data set. Taxon names were abbreviated to the first four letters of the genus names except in the left-most tree. Orthology inference methods: MI, maximum inclusion; RT, extracting rooted ingroup clades; MO, monophyletic outgroups; 1to1, filtered one-to-one orthologs. All nodes received bootstrap and 30% jackknife support values of 100 and are not shown. Node labels are also not shown if all support values are 100. Arrows indicate nodes with relatively low support.
Fig. 4.
Maximum-likelihood analysis of the GRP CDS data set. Taxon names were replaced by the collection numbers except in the top left tree. Orthology inference methods: MI, maximum inclusion; RT, extracting rooted ingroup clades; MO, monophyletic outgroups; 1to1, filtered one-to-one orthologs. Node labels are not shown when all support values are 100. Arrows indicate nodes with relatively low support.
Fig. 5.
Maximum-likelihood analysis of the MIL data set. Taxon names were abbreviated to the first four letters of the genus names except in the top left tree. Orthology inference methods: MI, maximum inclusion; RT, extracting rooted ingroup clades; MO, monophyletic outgroups; 1to1, filtered one-to-one orthologs. Node labels are not shown when all support values are 100. Arrows indicate nodes with relatively low support.
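
Of the four methods named in the captions, the "1to1" filter is the simplest to illustrate. The sketch below assumes that a one-to-one ortholog is a homolog group in which every taxon is represented by exactly one sequence and that a minimum number of taxa is required; both the interpretation and the names (`is_one_to_one`, `min_taxa`, the `(taxon, sequence_id)` tip layout) are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter


def is_one_to_one(tips, min_taxa):
    """Return True if no taxon is duplicated and enough taxa are present.

    `tips` is a list of (taxon, sequence_id) pairs for one homolog group.
    """
    counts = Counter(taxon for taxon, _seq_id in tips)
    return len(counts) >= min_taxa and all(n == 1 for n in counts.values())


# Toy homolog groups (hypothetical data).
homologs = {
    "h1": [("A", "s1"), ("B", "s2"), ("C", "s3")],  # one sequence per taxon
    "h2": [("A", "s1"), ("A", "s2"), ("B", "s3")],  # taxon A is duplicated
    "h3": [("A", "s1"), ("B", "s2")],               # too few taxa
}

kept = [name for name, tips in homologs.items() if is_one_to_one(tips, min_taxa=3)]
# kept == ["h1"]
```

The other three methods (MI, RT, MO) operate on tree topology rather than on tip counts alone and are not covered by this sketch.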