PLoS One. 2013;8(2):e56925.
doi: 10.1371/journal.pone.0056925. Epub 2013 Feb 25.

The impact of gene duplication, insertion, deletion, lateral gene transfer and sequencing error on orthology inference: a simulation study

Daniel A Dalquen et al. PLoS One. 2013.

Abstract

The identification of orthologous genes, a prerequisite for numerous analyses in comparative and functional genomics, is commonly performed computationally from protein sequences. Several previous studies have compared the accuracy of orthology inference methods, but simulated data has not typically been considered in cross-method assessment studies. Yet, while dependent on model assumptions, simulation-based benchmarking offers unique advantages: contrary to empirical data, all aspects of simulated data are known with certainty. Furthermore, the flexibility of simulation makes it possible to investigate performance factors in isolation from one another.

Here, we use simulated data to dissect the performance of six methods for orthology inference available as standalone software packages (Inparanoid, OMA, OrthoInspector, OrthoMCL, QuartetS, SPIMAP) as well as two generic approaches (bidirectional best hit and reciprocal smallest distance). We investigate the impact of various evolutionary forces (gene duplication, insertion, deletion, and lateral gene transfer) and technological artefacts (ambiguous sequences) on orthology inference. We show that while gene duplication/loss and insertion/deletion are well handled by most methods (albeit for different trade-offs of precision and recall), lateral gene transfer disrupts all methods. As for ambiguous sequences, which might result from poor sequencing, assembly, or genome annotation, we show that they affect alignment score-based orthology methods more strongly than their distance-based counterparts.
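As a rough illustration of the bidirectional best hit approach named above, the sketch below pairs two genes when each is the other's top-scoring match. It is a minimal sketch, not the implementation evaluated in the study: the function names, the dictionary-of-scores input format, and the toy scores are all assumptions made for this example.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target gene.

    `scores` is a hypothetical input format: {(query, target): score}.
    Ties are broken in favour of the pair seen first.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores_ab, scores_ba):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(scores_ab)
    best_ba = best_hits(scores_ba)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy alignment scores (illustrative values only, not data from the study).
scores_ab = {
    ("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
    ("a2", "b1"): 200.0, ("a2", "b2"): 120.0,
}
scores_ba = {
    ("b1", "a1"): 250.0, ("b1", "a2"): 200.0,
    ("b2", "a1"): 90.0, ("b2", "a2"): 120.0,
}

pairs = bidirectional_best_hits(scores_ab, scores_ba)
print(pairs)
```

In this toy input, a2 scores higher against b1 than against b2, so the a2/b2 pair is not reported even though b2's best hit is a2. This is one way a strict reciprocity criterion can trade recall for precision.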

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Orthology inference vs. gene duplication.
Precision/recall of orthology inference with different proportions of genes with a history of duplications. Each data point corresponds to the mean over all orthologous relations in five replicates (with 95% confidence interval of the mean values in both dimensions).
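The quantities plotted in these figures can be sketched as follows: precision and recall of predicted orthologous pairs against the known (simulated) truth, then a mean with a 95% confidence interval over replicates. This is a minimal sketch under stated assumptions. The function names and toy pairs are hypothetical, the interval uses a Student t critical value for five replicates (the caption does not say which interval construction the authors used), and averaging per-replicate values is only one reading of "mean over all orthologous relations in five replicates".

```python
import math
import statistics


def precision_recall(predicted, true):
    """Precision and recall of predicted ortholog pairs; pair order is ignored."""
    predicted = {frozenset(pair) for pair in predicted}
    true = {frozenset(pair) for pair in true}
    true_positives = len(predicted & true)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    return precision, recall


def mean_ci95(values, t_crit=2.776):
    """Mean and 95% CI half-width of the mean.

    t_crit = 2.776 is the two-sided 95% Student t critical value for
    4 degrees of freedom, i.e. five replicates (an assumption here).
    """
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Toy example: four true pairs, three predictions, two of them correct.
true_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted_pairs = [("a1", "b1"), ("b2", "a2"), ("a3", "b4")]
precision, recall = precision_recall(predicted_pairs, true_pairs)

# Hypothetical per-replicate recall values for five replicates.
mean_recall, half_width = mean_ci95([0.8, 0.9, 1.0, 0.9, 0.9])
```

Because the truth is known exactly in simulated data, both sets of pairs are fully enumerable, which is what makes precision and recall directly computable here.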
Figure 2. Orthology inference vs. gene duplication with varying loss rates.
Precision/recall of orthology inference with different proportions of genes with a history of duplications and varying relative loss rates. Each data point corresponds to the mean over all orthologous relations in five replicates (with 95% confidence interval of the mean values in both dimensions).
Figure 3. Orthology inference vs. LGT.
Precision/recall of orthology predictions with different proportions of genes with a history of lateral gene transfer. Each data point corresponds to the mean over all orthologous relations in five replicates (with 95% confidence interval of the mean values in both dimensions).
Figure 4. Orthology inference vs. insertions and deletions.
Precision/recall of orthology predictions with different rates of insertion and deletion events. Each data point corresponds to the mean over all orthologous relations in five replicates (with 95% confidence interval of the mean values in both dimensions).
Figure 5. Alignment score vs. distance.
Pairwise alignment scores compared to Percent Accepted Mutation (PAM) distance for one run of mammalia-like dataset 1. (A) With an insertion and deletion rate of 0.001. (B) With 18 percent ambiguous characters. In both panels, scores were normalised by the sum of the aligned characters in both sequences.
Figure 6. Orthology inference vs. sequencing artefacts.
Precision/recall of orthology predictions with different proportions of ambiguous (i.e. “X”) characters. Each data point corresponds to the mean over all orthologous relations in five replicates (with 95% confidence interval of the mean values in both dimensions).
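To make "a proportion of ambiguous characters" concrete, the sketch below replaces a fixed fraction of residues in a protein sequence with "X". How the study actually introduced ambiguous characters is not described in this text, so the uniform random placement, the function name `mask_with_x`, and the toy sequence are all assumptions for illustration.

```python
import random


def mask_with_x(sequence, proportion, rng):
    """Replace a given proportion of positions, chosen uniformly at random, with 'X'."""
    n_mask = round(len(sequence) * proportion)
    positions = set(rng.sample(range(len(sequence)), n_mask))
    return "".join("X" if i in positions else residue
                   for i, residue in enumerate(sequence))


# Toy 100-residue sequence built from the 20 standard amino acid letters.
original = "ACDEFGHIKLMNPQRSTVWY" * 5
rng = random.Random(0)  # fixed seed so the example is reproducible
masked = mask_with_x(original, 0.18, rng)
print(masked.count("X"), "of", len(masked), "residues masked")
```

Masked positions carry no residue information, so they can be expected to contribute little to an alignment score. That is consistent with the abstract's observation that score-based methods are affected more strongly than distance-based ones.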
Figure 7. Species trees for bacteria-like dataset 1 and mammalia-like dataset 1.
Species trees used in the simulations of bacteria-like dataset 1 (A), sampled from bacteria tree, and mammalia-like dataset 1 (B), sampled from mammalia tree.

