Bioinformatics. 2019 Apr 1;35(7):1159-1166.
doi: 10.1093/bioinformatics/bty772.

Phylogenetic approaches to identifying fragments of the same gene, with application to the wheat genome

Ivana Piližota et al. Bioinformatics.

Abstract

Motivation: As the time and cost of sequencing decrease, the number of available genomes and transcriptomes rapidly increases. Yet the quality of the assemblies and the gene annotations varies considerably and often remains poor, affecting downstream analyses. This is particularly true when fragments of the same gene are annotated as distinct genes, which may cause them to be mistaken for paralogs.

Results: In this study, we introduce two novel phylogenetic tests to infer non-overlapping or partially overlapping genes that are in fact parts of the same gene. One approach collapses branches with low bootstrap support and the other computes a likelihood ratio test. We extensively validated these methods by (i) introducing and recovering fragmentation on chromosome 3B of bread wheat, Triticum aestivum cv. Chinese Spring; (ii) applying the methods to the low-quality 3B assembly and validating predictions against the high-quality 3B assembly; and (iii) comparing the performance of the proposed methods with that of existing methods, namely Ensembl Compara and ESPRIT. Applying a combination of the two tests to a draft shotgun assembly of the entire bread wheat genome revealed 1221 pairs of genes that are highly likely to be fragments of the same gene. Our approach demonstrates the power of fine-grained evolutionary inferences across multiple species to improve genome assemblies and annotations.

Availability and implementation: An open source software tool is available at https://github.com/DessimozLab/esprit2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Conceptual overview of the likelihood ratio test. The null hypothesis is that the two fragments come from the same gene (Hs), while the alternative hypothesis is that the two fragments come from different paralogous copies (Hp). α refers to the significance level, which is the area under the curve above the rejection threshold. This setup is motivated by the fact that the split gene hypothesis has fewer parameters. However, it is unusual in that failure to reject the null hypothesis leads to a prediction, and not the other way round. Furthermore, because the two models are not nested, we estimate the null distribution empirically.
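The decision rule in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the test statistic is assumed to be a single number for which larger values favour Hp, the null samples are assumed to have been generated under Hs by some procedure not described here, and the add-one correction in the p-value is a choice made for this sketch. The default significance level of 0.01 is the value quoted for the LRT in the Fig. 2 caption.

```python
from typing import Sequence


def empirical_p_value(observed: float, null_samples: Sequence[float]) -> float:
    """Upper-tail p-value of `observed` under an empirically sampled null.

    `null_samples` are statistics assumed to be simulated under Hs
    (both fragments come from the same gene).
    """
    if not null_samples:
        raise ValueError("need at least one null sample")
    # Count null statistics at least as extreme as the observed one.
    exceed = sum(1 for s in null_samples if s >= observed)
    # Add-one correction (an assumption of this sketch) keeps the
    # estimate strictly positive.
    return (exceed + 1) / (len(null_samples) + 1)


def predict_same_gene(observed: float,
                      null_samples: Sequence[float],
                      alpha: float = 0.01) -> bool:
    """Return True when Hs is NOT rejected at significance level alpha.

    Unusually, it is the failure to reject the null that yields the
    prediction that the two fragments belong to one gene.
    """
    return empirical_p_value(observed, null_samples) > alpha
```

For example, with `null_samples = list(range(100))`, an observed statistic of 50 lies well inside the null distribution, so the pair is predicted to be a split gene, whereas an observed statistic of 1000 falls in the rejection region and the pair is treated as paralogous.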
Fig. 2.
Evaluation of the methods. (a) Wheat genes from the high-quality wheat 3B chromosome were artificially fragmented and recovered by the collapsing approach, the likelihood ratio test (LRT) and a combination of the two. Numbers indicate the threshold used for each datapoint. (b) Split genes inferred on the low-quality (‘survey’) wheat genome were validated using the high-quality wheat 3B and compared with three other approaches (Ensembl Compara, ESPRIT and the meta-method). Numbers indicate the threshold used for each datapoint. The meta-method takes the union of ESPRIT’s predictions and those inferred by combining the collapsing approach (threshold 0.95) with the LRT (significance 0.01). (c) The number of predictions on the 3B survey sequence classified as correct in the BLAST+ validation. ‘New approach’ denotes a combination (intersection) of the collapsing approach (threshold 0.95) and the LRT (significance 0.01).
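The way the prediction sets are combined in this caption amounts to simple set operations. The Python sketch below shows one way to express them; the function names and gene identifiers are hypothetical, and each prediction is assumed to be an unordered pair of gene IDs.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A prediction is an unordered pair of gene identifiers.
Pair = FrozenSet[str]


def as_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Normalise (a, b) tuples so that (a, b) and (b, a) compare equal."""
    return {frozenset(p) for p in pairs}


def new_approach(collapsing: Set[Pair], lrt: Set[Pair]) -> Set[Pair]:
    """Intersection: pairs predicted by both the collapsing approach and the LRT."""
    return collapsing & lrt


def meta_method(esprit: Set[Pair], collapsing: Set[Pair], lrt: Set[Pair]) -> Set[Pair]:
    """Union of ESPRIT's predictions with those of the combined new approach."""
    return esprit | new_approach(collapsing, lrt)
```

With made-up identifiers, `new_approach(as_pairs([("g1", "g2"), ("g3", "g4")]), as_pairs([("g2", "g1"), ("g5", "g6")]))` keeps only the pair g1/g2, and the meta-method adds whatever ESPRIT predicted on top of that.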
Fig. 3.
High-confidence inferred gene splits on the wheat genome. A, B and D refer to the three subgenomes of the hexaploid wheat genome. (a) Number of unambiguous predictions for each chromosome arm. (b) Number of ambiguous predictions (i.e. those for which there are more than two candidate fragments for a single juncture). Pairs of fragments are inferred separately for each chromosome arm of flow-sorted Triticum aestivum cv. Chinese Spring, except chromosome 3B, for which the analysis was performed on the entire chromosome.

References

    1. Altenhoff A.M. et al. (2013) Inferring hierarchical orthologous groups from orthologous gene pairs. PLoS One, 8, e53786.
    2. Altenhoff A.M. et al. (2014) The OMA orthology database in 2015: function predictions, better plant support, synteny view and other improvements. Nucleic Acids Res., 43, D1, D240–D249.
    3. Bredeson J.V. et al. (2016) Sequencing wild and cultivated cassava and related species reveals extensive interspecific hybridization and genetic diversity. Nat. Biotechnol., 34, 562–570.
    4. Camacho C. et al. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
    5. Choulet F. et al. (2014) Structural and functional partitioning of bread wheat chromosome 3B. Science, 345, 1249721.