Bioinformatics. 2023 Dec 1;39(12):btad692.
doi: 10.1093/bioinformatics/btad692.

EPIK: precise and scalable evolutionary placement with informative k-mers

Nikolai Romashchenko et al. Bioinformatics.

Abstract

Motivation: Phylogenetic placement enables phylogenetic analysis of massive collections of newly sequenced DNA when de novo tree inference is too unreliable or inefficient. Assuming that a high-quality reference tree is available, the idea is to find the correct placement of the new sequences in that tree. Recently, alignment-free approaches to phylogenetic placement have emerged, both to circumvent the need to align the new sequences and to avoid the calculations that typically follow the alignment step. A promising approach is based on the inference of k-mers that are potentially related to the reference sequences, also called phylo-k-mers. However, its use is limited by the time- and memory-consuming preprocessing of the reference data and by the large number of k-mers to consider.

Results: We propose a filtering method that selects informative phylo-k-mers based on mutual information, which can significantly improve the efficiency of placement at the cost of a small loss in placement accuracy. This method is implemented in IPK, a new tool for computing phylo-k-mers that significantly outperforms the previously available software. We also present EPIK, a new software tool for phylogenetic placement that supports filtered phylo-k-mer databases. Our experiments on real-world data show that EPIK is the fastest phylogenetic placement tool available when placing hundreds of thousands to millions of queries, while still providing accurate placements.
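
The abstract does not state how mutual information is computed for a phylo-k-mer, so the following is only a minimal sketch of the general idea of ranking k-mers by mutual information and keeping the most informative fraction. It assumes a uniform distribution over branches and a toy database that maps each k-mer to a hypothetical list of probabilities P(k-mer present | branch); the function names, the data layout, and the formula I(B; X) = H(X) - H(X | B) are illustrative assumptions, not IPK's actual definitions.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def mutual_information(probs_by_branch):
    """I(B; X) for a uniformly distributed branch B and a presence indicator X.

    probs_by_branch holds P(k-mer present | branch), one value per branch.
    This is an assumed, generic formulation, not the one used by IPK.
    """
    n = len(probs_by_branch)
    p_marginal = sum(probs_by_branch) / n
    h_conditional = sum(binary_entropy(p) for p in probs_by_branch) / n
    return binary_entropy(p_marginal) - h_conditional


def filter_kmers(database, keep_fraction):
    """Keep the top `keep_fraction` of k-mers, ranked by mutual information."""
    ranked = sorted(
        database,
        key=lambda kmer: mutual_information(database[kmer]),
        reverse=True,
    )
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return {kmer: database[kmer] for kmer in ranked[:n_keep]}


# Hypothetical toy database: 3 k-mers, 4 branches.
toy_db = {
    "ACGT": [0.9, 0.1, 0.1, 0.1],  # points mostly to one branch
    "CGTA": [0.5, 0.5, 0.5, 0.5],  # equally likely everywhere: uninformative
    "GTAC": [0.9, 0.9, 0.1, 0.1],  # separates two groups of branches
}
filtered = filter_kmers(toy_db, keep_fraction=2 / 3)
```

In this toy setting, a k-mer that is equally likely on every branch has zero mutual information and is the first to be dropped, which matches the intuition that such k-mers cannot help discriminate between placements.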

Availability and implementation: IPK and EPIK are freely available at https://github.com/phylo42/IPK and https://github.com/phylo42/EPIK. Both are implemented in C++ and Python and supported on Linux and macOS.

Conflict of interest statement

None declared.

Figures

Figure 1.
IPK and EPIK. (a) IPK takes an alignment of reference sequences and a reference tree as input. The alignment and tree are preprocessed; most notably, the tree is extended by adding ghost nodes and ghost branches. State probabilities are then computed at every ghost node and alignment site, and a phylo-k-mer database is constructed from them. Finally, mutual information values are computed and stored alongside the rest of the phylo-k-mer database for later use. (b) EPIK takes the phylo-k-mer database (with the corresponding reference tree and mutual information values) and query sequences as input. Depending on how EPIK is called, all or part of the phylo-k-mer database is loaded into memory. The k-mers of every query are searched in the database to place the query onto the reference phylogeny. The result is a phylogenetic placement file (.jplace-formatted), one for each input query file.
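
The lookup step described in panel (b) can be illustrated with a short sketch. It is not EPIK's implementation: it assumes a hypothetical in-memory database that maps each k-mer to per-branch log scores, and a hypothetical `default_score` applied when a k-mer has no entry for a branch. Branches are then ranked by the sum of the scores of the query's k-mers.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def place_query(query, database, branches, k, default_score):
    """Rank branches for one query by summed per-k-mer log scores.

    database: dict mapping k-mer -> {branch_id: log_score} (assumed layout).
    default_score: log score used when a k-mer has no entry for a branch
    (an assumption made for this sketch).
    """
    totals = {branch: 0.0 for branch in branches}
    for kmer in kmers(query, k):
        hits = database.get(kmer, {})
        for branch in branches:
            totals[branch] += hits.get(branch, default_score)
    # Best-scoring branch first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database with k = 3 and three branches.
toy_db = {
    "ACG": {0: -0.1, 1: -2.0},
    "CGT": {0: -0.2},
}
ranking = place_query("ACGT", toy_db, branches=[0, 1, 2], k=3, default_score=-3.0)
best_branch, best_score = ranking[0]
```

Loading only a filtered subset of the database, as the caption mentions, would correspond here to passing a smaller `database` dictionary; k-mers that were filtered out simply fall back to the default score.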
Figure 2.
Placement accuracy of different phylogenetic placement tools. Each point of a distribution represents the placements of multiple queries on a particular pruned tree. The y-axis shows the mean node distance for each such group of queries and their placements. (a) Metabarcoding marker datasets. (b) Full-genome viral datasets consisting of nucleotide sequences (HCV, HIV) and amino acid sequences (D140).
Figure 3.
Results of k-mer filtering with the mutual information filter and with random selection of k-mers. The x-axis shows the ratio of filtered to unfiltered database size. The y-axis shows the mean node distance of placements (averaged over 30 prunings) obtained with filtered databases. Vertical bars represent standard errors of the means. (a) Results for metabarcoding markers. (b) Results for viral genomes.
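
For readers unfamiliar with the error bars, the sketch below shows how a mean and its standard error are conventionally computed from per-pruning values. The input numbers are made up for illustration and are not taken from the paper; the helper name `mean_and_sem` is likewise hypothetical.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean of a sample.

    The standard error is the sample standard deviation divided by
    the square root of the sample size.
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Made-up node distances for four prunings (the figure uses 30).
example_distances = [2.0, 4.0, 4.0, 6.0]
mean, sem = mean_and_sem(example_distances)
```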
Figure 4.
Running time of different phylogenetic placement tools on four reference datasets. Preprocessing time is counted neither for alignment-based methods (dotted lines) nor for alignment-free ones (solid lines). Measurements are averaged over three runs.
