Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Aug 22;14(8):e0221068.
doi: 10.1371/journal.pone.0221068. eCollection 2019.

TreeCluster: Clustering biological sequences using phylogenetic trees

Affiliations

TreeCluster: Clustering biological sequences using phylogenetic trees

Metin Balaban et al. PLoS One. .

Abstract

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer scientist literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. When the phylogenetic tree is ultrametric, clustering is trivial.
For a threshold α, cut the tree at α2 height (A). When the tree is not ultrametric, it is not obvious how to cluster leaves (B). In both cases, a set of cut edges defines a clustering.
Fig 2
Fig 2. Comparing Greengenes and TreeCluster.
(A) Cluster diversity (Eq 1) for Greengenes and TreeCluster versus the number of OTUs. Cluster diversity is measured both with respect to hamming distance and tree-based distance. The threshold α is shown for all data points corresponding to Greengenes and for some points of TreeCluster. See S1 Fig for comparison to other TreeCluster modes. (B) Average-average (ν) and average-maximum (ξ) distance to the centroid for Greengenes and TreeCluster versus the number of clusters. TreeCluster centroids are computed using ancestral state reconstruction or using consensus.
Fig 3
Fig 3. Effectiveness of transmission clustering.
Effectiveness is measured as the average number of individuals infected by the selected 1,000 individuals. The horizontal axis depicts the expected time to begin ART (A), the expected degree (i.e., number of sexual contacts) for individuals in the contact network (B), and the number of clusters using various thresholds (C).
Fig 4
Fig 4. Alignment error for PASTA using the centroid and the mincut decompositions.
We show Sum of Pairs False Negative (SPFN) and Sum of Pairs False Psotive (SPFP) computed using FastSP [57] over two datasetes: the simulated RNASim dataset (10 replicates) and the biological HomFam dataset (19 largest families; all 20 largest, except “rhv” omitted due to the warning on the Pfam website). We show boxplots in addition to mean (red dot) and standard error (red error bars).
Fig 5
Fig 5. An example showing that number of minimal clusterings under a diameter threshold can be exponential of number of leaves.
When the threshold is 3.5, each unit has to be split into two clusters, and there are thus three equally-optimal ways of splitting. The minimum number of clusters is therefore 2n. The total number of distinct optimal solutions is 3n, whereas there are 3n leaves.
Fig 6
Fig 6. Execution times of Cluster Picker, HIV-TRACE, and TreeCluster in log-scale.
Execution times (in seconds) are shown for each tool for various values of n sequences, with 10 replicates for each n. The full dataset was obtained by downloading all HIV-1 subtype B pol sequences (HXB2 coordinates 2,253 to 3,549) from the Los Alamos National Laboratory (LANL) database. All programs were run on a CentOS 5.8 machine with an Intel Xeon X7560 2.27 GHz CPU.

References

    1. Goodrich JK, Di Rienzi SC, Poole AC, Koren O, Walters WA, Caporaso JG, et al. Conducting a microbiome study. Cell. 2014;158(2):250–262. 10.1016/j.cell.2014.06.037 - DOI - PMC - PubMed
    1. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. 10.1093/bioinformatics/btq461 - DOI - PubMed
    1. Schloss PD, Handelsman J. Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness. Applied and Environmental Microbiology. 2005;71(3):1501–1506. 10.1128/AEM.71.3.1501-1506.2005 - DOI - PMC - PubMed
    1. Ragonnet-Cronin M, Hodcroft E, Hué S, Fearnhill E, Delpech VC, Brown AJL, et al. Automated analysis of phylogenetic clusters. BMC bioinformatics. 2013;14:317 10.1186/1471-2105-14-317 - DOI - PMC - PubMed
    1. Kosakovsky Pond SL, Weaver S, Leigh Brown AJ, Wertheim JO. HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens. Molecular Biology and Evolution. 2018;35(7):1812–1819. 10.1093/molbev/msy016 - DOI - PMC - PubMed

Publication types