Mol Biol Evol. 2019 Jul 1;36(7):1580-1595.
doi: 10.1093/molbev/msz053.

Phylogenetic Clustering by Linear Integer Programming (PhyCLIP)

Alvin X Han et al. Mol Biol Evol.

Abstract

Subspecies nomenclature systems of pathogens are increasingly based on sequence data. The use of phylogenetics to identify and differentiate between clusters of genetically similar pathogens is particularly prevalent in virology, from the nomenclature of human papillomaviruses to highly pathogenic avian influenza (HPAI) H5Nx viruses. These nomenclature systems rely on absolute genetic distance thresholds to define the maximum genetic divergence tolerated between viruses designated as closely related. However, the phylogenetic clustering methods used in these nomenclature systems are limited by the arbitrariness of setting intra- and intercluster diversity thresholds. The lack of a consensus ground truth to define well-delineated, meaningful phylogenetic subpopulations amplifies the difficulties in identifying an informative distance threshold. Consequently, phylogenetic clustering often becomes an exploratory, ad hoc exercise. Phylogenetic Clustering by Linear Integer Programming (PhyCLIP) was developed to provide a statistically principled phylogenetic clustering framework that negates the need for an arbitrarily defined distance threshold. Using the pairwise patristic distance distributions of an input phylogeny, PhyCLIP parameterizes the intra- and intercluster divergence limits as statistical bounds in an integer linear programming model, which is subsequently optimized to cluster as many sequences as possible. When applied to the hemagglutinin phylogeny of HPAI H5Nx viruses, PhyCLIP was not only able to recapitulate the current WHO/OIE/FAO H5 nomenclature system but also further delineated informative higher-resolution clusters that capture geographically distinct subpopulations of viruses. PhyCLIP is pathogen-agnostic and can be generalized to a wide variety of research questions concerning the identification of biologically informative clusters in pathogen phylogenies.
PhyCLIP is freely available at http://github.com/alvinxhan/PhyCLIP, last accessed March 15, 2019.
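
The idea of turning a distribution of pairwise patristic distances into a statistical bound can be illustrated with a minimal Python sketch. It is not PhyCLIP's code: the toy tree, its branch lengths, and all function names are hypothetical, and the form of the limit used here (median of the subtree means plus γ times the median absolute deviation) is an assumption made for illustration, since the abstract does not state the formula.

```python
from itertools import combinations
from statistics import mean, median

# Hypothetical toy rooted tree: child -> (parent, branch length).
PARENT = {
    "n1": ("root", 0.02),
    "n2": ("root", 0.05),
    "a": ("n1", 0.01),
    "b": ("n1", 0.02),
    "c": ("n1", 0.01),
    "d": ("n2", 0.03),
    "e": ("n2", 0.04),
}


def path_to_root(node):
    """Map each ancestor of `node` (and the node itself) to its distance from `node`."""
    dist, out = 0.0, {node: 0.0}
    while node in PARENT:
        node, length = PARENT[node]
        dist += length
        out[node] = dist
    return out


def patristic(x, y):
    """Sum of branch lengths on the tree path between tips x and y."""
    px, py = path_to_root(x), path_to_root(y)
    # Among shared ancestors, the most recent one gives the smallest sum.
    return min(px[n] + py[n] for n in px if n in py)


def tips_under(node):
    """Sorted tip names in the subtree rooted at `node`."""
    children = [c for c, (p, _) in PARENT.items() if p == node]
    if not children:
        return [node]
    return sorted(t for c in children for t in tips_under(c))


def mean_pairwise(node):
    """Mean pairwise patristic distance among the tips of one subtree."""
    pairs = list(combinations(tips_under(node), 2))
    return mean([patristic(x, y) for x, y in pairs])


def within_cluster_limit(mus, gamma):
    """ASSUMED bound: median + gamma * median absolute deviation of the subtree means."""
    centre = median(mus)
    mad = median([abs(m - centre) for m in mus])
    return centre + gamma * mad


mus = [mean_pairwise(n) for n in ("root", "n1", "n2")]
limit = within_cluster_limit(mus, gamma=1)
```

Any robust location and spread estimator could stand in for the median and MAD here; the point of the sketch is only that the limit is derived from the tree's own distance distribution rather than set by hand.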

Keywords: influenza; molecular epidemiology; nomenclature; pathogen; phylogenetic clustering.

Figures

Fig. 1.
Schematics of PhyCLIP workflow and inference. (A) Workflow of PhyCLIP. Apart from an appropriately rooted phylogenetic tree, users only need to provide S, γ, and FDR as the inputs for PhyCLIP. After determining the within-cluster limit (WCL), PhyCLIP dissociates distantly related subtrees and outlying sequences that inflate the mean patristic distance (μi) of ancestral subtrees. The ILP model is then implemented and optimized to assign cluster membership to as many sequences as possible. If a prior of cluster membership is given, this is followed by a secondary optimization to retain as much of the prior membership as is statistically supportable within the limits of PhyCLIP. Post-ILP optimization clean-up steps are taken before yielding the finalized clustering output. (B) PhyCLIP considers the phylogeny as an ensemble of monophyletic subtrees, each defined by an internal node (circled numbers) subtended by a set of sequences (letters encapsulated within the shaded region of the same color as the circled number). In this example, only subtrees with at least 3 sequences (S = 3) are considered for clustering by the ILP model, but WCL is determined from μi of all subtrees, including the unshaded subtrees 6–8. Only subtrees whose μi falls within WCL are eligible for clustering. (C) Subtrees o and q, as well as sequence j9, are dissociated from subtree i as they are exceedingly distant from i. If sequences j1, j2, j4, and j5 are clustered under subtree h whereas j3 is clustered under subtree i by ILP optimization, a post-ILP clean-up step will remove j3 from cluster i.
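
The eligibility filter and the "cluster as many sequences as possible" objective described in this caption can be sketched on a toy instance. This is a heavily simplified stand-in, not PhyCLIP's model: it replaces the ILP solver with exhaustive search over 0/1 choices, assumes that selected subtrees must be disjoint, treats "within WCL" as μi ≤ WCL, and ignores the intercluster bounds and FDR entirely. The subtree names, tip sets, and μ values are hypothetical.

```python
from itertools import combinations

# Hypothetical candidate subtrees: name -> (set of tips, mean patristic distance).
SUBTREES = {
    "1": (frozenset("abcdefgh"), 0.30),  # too diverse for the limit below
    "2": (frozenset("abcd"), 0.08),
    "3": (frozenset("abc"), 0.05),
    "4": (frozenset("efgh"), 0.09),
    "5": (frozenset("gh"), 0.02),        # fewer than S tips
}


def eligible(subtrees, min_size, wcl):
    """Keep subtrees with at least `min_size` tips and a mean distance within `wcl`."""
    return {
        name: tips
        for name, (tips, mu) in subtrees.items()
        if len(tips) >= min_size and mu <= wcl  # ASSUMPTION: inclusive bound
    }


def best_selection(cands):
    """Brute-force the 0/1 selection that covers the most tips without overlap."""
    names = sorted(cands)
    best, best_cover = (), 0
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            sets = [cands[n] for n in combo]
            total = sum(len(s) for s in sets)
            # ASSUMPTION: a tip may belong to only one selected subtree.
            if len(frozenset().union(*sets)) != total:
                continue
            if total > best_cover:
                best, best_cover = combo, total
    return best, best_cover


cands = eligible(SUBTREES, min_size=3, wcl=0.10)
chosen, covered = best_selection(cands)
```

Exhaustive search is only workable for a handful of subtrees; formulating the same choice as an integer program is what lets a solver handle phylogenies of realistic size.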
Fig. 2.
Influence of parameters on the clustering properties of PhyCLIP in the WHO/OIE/FAO -update phylogeny. Panels A–F have the parameter-set combinations ordered according to minimum cluster size, FDR, and γ on the x-axis. The banded background and x-axis subscript numbering indicate the minimum cluster size of the parameter set. Marker color and size indicate the γ and the FDR, respectively, of the parameter set, as shown in the legend in panel B. (A) Total number of clusters. (B) Percentage of sequences clustered. (C) Grand mean of the pairwise patristic distance distribution. (D) Mean of the intercluster distance to all other clusters. (E) Mean within-cluster geographic distance calculated in Vincenty miles. (F) Mean within-cluster SD in collection dates.
Fig. 3.
Phylogeny of the Clade 2.1x viruses circulating in Indonesia. The WHO/OIE/FAO H5 nomenclature is annotated in black. PhyCLIP’s cluster designation is indicated in blue, corresponding to tip color. PhyCLIP’s supercluster topology is exemplified by Cluster A. The source population of the supercluster is annotated as A in pink, with tips colored yellow. The divergent descendant clusters are annotated as A.1, A.2, and A.3, respectively. The letter A here is shorthand for its nomenclature address, 1.4.1.5.5.4.2. This nomenclature address indicates that supercluster A is the second descendant of cluster 1.4.1.5.5.4 (indicated in light purple), which in turn is the fourth descendant of the source supercluster 1.4.1.5.5, indicated in red. See “Materials and Methods” section for full explanation of nomenclature addresses.
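
The way a nomenclature address encodes ancestry in this caption can be made concrete with a short Python sketch. The function names are hypothetical, and reading every dot-separated field as a descendant index is an assumption extrapolated from the two levels the caption spells out; the full rules are in the paper's “Materials and Methods” section.

```python
def parse_address(address):
    """Split an address into (parent address, descendant index under that parent)."""
    parts = address.split(".")
    parent = ".".join(parts[:-1]) or None  # a single-field address has no parent
    return parent, int(parts[-1])


def lineage(address):
    """List the address followed by each of its ancestors, nearest first."""
    parts = address.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


# Supercluster A from the caption.
parent_of_a, index_of_a = parse_address("1.4.1.5.5.4.2")
grandparent_of_a, index_of_parent = parse_address(parent_of_a)
```

Here `parent_of_a` is `1.4.1.5.5.4` with index 2, and its own parent is `1.4.1.5.5` with index 4, matching the "second descendant" and "fourth descendant" relationships given in the caption.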
Fig. 4.
PhyCLIP’s delineation of WHO/OIE/FAO demarcated clades 2.3.2.1a (A) and 2.3.2.1c (B). Tips are colored according to PhyCLIP’s cluster designation. The tips colored in red in B are viruses that were designated as outliers by PhyCLIP’s outlier detection. Countries represented by single viruses in the cluster are indicated with an asterisk.
