Viruses. 2022 Apr 8;14(4):774.
doi: 10.3390/v14040774.

An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology

Colin Young et al. Viruses. 2022.

Abstract

The use of viral sequence data to inform public health intervention has become increasingly common in epidemiology. Such methods typically rely on multiple sequence alignments and phylogenies estimated from the sequence data. Like all estimation techniques, they are error-prone, yet the impact of such errors on downstream epidemiological inferences is poorly understood. To address this, we executed multiple commonly used viral phylogenetic analysis workflows on simulated viral sequence data, modeling Human Immunodeficiency Virus (HIV), Hepatitis C Virus (HCV), and Ebolavirus, and we computed multiple measures of accuracy, motivated by transmission-clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was the fastest by orders of magnitude, and when the other tools were used to optimize branch lengths along a fixed FastTree 2 topology, the resulting phylogenies had accuracies indistinguishable from those of their original counterparts, at a fraction of the runtime.
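
As an illustration of the pairwise-distance accuracy measures mentioned above, the following is a minimal sketch of how a mean squared error and a Pearson correlation (the statistic underlying a Mantel test) could be computed between a true and an estimated pairwise distance matrix. The function names and the toy matrices are assumptions made for this example; the record does not describe the authors' implementation, and the permutation step a full Mantel test uses for significance is omitted.

```python
from itertools import combinations
from math import sqrt


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a symmetric distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def mean_squared_error(true_d, est_d):
    """Mean squared difference between corresponding pairwise distances."""
    t, e = upper_triangle(true_d), upper_triangle(est_d)
    return sum((a - b) ** 2 for a, b in zip(t, e)) / len(t)


def pearson_matrix_correlation(true_d, est_d):
    """Pearson correlation of off-diagonal entries (Mantel statistic, no permutations)."""
    t, e = upper_triangle(true_d), upper_triangle(est_d)
    mean_t, mean_e = sum(t) / len(t), sum(e) / len(e)
    cov = sum((a - mean_t) * (b - mean_e) for a, b in zip(t, e))
    var_t = sum((a - mean_t) ** 2 for a in t)
    var_e = sum((b - mean_e) ** 2 for b in e)
    return cov / sqrt(var_t * var_e)


# Hypothetical toy data: the estimate doubles every true distance.
true_d = [[0.0, 0.1, 0.3], [0.1, 0.0, 0.25], [0.3, 0.25, 0.0]]
est_d = [[2 * x for x in row] for row in true_d]

mse = mean_squared_error(true_d, est_d)
corr = pearson_matrix_correlation(true_d, est_d)
```

In this toy case the correlation is essentially 1 even though the mean squared error is nonzero, which shows why the two measures are complementary: correlation is insensitive to a uniform rescaling of distances, whereas mean squared error is not.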

Keywords: bioinformatics; epidemiology; multiple sequence alignment; phylogenetics.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Kernel density estimates of the branch length distributions for the Ebola, HIV, and HCV true phylogenies.
Figure 2
Metrics of sequence alignment accuracy for MAFFT, MUSCLE, and Clustal Omega on 10 simulated replicate datasets of HIV, HCV, and Ebola. Violin plots are shown for Mean Squared Error, Spearman/Pearson Mantel Correlation, SP score, TC score, and Compression Factor.
Figure 3
Metrics of phylogenetic inference accuracy for FastTree, IQ-TREE (GTR), IQ-TREE (MFP), RAxML-NG, and PhyML on 10 simulated replicate datasets of HIV, HCV, and Ebola. Phylogenies which result from optimizing branch lengths along FastTree topology are also included. Violin plots are shown for URF, WRF, Pearson Mantel Correlation, and Mean Squared Error. Violin plots showing Spearman Mantel Correlation can be found in Supplementary Figure S1.
Figure 4
Heat maps comparing the accuracy of phylogenies inferred with FastTree, IQ-TREE (GTR), IQ-TREE (MFP), RAxML-NG, and PhyML from the MAFFT, Clustal Omega, MUSCLE, and true MSAs. Each value of Unweighted Robinson–Foulds (URF), Weighted Robinson–Foulds (WRF), Pearson Mantel Correlation, and Mean Squared Error shown is the average of 10 simulation replicates. Heatmaps showing Spearman Mantel Correlation can be found in Supplementary Figure S2.
Figure 5
Heat maps comparing the accuracy of FastTree topologies inferred from the MAFFT, Clustal Omega, MUSCLE, and true multiple sequence alignments with branch lengths optimized by IQ-TREE (GTR), IQ-TREE (MFP), RAxML-NG, and PhyML. Each value of Unweighted Robinson–Foulds (URF), Weighted Robinson–Foulds (WRF), Pearson Mantel Correlation, and Mean Squared Error shown is the average of 10 simulation replicates. Heatmaps showing Spearman Mantel Correlation can be found in Supplementary Figure S3.
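
The Unweighted Robinson–Foulds (URF) distance named in the captions above counts the non-trivial bipartitions (splits) present in one tree but not the other. The following is a minimal sketch of that idea, assuming trees are given as nested tuples of leaf names and compared as unrooted topologies. This representation and the function names are illustrative assumptions, not the tooling used in the study.

```python
def leaf_set(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect non-trivial splits, each stored as the side lacking a fixed anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf so each split has one canonical form
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if anchor in clade else clade
        # Skip trivial splits that separate fewer than two leaves from the rest.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def unweighted_rf(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-taxon trees differing in one internal grouping.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
distance = unweighted_rf(true_tree, inferred_tree)
```

Because each split is stored as the side that excludes the anchor leaf, two different rootings of the same unrooted topology yield a distance of zero. The weighted variant (WRF) additionally accounts for branch lengths and is not covered by this sketch.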

