Evol Med Public Health. 2014 Jun 9;2014(1):96-108.
doi: 10.1093/emph/eou018.

Phylogenetic tree shapes resolve disease transmission patterns

Caroline Colijn et al. Evol Med Public Health.

Abstract

Background and objectives: Whole-genome sequencing is becoming popular as a tool for understanding outbreaks of communicable diseases, with phylogenetic trees being used to identify individual transmission events or to characterize outbreak-level overall transmission dynamics. Existing methods to infer transmission dynamics from sequence data rely on well-characterized infectious periods, epidemiological and clinical metadata which may not always be available, and typically require computationally intensive analysis focusing on the branch lengths in phylogenetic trees. We sought to determine whether the topological structures of phylogenetic trees contain signatures of the transmission patterns underlying an outbreak.

Methodology: We use simulated outbreaks to train and then test computational classifiers. We test the method on data from two real-world outbreaks.

Results: We show that different transmission patterns result in quantitatively different phylogenetic tree shapes. We describe topological features that summarize a phylogeny's structure and find that computational classifiers based on these are capable of predicting an outbreak's transmission dynamics. The method is robust to variations in the transmission parameters and network types, and recapitulates known epidemiology of previously characterized real-world outbreaks.

Conclusions and implications: There are simple structural properties of phylogenetic trees which, when combined, can distinguish communicable disease outbreaks with a super-spreader, homogeneous transmission and chains of transmission. This is possible using genome data alone, and can be done during an outbreak. We discuss the implications for management of outbreaks.

Keywords: computational modelling; evolutionary epidemiology; genomic epidemiology; machine learning.
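The Results describe topological features that summarize a phylogeny's structure, but the abstract does not name them. As a hedged illustration only, the sketch below computes three standard tree-shape statistics (number of cherries, Colless imbalance, and Sackin index) for a rooted binary tree. The nested-tuple tree encoding and the function name `tree_shape_features` are assumptions made for this example, and these statistics are not claimed to be the features the authors used.

```python
def tree_shape_features(tree):
    """Summarize the shape of a rooted binary tree.

    Assumed encoding (hypothetical, for illustration): an internal node is a
    2-tuple (left, right); anything that is not a tuple is a leaf.
    """
    def walk(node, depth):
        # Returns (leaves, cherries, colless, sum of leaf depths) for a subtree.
        if not isinstance(node, tuple):
            return 1, 0, 0, depth
        left, right = node
        l_n, l_ch, l_co, l_sa = walk(left, depth + 1)
        r_n, r_ch, r_co, r_sa = walk(right, depth + 1)
        # A cherry is an internal node whose two children are both leaves.
        cherry = 1 if (l_n == 1 and r_n == 1) else 0
        # Colless imbalance adds |left leaves - right leaves| at every node.
        imbalance = abs(l_n - r_n)
        return (l_n + r_n,
                l_ch + r_ch + cherry,
                l_co + r_co + imbalance,
                l_sa + r_sa)

    n_leaves, cherries, colless, sackin = walk(tree, 0)
    return {"n_leaves": n_leaves, "cherries": cherries,
            "colless": colless, "sackin": sackin}


# A ladder-like (caterpillar) tree and a balanced tree on four tips.
caterpillar = ((("a", "b"), "c"), "d")
balanced = (("a", "b"), ("c", "d"))
print(tree_shape_features(caterpillar))
print(tree_shape_features(balanced))
```

On these four-tip examples the caterpillar tree has one cherry, Colless imbalance 3, and Sackin index 9, while the balanced tree has two cherries, Colless imbalance 0, and Sackin index 8. Feature vectors of this kind are what a classifier would take as input.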

Figures

Figure 1. Schematic illustration of different kinds of transmission networks. The index case is marked in grey.
Figure 2. Distribution of simple summary measures of tree topology.
Figure 3. Box plots of the features used to summarize the shapes of phylogenies.
Figure 4. (a) Sensitivity of the SVM classification increases as the variability in the number of secondary cases in the outbreak increases. Variability is quantified as the ratio of the standard deviation to the mean of the number of secondary cases caused by an infectious case. Sensitivity is the proportion of simulated outbreaks with the corresponding variability that were classed as super-spreader outbreaks; the solid line shows the mean sensitivity over the 10 SVMs produced by cross-validation, and the dotted lines show the mean ± the standard deviation. (b and c) ROC curves for the SVM classifier based on the 11 summary metrics describing tree shape. An ROC curve is a visual way to assess a classifier’s quality: a perfect classifier obtains all the true positives with no false positives, giving an AUC of 1. An imperfect classifier faces a trade-off, and can attain a sensitivity (true positive rate) of 1 at the cost of having a false-positive rate of 1 (top right corner of the plot). The ROC curve illustrates the shape of this trade-off; the higher the AUC, the higher the quality of the classifier. Guessing yields an AUC of 0.5. In b, different lines correspond to the different groups of simulations in the SVM sensitivity analysis. Panel c shows the SVM classifier’s performance when only the earliest outbreak isolates are sampled. Performance is poor with 10 isolates (black line) and better with 20 (blue line).
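The two quantities defined in the Figure 4 caption can be sketched in a few lines. The example below is a minimal sketch under stated assumptions: it computes the variability measure (standard deviation divided by mean of secondary-case counts) and the AUC of a classifier from its scores. The function names are hypothetical, the use of the population standard deviation is an assumption (the caption does not say which estimator was used), and the AUC is computed through the standard pairwise-ranking identity rather than by any procedure attributed to the authors.

```python
from statistics import mean, pstdev


def coefficient_of_variation(secondary_cases):
    """Ratio of standard deviation to mean of secondary-case counts.

    Assumption: population standard deviation (pstdev); the caption does
    not specify population versus sample.
    """
    m = mean(secondary_cases)
    if m == 0:
        raise ValueError("mean number of secondary cases is zero")
    return pstdev(secondary_cases) / m


def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Uses the identity that AUC equals the probability that a randomly chosen
    positive scores higher than a randomly chosen negative, with ties
    counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Homogeneous transmission: every case infects two others, so variability is 0.
print(coefficient_of_variation([2, 2, 2, 2]))
# One case causes all onward infections: same mean, much higher variability.
print(coefficient_of_variation([0, 0, 0, 8]))
# Made-up scores for illustration; positives are ranked above all negatives.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

A classifier that gives every outbreak the same score yields an AUC of 0.5 under this definition, which matches the caption's statement that guessing gives 0.5.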
