BMC Bioinformatics. 2011 Jul 26;12:302.
doi: 10.1186/1471-2105-12-302.

Bayesian semi-supervised classification of bacterial samples using MLST databases

Lu Cheng et al. BMC Bioinformatics.

Abstract

Background: Worldwide efforts to sample and characterize molecular variation within a large number of human and animal pathogens have led to the emergence of multi-locus sequence typing (MLST) databases as an important tool for studying the epidemiology and evolution of pathogens. Many of these databases currently hold several thousand multi-locus DNA sequence types (STs), enriched with metadata on traits of the isolates such as serotype, antibiotic resistance, and host organism. Curators of the databases can thus divide the pathogen populations into subsets representing different evolutionary lineages, geographically associated groups, or other subpopulations, which are defined in terms of the molecular similarities and dissimilarities found within a database. When combined with the existing metadata, such subsets may provide invaluable information for assessing the position of a new set of isolates in relation to the whole pathogen population.

Results: To enable users of MLST schemes to query the databases with sets of new bacterial isolates and to automatically analyze their relation to existing curated sequences, we introduce here a Bayesian model-based method for semi-supervised classification of MLST data. Our method can use an MLST database as a training set and simultaneously assign any set of query sequences to the previously discovered lineages or populations, while also allowing some or all of these sequences to form previously undiscovered, genetically distinct groups. The tool provides probabilistic quantification of the classification uncertainty and is computationally highly efficient, thus enabling rapid analyses of large databases and sets of query sequences. This efficiency is a necessary prerequisite for automated access through the MLST web interface. We demonstrate the versatility of our approach by analyzing both real and synthesized data from MLST databases. The introduced method for semi-supervised classification of sets of query STs is freely available for Windows, Mac OS X and Linux operating systems in the BAPS 5.4 software, which is downloadable at http://web.abo.fi/fak/mnf/mate/jc/software/baps.html. The query functionality is also directly available for the Staphylococcus aureus database at http://www.mlst.net and will shortly be available for other species databases hosted at this web portal.
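
The idea of assigning a query ST either to an existing group or to a new one, with a probability attached, can be illustrated with a deliberately simplified sketch. It is not the BAPS implementation: it classifies one query at a time rather than a whole set simultaneously, and it assumes integer allele numbers per locus, a symmetric Dirichlet prior (`alpha`), a fixed number of possible alleles per locus (`n_alleles`), and a uniform prior over groups. All function and variable names here are hypothetical.

```python
from collections import Counter
from math import exp, log

NEW_GROUP = "new group"  # hypothetical label for a previously unseen group


def log_predictive(profile, members, n_alleles, alpha=1.0):
    """Log posterior-predictive probability of an allele profile given the
    profiles already in a group (Dirichlet-multinomial, loci independent)."""
    total = 0.0
    for locus, allele in enumerate(profile):
        counts = Counter(m[locus] for m in members)
        total += log((counts[allele] + alpha) / (len(members) + alpha * n_alleles))
    return total


def classify(profile, groups, n_alleles, alpha=1.0):
    """Return normalized probabilities over the training groups plus an
    empty 'new group', assuming a uniform prior over all candidates."""
    scores = {name: log_predictive(profile, members, n_alleles, alpha)
              for name, members in groups.items()}
    # An empty group gives probability 1 / n_alleles at every locus.
    scores[NEW_GROUP] = log_predictive(profile, [], n_alleles, alpha)
    top = max(scores.values())
    weights = {name: exp(s - top) for name, s in scores.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# Toy training data (invented): two groups, three loci per profile.
groups = {
    "A": [(1, 1, 1), (1, 1, 2), (1, 1, 1)],
    "B": [(5, 5, 5), (5, 6, 5)],
}

for query in [(1, 1, 1), (9, 9, 9)]:
    posterior = classify(query, groups, n_alleles=10)
    print(query, max(posterior, key=posterior.get))
```

In this toy setting a query that matches a training group is assigned to it, while a query sharing no alleles with any group is more probable under the empty group. The normalized values give a rough analogue of the classification uncertainty mentioned above.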

Conclusions: We have introduced a model-based tool for automated semi-supervised classification of new pathogen samples that can be integrated into the web interface of the MLST databases. In particular, when combined with the existing metadata, the semi-supervised labeling may provide invaluable information for assessing the position of a new set of query strains in relation to the particular pathogen population represented by the curated database. Such information will be useful for both clinical and basic research purposes.

Figures

Figure 1. Example of a semi-supervised classification of query STs from the S. aureus database in the second experiment, based on an annotated NJ tree. The STs marked in red and green represent the query STs labeled as the two newly detected groups, and the uncolored STs represent the remaining training data groups.

Figure 2. Example of a semi-supervised classification of query STs from the B. cereus database in the third experiment, based on an annotated NJ tree. The STs marked in grey are the newly detected groups. The uncolored STs represent the STs in the training data groups, and the remaining colored STs are the 25 query STs that were correctly labeled with their respective groups.
