Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015;16 Suppl 6(Suppl 6):S2.
doi: 10.1186/1471-2105-16-S6-S2. Epub 2015 Apr 17.

Probabilistic topic modeling for the analysis and classification of genomic sequences

Probabilistic topic modeling for the analysis and classification of genomic sequences

Massimo La Rosa et al. BMC Bioinformatics. 2015.

Abstract

Background: Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies are focusing on the so-called barcode genes, representing a well defined region of the whole genome. Recently, alignment-free techniques are gaining more importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequences clustering and classification is proposed. The method is based on k-mers representation and text mining techniques.

Methods: The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied on DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences.

Results and conclusions: We performed classification of over 7000 16S DNA barcode sequences taken from Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and Support Vector Machine (SVM) classification algorithm in a extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased.

PubMed Disclaimer

Figures

Figure 1
Figure 1
k-mers decomposition. By means of a sliding windows of fixed size k, k = 8 in this case, it is possible to extract all the overlapping k-mers, representing the words, from a gene sequence.
Figure 2
Figure 2
Training workflow. From the sequences of the input DNA dataset are extracted the words through the k-mer decomposition; then using the Latent Dirichlet Allocation (LDA) algorithm a probabilistic topic model is learned. The model provides the topic distribution of the input dataset, retrieved from the Ribosomal Database Project (RDP) online repository, and the most probable topics are labeled with a taxonomic rank using a majority voting scheme.
Figure 3
Figure 3
Testing workflow. From the test sequences are extracted the words through the k-mer decomposition; then, by means of fitted topic models learned during the training phase, the topic distributions of test sequences are computed. Finally each sequence is assigned to its most probable topic, and, since topics have been labeled during the training phase, the predicted rank for the test sequences is obtained.
Figure 4
Figure 4
Precision scores at phylum level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at phylum taxonomic rank. The dashed line for the RDP curve represents extrapolated values.
Figure 5
Figure 5
Precision scores at class level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at class taxonomic rank. The dashed line for the RDP curve represents extrapolated values.
Figure 6
Figure 6
Precision scores at order level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at order taxonomic rank. The dashed line for the RDP curve represents extrapolated values.
Figure 7
Figure 7
Precision scores at family level. Precision scores, defined as true positives/(true positives + false positives), trends as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at family taxonomic rank. The dashed line for the RDP curve represents extrapolated values.

References

    1. Drancourt M, Raoult D. Sequence-based identification of new bacteria: a proposition for creation of an orphan bacterium repository. J Clin Microbiol. 2005;43(9):4311–4315. doi: 10.1128/JCM.43.9.4311-4315.2005. - DOI - PMC - PubMed
    1. Gaston KJ. Global patterns in biodiversity. Nature. 2000;405(6783):220–7. doi: 10.1038/35012228. - DOI - PubMed
    1. Drancourt M, Berger P, Raoult D. Systematic 16S rRNA Gene Sequencing of Atypical Clinical Isolates Identified 27 New Bacterial Species Associated with Humans. Journal of Clinical Microbiology. 2004;42(5):2197–2202. doi: 10.1128/JCM.42.5.2197-2202.2004. - DOI - PMC - PubMed
    1. Hebert PDN, Ratnasingham S, DeWaard JR. Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proceedings of the Royal Society Series B, Biological sciences. 2003;270(Suppl 1):96–99. - PMC - PubMed
    1. Nei M, Kumar MD. Molecular Evolution and Phylogenetics. Oxford University Press, New York; 2000.