2017 Apr 11;18(1):208.
doi: 10.1186/s12859-017-1602-3.

A machine learning approach for viral genome classification

Mohamed Amine Remita et al. BMC Bioinformatics.

Abstract

Background: Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific, well-studied virus families. Viral comparative genomic studies could therefore benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families.

Results: Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows competitive performance compared to well-known HIV-1-specific classifiers (REGA and COMET) on whole genomes and pol fragments.
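
The in silico digestion step described above can be illustrated with a minimal Python sketch. It is not the CASTOR implementation. The three enzymes are an illustrative selection (the caption of Fig. 2 mentions 172 restriction enzyme cuts), the sequence is treated as linear, and the function names `digest` and `features` are hypothetical. The abstract does not name the two metrics; this sketch assumes a cut count (the figures refer to CUT) and the root mean square of fragment lengths.

```python
import math
import re

# Illustrative enzyme table: name -> (recognition site, cut offset inside the site).
# The sites are the standard ones for these enzymes; the choice of enzymes is an assumption.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def digest(sequence, site, offset):
    """Cut a linear sequence at every occurrence of `site`; return fragment lengths."""
    seq = sequence.upper()
    # A lookahead pattern also reports overlapping occurrences of the site.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def features(sequence):
    """Build a feature vector with two values per enzyme.

    Assumed metrics: number of cuts, and root mean square of fragment lengths.
    Enzymes are visited in sorted name order so vectors are comparable.
    """
    vector = []
    for name in sorted(ENZYMES):
        site, offset = ENZYMES[name]
        fragments = digest(sequence, site, offset)
        cut_count = len(fragments) - 1
        rms = math.sqrt(sum(n * n for n in fragments) / len(fragments))
        vector.extend([cut_count, rms])
    return vector


if __name__ == "__main__":
    toy_genome = "AAGAATTCTTGGATCCAA"  # made-up sequence, not real viral data
    print(features(toy_genome))
```

Vectors built this way, one per genome, could then be passed to any standard supervised classifier together with the known class labels.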

Conclusion: The performance, genericity and robustness of CASTOR could enable novel and accurate large-scale virus studies. The CASTOR web platform provides open-access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca.

Keywords: Prediction; Sequence classification; Virus classification.

Figures

Fig. 1
Overview of CASTOR kernel architecture. The kernel is composed of two main units (classifier construction and prediction). White rectangles represent input and output data; grey and curved rectangles represent processes. TS and VS are training set and validation set, respectively.
Fig. 2
Class cohesion of three virus datasets. The four columns illustrate the separability and compactness of three datasets of complete virus genomes based on 172 restriction enzyme cuts. The first column shows heatmaps of CUT clustered by x-axis. The samples in the y-axis are grouped by studied classes followed by intra-class clusterings. The second column shows MDS of the CUT distances between samples. The third and fourth columns represent, respectively, the Cohesion and Silhouette indexes of the classes. a Classes in HPV are Alpha species, Beta and Gamma genera. b Classes in HBV are A-H genotypes. c Classes in HIV-1 are M pure subtypes and CRFs.
Fig. 3
Learning algorithm evaluation on five datasets. This figure illustrates the F-measure distribution (boxplot) of seven learning algorithms on the prediction of a HPV genera, b HPV Alpha species, c HBV genotypes, d HIV-1 M subtypes with complete genomes, e HIV-1 M subtypes with pol fragments. HPV and HBV datasets are complete genomes. The number below each boxplot corresponds to the statistically discriminative rank of the algorithms. The ranking is performed with paired Student’s t test. μ and σ are the mean and the standard deviation of the overall F-measures, respectively. p is the p-value of the statistical significance of the weighted F-measure mean differences among the algorithms, computed with the Wilcoxon/Kruskal-Wallis test.
Fig. 4
Performance of CASTOR with COMET and REGA predictors on HIV-1 datasets. Panels a and b show the percentage of correct classifications for HIV-1 complete genomes and HIV-1 pol fragments, respectively. The number of instances and the associated classes for each sampling are presented above the panels. Complete sampling corresponds to 10% of Los Alamos HIV data selected randomly. In specific subtypes sampling, the predictors are assessed against their trained classes. In common subtypes sampling, the predictors are assessed against the intersection of the classes of the three trained predictors.
