PLoS Comput Biol. 2012;8(8):e1002657.
doi: 10.1371/journal.pcbi.1002657. Epub 2012 Aug 23.

Artificial neural networks trained to detect viral and phage structural proteins

Victor Seguritan et al. PLoS Comput Biol. 2012.

Abstract

Phages play critical roles in the survival and pathogenicity of their hosts, via lysogenic conversion factors, and in nutrient redistribution, via cell lysis. Analyses of phage- and viral-encoded genes in environmental samples provide insights into the physiological impact of viruses on microbial communities and human health. However, phage ORFs are extremely diverse: over 70% of them are dissimilar to any genes with annotated functions in GenBank. Better identification of viruses would also aid in the detection and diagnosis of disease, in vaccine development, and more generally in understanding the physiological potential of any environment. In contrast to enzymes, viral structural protein function can be much more challenging to detect from sequence data because of low sequence conservation, few known conserved catalytic sites or sequence domains, and relatively limited experimental data. We have designed a method of predicting phage structural protein sequences that uses Artificial Neural Networks (ANNs). First, we trained ANNs to classify viral structural proteins using amino acid frequency; these correctly classify a large fraction of test cases with a high degree of specificity and sensitivity. Subsequently, we added estimates of protein isoelectric points as a feature to ANNs that classify specialized families of proteins, namely major capsid and tail proteins. As expected, these more specialized ANNs are more accurate than the structural ANNs. To experimentally validate the ANN predictions, several ORFs that the ANNs predicted to be structural proteins, but that have no significant similarity to known sequences, were examined by transmission electron microscopy. Some of these self-assembled into structures strongly resembling virion structures. Thus, our ANNs are new tools for identifying phage and potential prophage structural proteins that are difficult or impossible to detect by other bioinformatic analyses. The networks will be valuable when sequence is available but in vitro propagation of the phage may not be practical or possible.
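The amino acid frequency feature mentioned in the abstract can be illustrated with a minimal sketch. The function name `percent_composition`, the fixed residue ordering, and the decision to ignore non-standard residue letters are all assumptions made for illustration; the abstract does not describe how the authors handled these details.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in a fixed order, so that every
# feature vector has the same layout (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_composition(sequence: str) -> List[float]:
    """Return the percentage of each standard amino acid in `sequence`.

    Letters outside the 20 standard residues (e.g. X, B, Z, *) are
    ignored. This is a choice of this sketch, not a documented one.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]
```

Each protein then becomes a 20-value vector that sums to 100, regardless of sequence length, which is what allows proteins of different sizes to share one input layer.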

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Overview of neural network training and evaluation.
Protein sequences were downloaded and unwanted sequences were removed. The percent compositions of amino acids in all protein sequences were calculated and distributed into training, validation, and test sets. The network architecture and an appropriately sized validation set were determined ad hoc by training many networks. We selected those ANNs that correctly classified the highest number of test cases based on 10-fold cross validation. Voting neural networks were generated from 160-fold cross validation and the appropriate number of networks to use in an ensemble was determined by the ensemble with the best accuracy. The voting ensemble that correctly classified the most test cases was used to determine the overall correct classification rate, specificity, and sensitivity.
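The voting step in this caption can be sketched as a simple majority vote over binary predictions, with the ensemble size chosen by accuracy. This is only an illustration under stated assumptions: the function names, the 0/1 label encoding, and the idea of ranking networks best-first before taking the top k are hypothetical, and the caption does not specify how the authors implemented voting.

```python
def majority_vote(votes):
    """Combine binary votes (1 = structural, 0 = non-structural).

    An odd number of voters is required so that ties cannot occur.
    """
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    return 1 if 2 * sum(votes) > len(votes) else 0


def ensemble_accuracy(network_predictions, labels):
    """Fraction of cases the ensemble classifies correctly.

    `network_predictions` holds one prediction list per network,
    each aligned with `labels`.
    """
    correct = 0
    for i, label in enumerate(labels):
        votes = [preds[i] for preds in network_predictions]
        if majority_vote(votes) == label:
            correct += 1
    return correct / len(labels)


def best_ensemble_size(ranked_predictions, labels, sizes):
    """Pick the size k whose top-k ensemble is most accurate.

    `ranked_predictions` is assumed to be ordered best network first;
    on a tie, the first size listed in `sizes` wins.
    """
    return max(
        sizes,
        key=lambda k: ensemble_accuracy(ranked_predictions[:k], labels),
    )
```

In this toy setup, three imperfect networks that err on different cases can outvote each other's mistakes, which is the usual motivation for an ensemble over a single best network.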
Figure 2. Taxonomic information of positive structural protein sequences.
The large pie chart in the center of the figure shows the number (in parentheses) and percentage of sequences in the training set that came from virus genomes. Training sequences came from one of the following sources: phages, viruses, other (i.e., prophage genes from bacterial chromosomes; see text), or obsolete. Protein sequences for which GenBank records are no longer available are labeled as “Obsolete”. Adjacent pie charts with colors that are similar to slices in the central chart show the distributions of sequences based on taxonomic information. Sequences that do not have taxonomic data available were labeled as “unclassified”.
Figure 3. Accuracies of networks trained with different architectures and distributions of training and validation sets.
The correct classification rates of Structural, MCP, and Tail ANNs are shown on the Y-axes in all panels. Red diamonds represent mean correct classification frequencies, and the maximum mean frequencies are labeled as red percentage values. In the left set of boxplots, a single number on the x-axis represents the number of neurons in 1 hidden layer; two numbers delimited by an ‘x’ indicate the number of neurons in hidden layers 1 and 2. The left side of Panel A summarizes the 1- and 2-layer architectures of the top 4 networks based on correct classifications, whereas Panels B and C show all network architectures that were tested. In the right set of boxplots, the pairs of numbers on the x-axis represent the percentages of non-test set sequences that were split into the training and validation sets.
Figure 4. Performance of Structural ANNs.
Panel A summarizes the performance of Structural ANN ensembles. Each ensemble consists of an odd number of networks ranging between 5 and 141 voting ANNs, with the exception of the single best performing ANN and the ensemble containing all 160 ANNs. Performance is measured by the accuracy, specificity, and sensitivity of the networks, which were presented with the amino acid frequencies of curated phage sequences. The sequences were best classified by the top 5 voting ANNs, as judged by the mean accuracy, specificity, and sensitivity values that appear above the red striped columns. The performance of an ensemble of the top 5 voting ANNs was also assessed with the curated sequences that were used to test the MCP and Tail ANNs. Histograms in panel B show the accuracy of the Structural ANNs in classifying capsid from non-capsid and tail from non-tail test sequences that were also used to test MCP and Tail ANNs.
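The three performance measures named in this caption follow the standard definitions based on confusion-matrix counts. The sketch below assumes binary labels (1 = structural, 0 = non-structural) and a hypothetical function name; it shows the usual formulas, not the authors' evaluation code.

```python
def classification_metrics(predicted, actual):
    """Compute accuracy, sensitivity, and specificity for binary labels.

    sensitivity = TP / (TP + FN)   (true positive rate)
    specificity = TN / (TN + FP)   (true negative rate)
    accuracy    = (TP + TN) / all cases
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Reporting sensitivity and specificity alongside accuracy matters when positive and negative examples are unbalanced, because accuracy alone can look high for a classifier that mostly predicts the majority class.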
Figure 5. Performance of major Capsid and Tail ANNs.
The accuracy, sensitivity, and specificity of Capsid and Tail ANNs are based on the classifications of test cases from the RefSeq database. All training sets contained the amino acid percent composition (PC) of positive and negative examples. Capsid ANNs trained without and with isoelectric point (pI) values are shown in the left and right histograms of panel A. Tail Protein ANNs trained without and with isoelectric point values are shown in the left and right histograms of panel B. Ratios of positive to negative examples are shown on the X-axis in ascending order. Error bars represent standard error.
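One common way to estimate an isoelectric point from sequence alone is to find the pH at which the Henderson-Hasselbalch net charge crosses zero. The sketch below does this by bisection. The pKa values are illustrative assumptions, not taken from the source, which does not state how the pI estimates were obtained; published pKa sets differ, and the resulting pI shifts accordingly.

```python
# Illustrative side-chain and terminal pKa values (assumed for this
# sketch; the source does not say which set, if any, was used).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence, ph):
    """Approximate net charge of a peptide at a given pH."""
    seq = sequence.upper()
    # Basic groups carry +1 when protonated (pH below their pKa).
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    for aa, pka in PKA_POSITIVE.items():
        charge += seq.count(aa) / (1.0 + 10 ** (ph - pka))
    # Acidic groups carry -1 when deprotonated (pH above their pKa).
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa, pka in PKA_NEGATIVE.items():
        charge -= seq.count(aa) / (1.0 + 10 ** (pka - ph))
    return charge


def estimate_pi(sequence, tolerance=0.001):
    """Bisect on pH in [0, 14] until the net charge changes sign.

    Net charge decreases monotonically with pH, so bisection converges.
    """
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0
```

The estimated pI could then be appended to the 20-value percent composition vector as a 21st input feature, which is one plausible reading of "trained with pI" in the caption.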
Figure 6. Flow chart of the expression and visualization of hypothetical proteins.
Sequences were searched against the RefSeq and Conserved Domain Databases to remove proteins that have a known function based on their annotations. Hypothetical protein sequences are synthesized into genes that are expressed and purified in vivo. Soluble proteins are negatively stained for visualization by TEM. An example image is shown at the bottom right corner of the figure.
Figure 7. ANN results and TEM images of hypothetical proteins from φMa-LMM01.
The locations of ORFs with known (red or blue arrows) and unknown functions (orange arrows) based on GenBank annotations are shown in panel A. Samples 5513–5519 have black labels, which represent ORFs that were identified as structural proteins by ANNs but have no known function. Boxplots (B) summarize the predictions made by Capsid and Tail ANNs. Representative TEM images of purified proteins from Sample 5515 (C) and 5519 (D) are shown.
Figure 8. ANN results and TEM images of a φP-SSM2 hypothetical protein that resembles tail fibers.
A genome map of φP-SSM2 (panel A) shows the locations of ORFs with known (red or blue arrows) and unknown functions (orange arrows). Red or blue labels indicate ORF sequences that are structural proteins based on GenBank annotations. Samples 5520–5525 have black labels, which represent ORFs that were identified as structural proteins by Structural ANNs but have no known function. Boxplots (B) summarize the predictions made by Capsid and Tail ANNs. Panel C shows representative TEM images of soluble, purified proteins from sample 5525 that strongly resemble phage tail fibers.
Figure 9. ANN results and TEM images of a φIEBH hypothetical protein that resembles procapsids.
ORFs of known (red or blue arrows) and unknown functions (orange arrows) are shown on a genome map of φIEBH in panel A. Sample 5607 (black label) is an ORF that is identified as a structural protein by ANNs but has no known function. Boxplots (B) summarize the predictions made by Capsid and Tail ANNs based on the sequence of protein 5607. Representative TEM images of soluble, purified proteins that were expressed from the sequence of protein 5607 are shown in panel C.
Figure 10. ANN results and TEM images of a putative major capsid protein from φBcepC6B gp15.
Panel A shows the locations of predicted ORFs in the genome map of φBcepC6B. Red or blue labels and arrows indicate ORF sequences that are structural proteins based on GenBank annotations. Hypothetical proteins are shown as orange arrows. Protein 5610 (black label), or φBcepC6B gp15, is identified as a structural protein by ANNs and has no known function. Boxplots (panel B) summarize predictions of protein 5610 that were made by Capsid and Tail ANNs. Representative TEM images (panel C) of soluble, purified proteins that were expressed from the sequence of 5610 resemble procapsid structures with various morphologies. The red-outlined image in panel D shows an empty procapsid, which resembles a “broken” head structure (blue-outlined image) of Pseudomonas aeruginosa PA0 phage F116 that is illustrated by a white inset image. Panel E shows images of a “folded” head structure from the soluble, purified proteins of 5610 (red outline) that is similar to that of φF116 (blue outline with white inset image [51]).

References

    1. Rohwer F, Prangishvili D, Lindell D (2009) Roles of viruses in the environment. Environ Microbiol 11: 2771–2774.
    2. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, et al. (2008) Functional metagenomic profiling of nine biomes. Nature 452: 629–632.
    3. Dinsdale EA, Pantos O, Smriga S, Edwards RA, Angly F, et al. (2008) Microbial ecology of four coral atolls in the Northern Line Islands. PLoS One 3: e1584.
    4. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368.
    5. Suttle CA (2007) Marine viruses-major players in the global ecosystem. Nat Rev Microbiol 5: 801–812.