PLoS Comput Biol. 2012;8(8):e1002657.

doi: 10.1371/journal.pcbi.1002657. Epub 2012 Aug 23.

Artificial neural networks trained to detect viral and phage structural proteins

Victor Seguritan¹, Nelson Alves Jr, Michael Arnoult, Amy Raymond, Don Lorimer, Alex B Burgin Jr, Peter Salamon, Anca M Segall

PMID: 22927809
PMCID: PMC3426561
DOI: 10.1371/journal.pcbi.1002657

Victor Seguritan et al. PLoS Comput Biol. 2012.

Affiliation

¹ Program of Computational Science, San Diego State University, San Diego, California, United States of America.

Abstract

Phages play critical roles in the survival and pathogenicity of their hosts, via lysogenic conversion factors, and in nutrient redistribution, via cell lysis. Analyses of phage- and viral-encoded genes in environmental samples provide insights into the physiological impact of viruses on microbial communities and human health. However, phage ORFs are extremely diverse, and over 70% of them are dissimilar to any genes with annotated functions in GenBank. Better identification of viruses would also aid in better detection and diagnosis of disease, in vaccine development, and generally in better understanding the physiological potential of any environment. In contrast to that of enzymes, the function of viral structural proteins can be much more challenging to detect from sequence data because of low sequence conservation, few known conserved catalytic sites or sequence domains, and relatively limited experimental data. We have designed a method of predicting phage structural protein sequences that uses Artificial Neural Networks (ANNs). First, we trained ANNs to classify viral structural proteins using amino acid frequency; these correctly classify a large fraction of test cases with a high degree of specificity and sensitivity. Subsequently, we added estimates of protein isoelectric points as a feature to ANNs that classify specialized families of proteins, namely major capsid and tail proteins. As expected, these more specialized ANNs are more accurate than the structural ANNs. To experimentally validate the ANN predictions, several ORFs that are ANN-predicted structural proteins but have no significant similarities to known sequences were examined by transmission electron microscopy. Some of these self-assembled into structures strongly resembling virion structures. Thus, our ANNs are new tools for identifying phage and potential prophage structural proteins that are difficult or impossible to detect by other bioinformatic analyses. The networks will be valuable when sequence is available but in vitro propagation of the phage may not be practical or possible.
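The amino acid frequency feature described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `percent_composition`, the alphabetical residue ordering, and the choice to ignore non-standard residue letters are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def percent_composition(sequence):
    """Return the percentage of each standard amino acid in a protein sequence.

    Letters outside the 20 standard residues (e.g. X, B, Z) are ignored;
    that handling is an assumption of this sketch.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One value per residue type, in the fixed order of AMINO_ACIDS.
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to show the shape of the output.
features = percent_composition("MKKLAAGG")
```

The result is a fixed-length vector of 20 percentages that sums to 100 regardless of protein length, which is what makes it usable as a network input for sequences of different sizes.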

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Overview of neural network training and evaluation.**
Protein sequences were downloaded and unwanted sequences were removed. The percent compositions of amino acids in all protein sequences were calculated and distributed into training, validation, and test sets. The network architecture and an appropriately sized validation set were determined ad hoc by training many networks. We selected those ANNs that correctly classified the highest number of test cases based on 10-fold cross validation. Voting neural networks were generated from 160-fold cross validation and the appropriate number of networks to use in an ensemble was determined by the ensemble with the best accuracy. The voting ensemble that correctly classified the most test cases was used to determine the overall correct classification rate, specificity, and sensitivity.
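As a rough illustration of the k-fold cross validation mentioned in this caption, the following Python sketch partitions example indices into k folds and yields each fold once as the held-out set. The caption does not say how the authors partitioned their data, so the shuffling, the seed, and the name `k_fold_splits` are assumptions.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


# Hypothetical example: 25 sequences split for 10-fold cross validation.
splits = list(k_fold_splits(n_items=25, k=10))
```

Each example is held out exactly once, so training one network per fold yields k networks, each evaluated on data it never saw during training.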

**Figure 3. Accuracies of networks trained with different architectures and distributions of training and validation sets.**
The correct classification rates of Structural, MCP, and Tail ANNs are shown on the Y-axes in all panels. Red diamonds represent mean correct classification frequencies, and the maximum mean frequencies are labeled as red percentage values. In the left set of boxplots, a single number on the x-axis represents the number of neurons in 1 hidden layer; two numbers delimited by an ‘x’ indicate the number of neurons in hidden layers 1 and 2. The left side of Panel A summarizes the 1- and 2-layer architectures of the top 4 networks based on correct classifications, whereas Panels B and C show all network architectures that were tested. In the right set of boxplots, the pairs of numbers on the x-axis represent the percentages of non-test set sequences that were split into the training and validation sets.

**Figure 4. Performance of Structural ANNs.**
Panel A summarizes the performance of Structural ANN ensembles. Each ensemble consists of an odd number of voting ANNs ranging between 5 and 141, with the exception of the single best performing ANN and the ensemble containing all 160 ANNs. Performance is measured by the accuracy, specificity, and sensitivity of the networks, which were presented with the amino acid frequencies of curated phage sequences. The sequences were best classified by the top 5 voting ANNs, as judged by the mean accuracy, specificity, and sensitivity values that appear above the red striped columns. The performance of an ensemble of the top 5 voting ANNs was also assessed with the curated sequences that were used to test the MCP and Tail ANNs. Histograms in panel B show the accuracy of the Structural ANNs in classifying capsid from non-capsid and tail from non-tail test sequences that were also used to test MCP and Tail ANNs.
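The voting and scoring described in this caption can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: labels are taken to be binary (1 = structural, 0 = non-structural), a tie (possible only with an even number of voters) is resolved as negative, and the function names are hypothetical.

```python
def majority_vote(predictions):
    """Combine binary predictions (0 or 1) from several networks.

    A tie, possible only with an even number of voters, is resolved as 0
    here; that tie-breaking rule is an assumption of this sketch.
    """
    return 1 if sum(predictions) > len(predictions) / 2 else 0


def classification_metrics(true_labels, predicted_labels):
    """Return accuracy, sensitivity, and specificity for binary labels.

    Assumes at least one positive and one negative example are present.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # fraction of true positives recovered
        "specificity": tn / (tn + fp),  # fraction of true negatives recovered
    }


# Hypothetical votes from 5 networks on 4 test sequences.
votes_per_sequence = [
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
true_labels = [1, 0, 1, 1]
ensemble_labels = [majority_vote(votes) for votes in votes_per_sequence]
metrics = classification_metrics(true_labels, ensemble_labels)
```

Using an odd number of voters, as most ensembles in the figure do, means the tie-breaking rule is never exercised.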

**Figure 5. Performance of major Capsid and Tail ANNs.**
The accuracy, sensitivity, and specificity of Capsid and Tail ANNs are based on the classifications of test cases from the RefSeq database. All training sets contained the amino acid percent composition (PC) of positive and negative examples. Capsid ANNs trained without and with isoelectric point (pI) values are shown in the left and right histograms of panel A. Tail Protein ANNs trained without and with isoelectric point values are shown in the left and right histograms of panel B. Ratios of positive to negative examples are shown on the X-axis in ascending order. Error bars represent standard error.
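For readers unfamiliar with the error bars, the standard error of a mean is the sample standard deviation divided by the square root of the number of measurements. A minimal Python sketch, with made-up accuracy values and the assumption that the bars are computed over repeated network runs:

```python
import math
import statistics


def standard_error(values):
    """Return the standard error of the mean for a list of measurements."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # statistics.stdev computes the sample (n - 1) standard deviation.
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical accuracies from four networks; not data from the paper.
accuracies = [0.90, 0.92, 0.88, 0.94]
se = standard_error(accuracies)
```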

**Figure 6. Flow chart of the expression and visualization of hypothetical proteins.**
Sequences were searched against the RefSeq and Conserved Domain Databases to remove proteins that have a known function based on their annotations. Hypothetical protein sequences are synthesized into genes that are expressed and purified *in vivo*. Soluble proteins are negatively stained for visualization by TEM. An example image is shown at the bottom right corner of the figure.

**Figure 7. ANN results and TEM images of hypothetical proteins from φMa-LMM01.**
The locations of ORFs with known (red or blue arrows) and unknown functions (orange arrows) based on GenBank annotations are shown in panel A. Samples 5513–5519 have black labels, which represent ORFs that were identified as structural proteins by ANNs but have no known function. Boxplots (B) summarize the predictions made by Capsid and Tail ANNs. Representative TEM images of purified proteins from Sample 5515 (C) and 5519 (D) are shown.

**Figure 8. ANN results and TEM images of a φP-SSM2 hypothetical protein that resembles tail fibers.**
A genome map of φP-SSM2 (panel A) shows the locations of ORFs with known (red or blue arrows) and unknown functions (orange arrows). Red or blue labels indicate ORF sequences that are structural proteins based on GenBank annotations. Samples 5520–5525 have black labels, which represent ORFs that were identified as structural proteins by Structural ANNs but have no known function. Boxplots (B) summarize the predictions made by Capsid and Tail ANNs. Panel C shows representative TEM images of soluble, purified proteins from sample 5525 that strongly resemble phage tail fibers.

**Figure 9. ANN results and TEM images of a φIEBH hypothetical protein that resembles procapsids.**
ORFs of known (red or blue arrows) and unknown functions (orange arrows) are shown on a genome map of φIEBH in panel A. Sample 5607 (black label) is an ORF that is identified as a structural protein by ANNs but has no known function. Boxplots (B) summarize the predictions made by Capsid and Tail ANNs based on the sequence of protein 5607. Representative TEM images of soluble, purified proteins that were expressed from the sequence of protein 5607 are shown in panel C.

**Figure 10. ANN results and TEM images of a putative major capsid protein from φBcepC6B gp15.**
Panel A shows the locations of predicted ORFs in the genome map of φBcepC6B. Red or blue labels and arrows indicate ORF sequences that are structural proteins based on GenBank annotations. Hypothetical proteins are shown as orange arrows. Protein 5610 (black label), or φBcepC6B gp15, is identified as a structural protein by ANNs and has no known function. Boxplots (panel B) summarize predictions of protein 5610 that were made by Capsid and Tail ANNs. Representative TEM images (panel C) of soluble, purified proteins that were expressed from the sequence of 5610 resemble procapsid structures with various morphologies. The red-outlined image in panel D shows an empty procapsid, which resembles a “broken” head structure (blue-outlined image) of *Pseudomonas aeruginosa* PA0 phage F116 that is also illustrated by a white inset image. Panel E shows images of a “folded” head structure from the soluble, purified proteins of 5610 (red outline) that is similar to that of φF116 (blue outline with white inset image [51]).

Cited by

Identification of Phage Viral Proteins With Hybrid Sequence Features.
Ru X, Li L, Wang C. Front Microbiol. 2019 Mar 26;10:507. doi: 10.3389/fmicb.2019.00507. PMID: 30972038.
Identification of Known and Novel Recurrent Viral Sequences in Data from Multiple Patients and Multiple Cancers.
Friis-Nielsen J, Kjartansdóttir KR, Mollerup S, Asplund M, Mourier T, Jensen RH, Hansen TA, Rey-Iglesia A, Richter SR, Nielsen IB, Alquezar-Planas DE, Olsen PV, Vinner L, Fridholm H, Nielsen LP, Willerslev E, Sicheritz-Pontén T, Lund O, Hansen AJ, Izarzugaza JM, Brunak S. Viruses. 2016 Feb 19;8(2):53. doi: 10.3390/v8020053. PMID: 26907326.
The human gut virome: a multifaceted majority.
Ogilvie LA, Jones BV. Front Microbiol. 2015 Sep 11;6:918. doi: 10.3389/fmicb.2015.00918. PMID: 26441861. Review.
PVPred-SCM: Improved Prediction and Analysis of Phage Virion Proteins Using a Scoring Card Method.
Charoenkwan P, Kanthawong S, Schaduangrat N, Yana J, Shoombuatong W. Cells. 2020 Feb 3;9(2):353. doi: 10.3390/cells9020353. PMID: 32028709.
Predicting Bacteriophage Enzymes and Hydrolases by Using Combined Features.
Li HF, Wang XF, Tang H. Front Bioeng Biotechnol. 2020 Mar 24;8:183. doi: 10.3389/fbioe.2020.00183. PMID: 32266225.

References

1. Rohwer F, Prangishvili D, Lindell D (2009) Roles of viruses in the environment. Environ Microbiol 11: 2771–2774.
2. Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, et al. (2008) Functional metagenomic profiling of nine biomes. Nature 452: 629–632.
3. Dinsdale EA, Pantos O, Smriga S, Edwards RA, Angly F, et al. (2008) Microbial ecology of four coral atolls in the Northern Line Islands. PLoS One 3: e1584.
4. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. (2006) The marine viromes of four oceanic regions. PLoS Biol 4: e368.
5. Suttle CA (2007) Marine viruses-major players in the global ecosystem. Nat Rev Microbiol 5: 801–812.
