Identification and characterization of Coronaviridae genomes from Vietnamese bats and rats based on conserved protein domains

My V T Phan¹,², Tue Ngo Tri³, Pham Hong Anh³, Stephen Baker³, Paul Kellam⁴,⁵, Matthew Cotten¹,²

Affiliations

¹ Virus Genomics, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.
² Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands.
³ Wellcome Trust Major Overseas Programme, Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam.
⁴ Department of Infection and Immunity, Imperial College London, London, UK.
⁵ Kymab Ltd, Babraham Research Campus, Cambridge, UK.

PMID: 30568804
PMCID: PMC6295324
DOI: 10.1093/ve/vey035

My V T Phan et al. Virus Evol. 2018 Dec 15;4(2):vey035. doi: 10.1093/ve/vey035. eCollection 2018 Jul.

Abstract

The Coronaviridae family of viruses encompasses a group of pathogens with zoonotic potential, as observed in previous outbreaks of the severe acute respiratory syndrome coronavirus and Middle East respiratory syndrome coronavirus. Accordingly, it seems important to identify and document the coronaviruses in animal reservoirs, many of which are uncharacterized and potentially missed by more standard diagnostic assays. A combination of sensitive deep sequencing technology and computational algorithms is essential for virus surveillance, especially for characterizing novel or distantly related virus strains. Here, we explore the use of profile Hidden Markov Model-defined Pfam protein domains (Pfam domains) encoded by new sequences as a Coronaviridae sequence classification tool. The encoded domains are used first in a triage to identify potential Coronaviridae sequences and then processed using a Random Forest method to classify the sequences to the Coronaviridae genus level. The application of this algorithm to Coronaviridae genomes assembled from agnostic deep sequencing data from surveillance of bats and rats in Dong Thap province (Vietnam) identified thirty-four Alphacoronavirus and eleven Betacoronavirus genomes. This collection of bat and rat coronavirus genomes provided essential information on the local diversity of coronaviruses and substantially expanded the number of coronavirus full genomes available from bats and rats, and it may facilitate further molecular studies on this group of viruses.

Keywords: Pfam; machine learning; profile Hidden Markov model; protein domains; random forest; virus classification.

Figures

**Figure 1.**
Location of the sampling sites. The right panel shows the map of Vietnam; the left inset shows the Dong Thap province (marked in blue and separated by dotted lines from neighboring provinces within the Mekong Delta region of southern Vietnam). The Mekong Delta river branches and flooding areas are marked in green. Names of communal regions within Dong Thap province are indicated. Locations of the guano farms where bats were sampled are marked with red diamonds, and the locations of rat sampling sites are marked with orange triangles.

**Figure 2.**
The distribution of Pfam domains across *Coronaviridae* genera. Panel A: Two examples of *Alpha-, Beta-, Gamma-, Delta-, Toro-* and *Bafinivirus* were selected and all protein domains encoded by the full genomes, detected by profile HMMs, were identified and their positions in each virus genome are marked by colored rectangles. Panel B: The *Coronaviridae* Absolute Triage Domains (CATDs) are marked in red, the *Coronaviridae* Triage Domains (CTDs) are marked in orange, the frequent Pfam domains are marked in shades of blue, the moderately frequent Pfam domains are marked in shades of green and the rare Pfam domains are marked in gray.

**Figure 3.**
Cluster map of Pfam protein domains encoded by *Coronaviridae* genomes. The protein domain repertoire, as detected by profile HMMs, is plotted as the frequency of each domain in all available full genomes from all *Coronaviridae* genera. Each row represents a protein domain, each column represents a *Coronaviridae* genus. Colors indicate domain frequency within that genus (darkest blue = 1 = all genomes in this genus encode this domain; white = 0 = no genomes in this genus encode this domain, see color bar at upper left).
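The per-genus domain frequency plotted in the cluster map can be illustrated with a minimal Python sketch. It assumes that each genome has already been reduced to a genus label and a set of detected domain names; the function name `domain_frequencies` and the domain names `domain_A`, `domain_B` and `domain_C` are hypothetical placeholders, not Pfam identifiers from the study.

```python
from collections import defaultdict


def domain_frequencies(genomes):
    """Return {domain: {genus: fraction of genomes in that genus encoding it}}.

    `genomes` is an iterable of (genus, domains) pairs, where `domains` is the
    collection of domain names detected in one full genome (assumed input).
    """
    genus_counts = defaultdict(int)
    domain_counts = defaultdict(lambda: defaultdict(int))
    for genus, domains in genomes:
        genus_counts[genus] += 1
        # Count each domain once per genome, even if it occurs several times.
        for domain in set(domains):
            domain_counts[domain][genus] += 1
    return {
        domain: {
            genus: per_genus.get(genus, 0) / total
            for genus, total in genus_counts.items()
        }
        for domain, per_genus in domain_counts.items()
    }


# Toy input with placeholder domain names (not real data).
toy_genomes = [
    ("Alphacoronavirus", {"domain_A", "domain_B"}),
    ("Alphacoronavirus", {"domain_A"}),
    ("Betacoronavirus", {"domain_A", "domain_C"}),
]
frequencies = domain_frequencies(toy_genomes)
```

In this toy input, a frequency of 1 corresponds to the darkest blue cell of the cluster map and 0 to a white cell.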

**Figure 4.**
Sensitivity and specificity plot of various triage conditions. The HMM domain content of the forty-one virus mock contig set (111,577 viral genome fragments including 3,316 *Coronaviridae* fragments) was determined for each fragment. The CTD or CATD domain content plus the contig length (≥500 nt, ≥3,000 nt, ≥10,000 nt, ≥20,000 nt) were used as a triage to classify fragments as ‘*Coronaviridae*’ or ‘not *Coronaviridae*’. The contigs classified as *Coronaviridae* for each triage condition were then identified to the genus level using RF classification. The sensitivity (true positive/(true positive + false negative)) and specificity (true negative/(true negative + false positive)) for each combined triage and classification method were determined based on the original identity of the input genomes. Panel A. RF classification after triage by 500 nt or larger and CTD or CATD content. Panel B. As in A but with 3,000 nt or larger contigs. Panel C. As in A but with 10,000 nt or larger contigs. Panel D. As in A but with 20,000 nt or larger contigs. Each colored node represents the outcome of one complete triage/classification cycle; each combined method was repeated five times.
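The two ratios defined in this caption can be computed as in the following minimal Python sketch. It assumes that the true identity and the predicted label of each fragment are available as parallel lists of booleans (True meaning *Coronaviridae*); the function name and the toy labels are illustrative only and are not results from the study.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for parallel lists of booleans.

    sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = fp = tn = fn = 0
    for is_positive, called_positive in zip(truth, predicted):
        if is_positive and called_positive:
            tp += 1
        elif is_positive and not called_positive:
            fn += 1
        elif not is_positive and called_positive:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels (not data from the study): 3 true positives, 4 true negatives.
toy_truth = [True, True, True, False, False, False, False]
toy_predicted = [True, True, False, False, False, False, True]
toy_sens, toy_spec = sensitivity_specificity(toy_truth, toy_predicted)
```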

**Figure 5.**
Workflow of the *Coronaviridae* classification tool to identify *Coronaviridae* genomes in NGS data. First, short read NGS data from surveillance samples were *de novo* assembled into larger contigs using SPAdes. Subsequently, putative *Coronaviridae* genome sequences were identified by their encoded triage domains (contig length > 10,000 nt and the presence of at least one CTD) followed by machine learning classification (using RF) to the *Coronaviridae* genus level.
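The triage step of this workflow can be sketched in Python as below. The sketch assumes that each assembled contig has already been annotated with its length and the names of its encoded domains; the `Contig` class, the function `passes_triage` and the domain names are hypothetical placeholders, and the subsequent RF classification step is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Contig:
    """Hypothetical container for one de novo assembled contig."""
    name: str
    length: int                      # contig length in nucleotides
    domains: set = field(default_factory=set)  # detected domain names


def passes_triage(contig, triage_domains, min_length=10_000):
    """Apply the caption's rule: length > 10,000 nt and at least one CTD."""
    has_triage_domain = any(d in triage_domains for d in contig.domains)
    return contig.length > min_length and has_triage_domain


# Placeholder triage domain names (assumptions, not the study's CTD list).
toy_ctds = {"ctd_domain_1", "ctd_domain_2"}
toy_contigs = [
    Contig("contig_1", 28_500, {"ctd_domain_1", "other_domain"}),
    Contig("contig_2", 12_000, {"other_domain"}),   # long, but no CTD
    Contig("contig_3", 4_200, {"ctd_domain_2"}),    # CTD, but too short
]
candidates = [c.name for c in toy_contigs if passes_triage(c, toy_ctds)]
```

Only contigs that pass both conditions would be handed on to the genus-level classifier.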

**Figure 6.**
Identification of *Coronaviridae* genomes. *De novo* assembled contigs from rat and bat sample data sets were processed using a triage (contig length > 10,000 nt and the presence of at least one CTD) followed by RF classification to the *Coronaviridae* genus level. About forty-five samples contained *Alphacoronavirus* and *Betacoronavirus* sequences with probabilities > 0.5 (darker blue in the heatmap). These sequences were included in the complete set of samples processed for full genome coronavirus handling. Panel A. Heatmap of predicted *Coronaviridae* genus probabilities. Panel B. Table of predicted probabilities.

**Figure 7.**
Analyses of identified coronavirus genomes. Panel A. Open reading frames and domain content of the three classes of coronavirus identified in this study. All open reading frames > 130 amino acids in length and the Pfam domains are displayed for an example reported genome from each of lineages 1 and 2 of *Alphacoronavirus*, and from *Betacoronavirus*, plus the closest known genomes (*Alphacoronavirus Scotophilus* bat CoV 512, NC_009657 and *Betacoronavirus* strain HKU24, NC_026011). Panel B. Maximum-likelihood phylogenetic tree of the spike protein coding sequences from *Alphacoronaviruses* from this study (highlighted in red) plus selected reference sequences. The tree is mid-point rooted for clarity and only bootstraps ≥70 per cent are shown. Horizontal branch lengths are drawn to the scale of nucleotide substitutions per site. Panel C. Maximum-likelihood phylogenetic tree of the spike protein coding sequences from *Betacoronaviruses* plus a collection of spike coding regions from relevant *Betacoronaviruses*. The tree is mid-point rooted for clarity and only bootstraps ≥70 per cent are shown. Horizontal branch lengths are drawn to the scale of nucleotide substitutions per site.

Grants and funding

WT_/Wellcome Trust/United Kingdom
