Virus Evol. 2018 Dec 15;4(2):vey035.
doi: 10.1093/ve/vey035. eCollection 2018 Jul.

Identification and characterization of Coronaviridae genomes from Vietnamese bats and rats based on conserved protein domains

My V T Phan et al. Virus Evol.

Abstract

The Coronaviridae family of viruses encompasses a group of pathogens with zoonotic potential, as observed in previous outbreaks of severe acute respiratory syndrome coronavirus and Middle East respiratory syndrome coronavirus. Accordingly, it seems important to identify and document the coronaviruses in animal reservoirs, many of which are uncharacterized and potentially missed by more standard diagnostic assays. A combination of sensitive deep sequencing technology and computational algorithms is essential for virus surveillance, especially for characterizing novel or distantly related virus strains. Here, we explore the use of profile Hidden Markov Model-defined Pfam protein domains (Pfam domains) encoded by new sequences as a Coronaviridae sequence classification tool. The encoded domains are first used in a triage to identify potential Coronaviridae sequences, which are then processed using a Random Forest method to classify them to the Coronaviridae genus level. Applying this algorithm to Coronaviridae genomes assembled from agnostic deep sequencing data from surveillance of bats and rats in Dong Thap province (Vietnam) identified thirty-four Alphacoronavirus and eleven Betacoronavirus genomes. This collection of bat and rat coronavirus genomes provided essential information on the local diversity of coronaviruses, substantially expanded the number of full coronavirus genomes available from bats and rats, and may facilitate further molecular studies on this group of viruses.

Keywords: Pfam; machine learning; profile Hidden Markov model; protein domains; random forest; virus classification.

Figures

Figure 1.
Location of the sampling sites. The right panel shows the map of Vietnam; the left inset shows the Dong Thap province (marked in blue and separated by dotted lines from neighboring provinces within the Mekong Delta region of southern Vietnam). The Mekong Delta river branches and flooding areas are marked in green. Names of communal regions within Dong Thap province are indicated. Locations of the guano farms where bats were sampled are marked with red diamonds, and the locations of rat sampling sites are marked with orange triangles.
Figure 2.
The distribution of Pfam domains across Coronaviridae genera. Panel A: Two examples of Alpha-, Beta-, Gamma-, Delta-, Toro- and Bafinivirus were selected; all protein domains encoded by the full genomes were detected by profile HMMs, and their positions in each virus genome are marked by colored rectangles. Panel B: The Coronaviridae Absolute Triage Domains (CATDs) are marked in red, the Coronaviridae Triage Domains (CTDs) in orange, the frequent Pfam domains in shades of blue, the moderately frequent Pfam domains in shades of green, and the rare Pfam domains in gray.
Figure 3.
Cluster map of Pfam protein domains encoded by Coronaviridae genomes. The protein domain repertoire, as detected by profile HMMs, is plotted as the frequency of each domain in all available full genomes from all Coronaviridae genera. Each row represents a protein domain, each column represents a Coronaviridae genus. Colors indicate domain frequency within that genus (darkest blue = 1 = all genomes in this genus encode this domain; white = 0 = no genomes in this genus encode this domain, see color bar at upper left).
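The per-genus domain frequency plotted in this cluster map can be illustrated with a minimal sketch. It assumes each genome has already been reduced to a genus label and the set of Pfam domains detected in it; the function name `domain_frequencies` and the domain names `domA`, `domB`, `domC` are hypothetical placeholders, not identifiers or data from the study.

```python
from collections import defaultdict


def domain_frequencies(genomes):
    """Return {domain: {genus: fraction of genomes in that genus encoding it}}.

    genomes is an iterable of (genus, domains) pairs, where domains is the
    collection of Pfam domain names detected in one full genome.
    """
    genus_counts = defaultdict(int)
    domain_counts = defaultdict(lambda: defaultdict(int))
    for genus, domains in genomes:
        genus_counts[genus] += 1
        # set() so a domain found twice in one genome is counted once.
        for domain in set(domains):
            domain_counts[domain][genus] += 1
    return {
        domain: {
            genus: domain_counts[domain].get(genus, 0) / n
            for genus, n in genus_counts.items()
        }
        for domain in domain_counts
    }


# Toy input (invented for illustration, not data from the study).
toy_genomes = [
    ("Alphacoronavirus", {"domA", "domB"}),
    ("Alphacoronavirus", {"domA"}),
    ("Betacoronavirus", {"domA", "domC"}),
]
freq = domain_frequencies(toy_genomes)
for domain in sorted(freq):
    print(domain, freq[domain])
```

A value of 1 corresponds to the darkest cell in the map (every genome in the genus encodes the domain) and 0 to a white cell.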
Figure 4.
Sensitivity and specificity plot of various triage conditions. The HMM domain content of the forty-one virus mock contig set (111,577 viral genome fragments including 3,316 Coronaviridae fragments) was determined for each fragment. The CTD or CATD domain content plus the contig length (≥500 nt, ≥3,000 nt, ≥10,000 nt, ≥20,000 nt) were used as a triage to classify fragments as ‘Coronaviridae’ or ‘not Coronaviridae’. The contigs classified as Coronaviridae for each triage condition were then identified to the genus level using RF classification. The sensitivity (true positive/(true positive + false negative)) and specificity (true negative/(true negative + false positive)) for each combined triage and classification method were determined based on the original identity of the input genomes. Panel A. RF classification after triage by 500 nt or larger and CTD or CATD content. Panel B. As in A but with 3,000 nt or larger contigs. Panel C. As in A but with 10,000 nt or larger contigs. Panel D. As in A but with 20,000 nt or larger contigs. Each colored node represents the outcome of one complete triage/classification cycle; each combined method was repeated five times.
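The sensitivity and specificity definitions in this caption can be checked with a short sketch. It assumes each fragment carries a true label and a predicted label as booleans; the function name `sensitivity_specificity` and the toy labels are assumptions for illustration, not the authors' code or data.

```python
def sensitivity_specificity(truth, predicted):
    """Compute (sensitivity, specificity) from paired boolean labels.

    truth[i] is True if fragment i really is Coronaviridae;
    predicted[i] is True if the triage/classifier called it Coronaviridae.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    # Guard against empty classes rather than dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels (invented for illustration, not data from the study).
truth = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, True, False]
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

With these toy labels there are 2 true positives, 1 false negative, 4 true negatives and 1 false positive, giving a sensitivity of 2/3 and a specificity of 4/5.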
Figure 5.
Workflow of the Coronaviridae classification tool to identify Coronaviridae genomes in NGS data. First, short read NGS data from surveillance samples were de novo assembled into larger contigs using SPAdes. Subsequently, putative Coronaviridae genome sequences were identified by their encoded triage domains (contig length > 10,000 nt and the presence of at least one CTD) followed by machine learning classification (using RF) to the Coronaviridae genus level.
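The triage step in this workflow (contig length > 10,000 nt and at least one CTD) can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the `Contig` class, the `passes_triage` function and the domain names in `TRIAGE_DOMAINS` are hypothetical placeholders, since the actual CTD list is defined in the paper and not reproduced here. The downstream Random Forest genus classification is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Contig:
    name: str
    length: int                                # contig length in nucleotides
    domains: set = field(default_factory=set)  # Pfam domains detected on it


# Placeholder names standing in for the Coronaviridae Triage Domains (CTDs).
TRIAGE_DOMAINS = {"CTD_example_1", "CTD_example_2"}
MIN_LENGTH = 10_000  # threshold stated in the caption (strictly greater than)


def passes_triage(contig, triage_domains=TRIAGE_DOMAINS, min_length=MIN_LENGTH):
    """True if the contig is long enough and encodes at least one triage domain."""
    return contig.length > min_length and bool(contig.domains & triage_domains)


# Toy contigs (invented for illustration, not data from the study).
contigs = [
    Contig("c1", 27_500, {"CTD_example_1", "other_domain"}),  # passes
    Contig("c2", 12_000, {"other_domain"}),                   # no triage domain
    Contig("c3", 4_200, {"CTD_example_2"}),                   # too short
    Contig("c4", 10_000, {"CTD_example_1"}),                  # not > 10,000 nt
]
candidates = [c.name for c in contigs if passes_triage(c)]
print(candidates)
```

Only contigs that pass this filter would be handed to the genus-level classifier.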
Figure 6.
Identification of Coronaviridae genomes. De novo assembled contigs from rat and bat sample data sets were processed using a triage (contig length > 10,000 nt and the presence of at least one CTD) followed by RF classification to the Coronaviridae genus level. About forty-five samples contained Alphacoronavirus and Betacoronavirus sequences with probabilities > 0.5 (darker blue in the heatmap). These sequences were included in the complete set of samples processed for full genome coronavirus handling. Panel A. Heatmap of predicted Coronaviridae genus probabilities. Panel B. Table of predicted probabilities.
Figure 7.
Analyses of identified coronavirus genomes. Panel A. Open reading frames and domain content of the three classes of coronavirus identified in this study. All open reading frames > 130 amino acids in length and the Pfam domains are displayed for one example genome reported here from each of Alphacoronavirus lineage 1, Alphacoronavirus lineage 2, and Betacoronavirus, plus the closest known genomes (Alphacoronavirus Scotophilus bat CoV 512, NC_009657 and Betacoronavirus strain HKU24, NC_026011). Panel B. Maximum-likelihood phylogenetic tree of the spike protein coding sequences from Alphacoronaviruses from this study (highlighted in red) plus selected reference sequences. The tree is mid-point rooted for clarity and only bootstrap values ≥70 per cent are shown. Horizontal branch lengths are drawn to the scale of nucleotide substitutions per site. Panel C. Maximum-likelihood phylogenetic tree of the spike protein coding sequences from Betacoronaviruses plus a collection of spike coding regions from relevant Betacoronaviruses. The tree is mid-point rooted for clarity and only bootstrap values ≥70 per cent are shown. Horizontal branch lengths are drawn to the scale of nucleotide substitutions per site.

References

    1. Anthony S. J., et al. (2017) ‘Global Patterns in Coronavirus Diversity’, Virus Evolution, 3: vex012.
    1. Assiri A., et al. (2013) ‘Hospital Outbreak of Middle East Respiratory Syndrome Coronavirus’, New England Journal of Medicine, 369: 407–16.
    1. Bankevich A., et al. (2012) ‘SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing’, Journal of Computational Biology: A Journal of Computational Molecular Cell Biology, 19: 455–77.
    1. Boom R., et al. (1990) ‘Rapid and Simple Method for Purification of Nucleic Acids’, Journal of Clinical Microbiology, 28: 495–503.
    1. Cotten M., et al. (2014) ‘Full Genome Virus Detection in Fecal Samples Using Sensitive Nucleic Acid Preparation, Deep Sequencing, and a Novel Iterative Sequence Classification Algorithm’, PLoS One, 9: e93269.