Sci Rep. 2023 Jun 8;13(1):9319.
doi: 10.1038/s41598-023-35861-7.

Using machine learning to detect coronaviruses potentially infectious to humans

Georgina Gonzalez-Isunza et al. Sci Rep. 2023.

Abstract

Establishing the host range for novel viruses remains a challenge. Here, we address the challenge of identifying non-human animal coronaviruses that may infect humans by creating an artificial neural network model that learns from spike protein sequences of alpha and beta coronaviruses and their binding annotation to their host receptor. The proposed method produces a human-Binding Potential (h-BiP) score that distinguishes, with high accuracy, the binding potential among coronaviruses. Three viruses, previously unknown to bind human receptors, were identified: Bat coronavirus BtCoV/133/2005 and Pipistrellus abramus bat coronavirus HKU5-related (both MERS-related viruses), and Rhinolophus affinis coronavirus isolate LYRa3 (a SARS-related virus). We further analyze the binding properties of BtCoV/133/2005 and LYRa3 using molecular dynamics. To test whether this model can be used for surveillance of novel coronaviruses, we re-trained the model on a set that excludes SARS-CoV-2 and all viral sequences released after SARS-CoV-2 was published. The results predict the binding of SARS-CoV-2 to a human receptor, indicating that machine learning methods are an excellent tool for the prediction of host expansion events.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Methodological workflow of the human Binding Potential (h-BiP) score. Left: preprocessing sequences from alpha and beta coronaviruses. Top: whether the S protein was available from annotation or extracted from the whole genome, the dataset consists of 2534 unique S protein sequences. Each protein sequence is transformed into a trimer (3 amino acid) representation by sliding a window one amino acid at a time. Bottom: we curated the host field and annotated the sequences according to their binding status to human receptors. Regardless of the host, a virus is considered positive for binding if there is experimental evidence of binding to a human receptor. Right: a skip-gram model uses a neural network to generate trimer embeddings of a fixed dimension (d = 100). These trimer embeddings are numerical vectors that encode information from all neighboring trimers within a context window in the protein sequence. Next, we compute the final sequence embedding (d = 100) by adding up all of its trimer embeddings. The scatterplot visualizes the embeddings of all viruses after dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE). Finally, all sequence embeddings feed a classifier (logistic regression), which learns from the binding information of alpha and beta coronaviruses, to produce the h-BiP score. An h-BiP score greater than or equal to 0.5 flags the virus as likely to bind a human receptor.
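The scoring steps in this caption can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the two-dimensional toy vectors (the caption uses d = 100 from a trained skip-gram model), the classifier weights, and the choice to skip trimers that have no embedding are all assumptions made for the example.

```python
import math

def trimers(seq, k=3):
    """Overlapping k-mers from a window that slides one residue at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def sequence_embedding(seq, trimer_vectors, dim):
    """Sum the embedding vectors of every trimer in the sequence.

    Assumption: trimers without a vector are skipped.
    """
    total = [0.0] * dim
    for t in trimers(seq):
        vec = trimer_vectors.get(t)
        if vec is None:
            continue
        for j in range(dim):
            total[j] += vec[j]
    return total

def logistic_score(embedding, weights, bias):
    """Logistic regression output in (0, 1) for one sequence embedding."""
    z = bias + sum(w * x for w, x in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-z))

def flags_human_binding(score, threshold=0.5):
    """A score at or above the threshold flags the sequence as a likely binder."""
    return score >= threshold

# Hypothetical toy data: a 5-residue fragment and made-up 2-d trimer vectors.
toy_vectors = {"MFV": [1.0, 0.0], "FVF": [0.5, 0.5], "VFL": [0.0, 1.0]}
toy_embedding = sequence_embedding("MFVFL", toy_vectors, dim=2)
toy_score = logistic_score(toy_embedding, weights=[1.0, -1.0], bias=0.0)
print(trimers("MFVFL"), toy_embedding, toy_score, flags_human_binding(toy_score))
```

In a real pipeline the trimer vectors would come from the trained skip-gram model and the weights from fitting the classifier on the annotated sequences; the toy values here only show how the pieces connect.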
Figure 2
Comparison of sequence % identity and h-BiP score for alpha and beta coronaviruses. The x-axis represents the maximum % identity computed from a particular virus against the seven known human coronaviruses. The y-axis shows the h-BiP score. Each point in the graph represents a sequence in the dataset. Regardless of their host, red crosses depict sequences of viruses known to bind a human receptor, and grey points represent those viruses not known to do so. Points above the blue dashed horizontal line have an h-BiP score greater than or equal to 0.5 (i.e., positive for binding). The blue dashed vertical line is the 97% identity reference line. The spike protein of bat coronavirus RaTG13 (depicted with a red star) is known to bind to human receptor hACE2; it has 97.46% amino acid identity to SARS-CoV-2 and a 0.999 h-BiP score. Three viruses with h-BiP ≥ 0.5 and as yet unknown binding, Bt133, HKU5r and LYRa3, are highlighted.
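The x-axis quantity can be illustrated with a short sketch. It assumes each query has already been pairwise aligned to each reference, and it counts identity over all columns where at least one sequence has a residue; identity definitions differ in how they treat gaps, and the caption does not say which one was used. The sequences and names below are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared columns with identical residues.

    Assumption: columns where both sequences have a gap are ignored, and a
    residue opposite a gap counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def max_percent_identity(alignments):
    """Best-matching reference and its identity.

    `alignments` maps a reference name to an (aligned_query, aligned_reference) pair.
    """
    scored = {name: percent_identity(q, r) for name, (q, r) in alignments.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Hypothetical toy alignments of one query against two references.
toy_alignments = {
    "ref_A": ("ACDEF", "ACDEY"),
    "ref_B": ("ACDE-F", "ACDQGF"),
}
print(max_percent_identity(toy_alignments))
```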
Figure 3
Phylogenetic tree for viruses related to Bt133, HKU5r and LYRa3 at the S gene. Pruned version of the maximum-clade-credibility tree generated from 424 alpha and beta coronaviruses (full tree available in Supplementary Fig. S2). Each leaf shows the host, the name of the virus and the binding status, separated by a pipe symbol. Non-human viruses with published binding annotation to a human receptor and human viruses have a binding status of 1 (0 otherwise). Solid gray triangles to the left of a leaf represent multiple variants in that leaf. (a) Phylogenetic tree for the Merbecovirus subgenus. Bat coronaviruses Bt133 and HKU5r are phylogenetically related to Ty-HKU4 and HKU5, respectively. (b) Phylogenetic tree for the Sarbecovirus subgenus. LYRa3 is phylogenetically related to LYRa11.
Figure 4
Multiple sequence alignment for phylogenetically related viruses at the RBM. Multiple sequence alignment was performed with MUSCLE and visualized with Jalview. A darker shade shows residues conserved in at least 50% of the sequences. (a) Comparison of viruses related to Bt133 within the Merbecovirus subgenus. Ty-HKU4 and MERS are viruses known to bind human receptor hDPP4. Experimental studies found no evidence of binding from HKU5 to hDPP4. Bt133 conserves all contact residues used by Ty-HKU4 to bind hDPP4, as reported in ref. 24 (marked with a pink asterisk). MERS uses four of the same contact residues as Ty-HKU4 (indicated by a blue triangle). HKU5, the only virus in the list unable to bind hDPP4, does not share any of the 8 contact residues from Ty-HKU4, and it shows several deletions at the RBM. (b) Comparison of viruses related to LYRa3 within the Sarbecovirus subgenus. LYRa11 is phylogenetically related to SARS-CoV and there is experimental evidence of binding from both to human receptor hACE2. LYRa11 conserves 12 (out of 17) of the contact residues used by SARS-CoV (marked with a pink asterisk). At the RBM, LYRa3 differs from LYRa11 only at H441, which is not a contact residue used by SARS-CoV. Experimental studies found no evidence of binding from ZC45 to hACE2. ZC45 conserves only 2 out of the 17 contact residues from SARS-CoV, and it shows several deletions at the RBM.
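The two kinds of counting used in this caption, columns conserved in at least 50% of the sequences and contact residues shared with a reference virus, can be sketched as follows. This is an illustrative approximation and not Jalview's shading algorithm: the function names, the handling of gaps, and the toy alignment are assumptions.

```python
from collections import Counter

def conserved_columns(alignment, min_fraction=0.5, gap="-"):
    """Indices of columns whose most common residue (gaps excluded) occurs
    in at least `min_fraction` of all sequences in the alignment."""
    n = len(alignment)
    conserved = []
    for i, column in enumerate(zip(*alignment)):
        counts = Counter(r for r in column if r != gap)
        if counts and counts.most_common(1)[0][1] / n >= min_fraction:
            conserved.append(i)
    return conserved

def conserved_contacts(reference, other, contact_columns, gap="-"):
    """Number of contact columns where `other` carries the same residue as
    `reference`. Columns are 0-based alignment indices."""
    return sum(
        1 for i in contact_columns
        if reference[i] != gap and reference[i] == other[i]
    )

# Hypothetical toy alignment of four sequences of equal aligned length.
toy_alignment = ["ACDE", "ACQF", "A-RG", "GCS-"]
print(conserved_columns(toy_alignment))
print(conserved_contacts(toy_alignment[0], toy_alignment[1], [0, 2, 3]))
```

Applied to a real RBM alignment, `contact_columns` would hold the alignment positions of the reference virus's known contact residues, so the returned count corresponds to statements such as "12 (out of 17)".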
Figure 5
Most frequent contact residues for Ty-HKU4 and Bt133. The RBD is shown in light blue and the DPP4 human receptor in grey. Residues involved in frequent (average ≥ 45% in Supplementary Tables S4 and S5) H-bonds are depicted in different colors. (a) Frequent contact residues for Ty-HKU4 are E518 (magenta), N514 (red), K506 (purple) and K547 (orange). (b) Frequent contact residues for Bt133 are E518 (magenta), N514 (red) and Q515 (yellow).

References

    1. Cui J, Li F, Shi ZL. Origin and evolution of pathogenic coronaviruses. Nat. Rev. Microbiol. 2019;17(3):181–192. doi: 10.1038/s41579-018-0118-9.
    2. Naguib MM, Ellström P, Järhult JD, Lundkvist Å, Olsen B. Towards pandemic preparedness beyond COVID-19. The Lancet Microbe. 2020;1(5):e185–e186. doi: 10.1016/S2666-5247(20)30088-4.
    3. Olival KJ, Hosseini PR, Zambrana-Torrelio C, Ross N, Bogich TL, Daszak P. Host and viral traits predict zoonotic spillover from mammals. Nature. 2017;546(7660):646–650. doi: 10.1038/nature22975.
    4. Plowright RK, et al. Pathways to zoonotic spillover. Nat. Rev. Microbiol. 2017;15(8):502–510. doi: 10.1038/nrmicro.2017.45.
    5. Rodriguez-Morales AJ, et al. History is repeating itself: Probable zoonotic spillover as the cause of the 2019 novel Coronavirus Epidemic. Infez. Med. 2020;28(1):3–5.
