Brief Bioinform. 2021 Mar 22;22(2):855-872.
doi: 10.1093/bib/bbaa420.

Data science in unveiling COVID-19 pathogenesis and diagnosis: evolutionary origin to drug repurposing

Jayanta Kumar Das et al. Brief Bioinform.

Abstract

Motivation: The outbreak of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, in Wuhan has attracted worldwide attention. SARS-CoV-2 causes severe inflammation, which can be fatal. Consequently, there has been a massive and rapid growth in research aimed at shedding light on the mechanisms of infection and the progression of the disease. In this regard, data science is playing a pivotal role in in silico analysis to gain insights into SARS-CoV-2 and the outbreak of COVID-19, in order to forecast, diagnose and develop drugs to tackle the virus. The availability of large multiomics, radiological, bio-molecular and medical datasets requires the development of novel exploratory and predictive models, or the customisation of existing ones to fit the current problem. The high number of approaches generates the need for surveys to guide data scientists and medical practitioners in selecting the right tools to manage their clinical data.

Results: Focusing on data science methodologies, we conduct a detailed study of state-of-the-art works tackling the current pandemic scenario. We consider various current COVID-19 data analytic domains such as phylogenetic analysis, SARS-CoV-2 genome identification, protein structure prediction, host-viral protein interactomics, clinical imaging, epidemiological research and drug discovery. We highlight data types and instances, their generation pipelines and the data science models currently in use. The current study should give a detailed sketch of the road map towards handling COVID-19-like situations by supporting data science experts in choosing the right tools. We also summarise our review, focusing on prime challenges and possible future research directions.

Contact: hguzzi@unicz.it, sroy01@cus.ac.in.

Keywords: COVID-19; SARS-CoV-2; artificial intelligence; data science; network science.

Figures

Figure 1
The trends of COVID-19-related research publications from two sources: Dimensions [4] and Google Scholar [5] as of 28 September 2020. We searched using the following keywords: "COVID-19", "COVID-19 and Machine Learning" and "COVID-19 and Data Science". The search filter includes published articles, preprints, edited books, monographs, proceedings and chapters.
Figure 2
A data science landscape for SARS-CoV-2 and COVID-19 studies. Many different technologies produce a large quantity of data related to patients at different scales (e.g. molecular data, medical images, clinical data and epidemiological data). The accumulation of these data is the prerequisite for a substantial rise of data science approaches (e.g. deep learning and classical data mining) that often integrate existing data stored in databases or a priori knowledge (e.g. domain experts or ontologies). Such approaches produce new information about molecular interactions, phylogenetic analysis, in silico design of drugs or healthcare management decisions. The output may guide the execution of novel experiments, closing the loop of the whole process.
Figure 3
The trends of COVID-19-related research articles on five major topics (OMICS, interactome, chest imaging, epidemiology and drug repurposing) based on the search hits from Google Scholar as of 28 September 2020. The search filter includes published articles, preprints, edited books, monographs, proceedings and chapters.
Figure 4
Major phases of the data science pipeline towards decision making and analysis. Data are initially collected and integrated from many sources. They are then pre-processed to filter out uninformative or possibly misleading values (e.g. outliers or noise). Next, existing models are used to explain the data, extract relevant patterns describing them or predict associations. Finally, results need to be interpreted and explained by domain experts. Each step of the analysis may generate corrections or refinements that are applied to preceding steps.
Figure 5
Omics data generation and data analysis workflow. Fragments of nucleic acid sequences of the virus, extracted from the host organism, are used as input for data processing algorithms. The main goals of these analyses are (i) analysis of genetic variants, (ii) analysis of genomes of viruses infecting different species, (iii) prediction of protein interactions and the interactome and (iv) determination of the structure and dynamics of viral proteins.
Figure 6
Data integration process to build a host–SARS-CoV-2 interactome graph. The building of the integrated host–viral interactome starts with the analysis of the viral genome. Then the viral proteins and their interactions with host proteins are determined. The determination of such interactions is often performed by integrating experimental data with knowledge extracted from the literature. Furthermore, protein structures are also predicted. All of this information (structures, virus–host interactions and viral interactions) is integrated by using heterogeneous networks. The final product of the process is an interactome.
Figure 7
Drug repurposing process. The process is based on the integration of molecular data and drug-disease associations. The analysis is often performed by deep-learning or network embedding. The output list of candidate drugs is then confirmed via wet-lab experiments and clinical trials.
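
The pipeline phases named in the Figure 4 caption (collect and integrate, pre-process, model, interpret) can be illustrated with a minimal Python sketch. This is not the authors' method: the function names, the median-absolute-deviation outlier rule, the least-squares trend line and the toy numbers are all assumptions chosen only to make the phases concrete.

```python
from statistics import median


def integrate(*sources):
    """Phase 1: merge records from several sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def drop_outliers(values, k=3.0):
    """Phase 2: discard values further than k median absolute deviations
    from the median (an assumed filtering rule, one of many possible)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]


def fit_line(ys):
    """Phase 3: least-squares line y = intercept + slope * x,
    with x taken as the position 0, 1, 2, ... of each retained value."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy measurements from two sources; 500 is an injected outlier.
source_a = [10, 12, 14]
source_b = [16, 500, 18, 20]

clean = drop_outliers(integrate(source_a, source_b))
intercept, slope = fit_line(clean)

# Phase 4: the fitted parameters are handed to a domain expert to interpret.
print(f"retained={clean} intercept={intercept:.1f} slope={slope:.1f}")
```

If the expert's reading suggests the filter was too strict or too loose, `k` can be adjusted and the earlier phases rerun, which mirrors the feedback loop the caption describes.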

References

    1. Apostolopoulos ID, Mpesiana TA. Covid-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Phys Eng Sci Med 2020;1.
    2. Bianchi M, Benvenuto D, Giovanetti M, et al. SARS-CoV-2 envelope and membrane proteins: structural differences linked to virus characteristics? Biomed Res Int 2020;2020.
    3. Effenberger M, Kronbichler A, Shin JI, et al. Association of the COVID-19 pandemic with internet search volumes: a Google Trends™ analysis. Int J Infect Dis 2020;95:192–97.
    4. Hook DW, Porter SJ, Herzog C. Dimensions: building context for search and evaluation. Front Res Metr Anal 2018;3:23.
    5. Noruzi A. Google Scholar: the new generation of citation indexes. Libri 2005;55(4):170–80.