Patient-Related Metadata Reported in Sequencing Studies of SARS-CoV-2: Protocol for a Scoping Review and Bibliometric Analysis
- PMID: 40262134
- PMCID: PMC12056431
- DOI: 10.2196/58567
Abstract
Background: There has been an unprecedented effort to sequence the SARS-CoV-2 virus and examine its molecular evolution. This effort has been facilitated by publicly accessible databases, such as GISAID (Global Initiative on Sharing All Influenza Data) and GenBank, which collectively hold millions of SARS-CoV-2 sequence records. Genomic epidemiology, however, seeks to go beyond phylogenetic analysis (the study of evolutionary relationships among biological entities) by linking genetic information to patient characteristics and disease outcomes, enabling a comprehensive understanding of transmission dynamics and disease impact. While these repositories include fields for patient-related metadata for a given sequence, these demographic and clinical details are seldom provided. The extent and quality of patient-related metadata in published sequencing studies remain unexplored.
Objective: Our review aims to quantitatively assess the extent and quality of patient-related metadata in papers reporting original whole genome sequencing of the SARS-CoV-2 virus and to analyze publication patterns using bibliometric analysis. We will also evaluate the efficacy and reliability of a machine learning classifier in identifying relevant papers for inclusion in the scoping review.
Methods: The National Institutes of Health's LitCovid collection will be used for the automated classification of papers that report depositing SARS-CoV-2 sequences in public repositories, while an independent search will be conducted in MEDLINE and PubMed Central for validation. Data extraction will be conducted using Covidence (Veritas Health Innovation Ltd). The extracted data will be synthesized and summarized to quantify the availability of patient metadata in the published literature of SARS-CoV-2 sequencing studies. For the bibliometric analysis, relevant data points, such as author affiliations, citation metrics, author keywords, and Medical Subject Headings terms, will be extracted.
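The validation step described above can be illustrated with a minimal sketch. It assumes that the screened results of the independent search serve as a reference set, and that classifier performance is summarized with precision, recall, and F1; the abstract does not state which metrics the authors use. The function name and the identifiers below are hypothetical.

```python
def validation_metrics(classifier_ids, reference_ids):
    """Compare classifier-selected paper IDs with a reference set.

    Assumption: the reference set comes from an independent, manually
    screened literature search. This is an illustration, not the
    authors' evaluation code.
    """
    predicted = set(classifier_ids)
    reference = set(reference_ids)
    true_pos = len(predicted & reference)  # papers found by both routes

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical identifiers, for illustration only
metrics = validation_metrics(["p1", "p2", "p3", "p4"], ["p2", "p3", "p5"])
print(metrics)
```

In this toy case, two of the four classifier-selected papers appear in the reference set, so precision is 0.5 and recall is 2/3. Papers the classifier finds that the search misses count against precision here, which may understate the classifier if the reference search is itself incomplete.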
Results: This study is expected to be completed in early 2025. Our classification model has been developed and we have classified publications in LitCovid published through February 2023. As of September 2024, papers through August 2024 are being prepared for processing. Screening is underway for validated papers from the classifier. Direct literature searches and screening of the results began in October 2024. We will summarize and narratively describe our findings using tables, graphs, and charts where applicable.
Conclusions: This scoping review will report findings on the extent and types of patient-related metadata reported in genomic viral sequencing studies of SARS-CoV-2, identify gaps in the reporting of patient metadata, and make recommendations for improving the quality and consistency of reporting in this area. The bibliometric analysis will uncover trends and patterns in the reporting of patient-related metadata, including differences in reporting based on study types or geographic regions. The insights gained from this study may help improve the quality and consistency of reporting patient metadata, enhancing the utility of sequence metadata and facilitating future research on infectious diseases.
Trial registration: OSF Registries osf.io/wrh95; https://doi.org/10.17605/OSF.IO/WRH95.
International Registered Report Identifier (IRRID): DERR1-10.2196/58567.
Keywords: COVID-19; GISAID; GenBank; SARS-CoV-2; genomic epidemiology; patient-related metadata; protocol; scoping review; sequence records.
©Karen O'Connor, Davy Weissenbacher, Amir Elyaderani, Ebbing Lautenbach, Matthew Scotch, Graciela Gonzalez-Hernandez. Originally published in JMIR Research Protocols (https://www.researchprotocols.org), 22.04.2025.
Conflict of interest statement
Conflicts of Interest: None declared.
Update of
- Patient-Related Metadata Reported in Sequencing Studies of SARS-CoV-2: Protocol for a Scoping Review and Bibliometric Analysis. medRxiv [Preprint]. 2024 Mar 5:2023.07.14.23292681. doi: 10.1101/2023.07.14.23292681. Update in: JMIR Res Protoc. 2025 Apr 22;14:e58567. doi: 10.2196/58567. PMID: 37503241.