Comput Methods Programs Biomed. 2021 Nov:211:106394.

doi: 10.1016/j.cmpb.2021.106394. Epub 2021 Sep 6.

A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data

Affiliations

¹ Botnar Research Centre, Centre for Statistics in Medicine, Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, UK.
² Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.
³ Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.
⁴ Fundació Institut Universitari per a la recerca a ľAtenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain.
⁵ Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea.
⁶ Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea; Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea.
⁷ Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.
⁸ Departments of Biomathematics, University of California, Los Angeles, USA.
⁹ Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Republic of Korea.
¹⁰ Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA. Electronic address: jreps@its.jnj.com.

PMID: 34560604
PMCID: PMC8420135
DOI: 10.1016/j.cmpb.2021.106394

Sara Khalid et al. Comput Methods Programs Biomed. 2021 Nov.

Abstract

Background and objective: In response to the ongoing COVID-19 pandemic, several prediction models were rapidly developed with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).

Methods: We show step-by-step how to implement the analytics pipeline for the question: 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.
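The prediction question above combines a target population (patients hospitalized with COVID-19), an outcome (death), and a time-at-risk window (0 to 30 days after hospitalization). The following is a minimal Python sketch of how such a window could be turned into outcome labels. It is an illustration only, not the OHDSI tooling: the `Hospitalization` record, the `died_in_window` helper, and the example dates are hypothetical, and treating both ends of the window as inclusive is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Hospitalization:
    """One target-cohort entry (hypothetical structure, not the OMOP CDM schema)."""
    person_id: int
    admission_date: date        # index date: start of the COVID-19 hospitalization
    death_date: Optional[date]  # None if no death is recorded


def died_in_window(h: Hospitalization, start_day: int = 0, end_day: int = 30) -> bool:
    """Return True if death falls within [index + start_day, index + end_day].

    Both window ends are treated as inclusive, which is an assumption.
    """
    if h.death_date is None:
        return False
    window_start = h.admission_date + timedelta(days=start_day)
    window_end = h.admission_date + timedelta(days=end_day)
    return window_start <= h.death_date <= window_end


# Made-up example records, not study data.
cohort = [
    Hospitalization(1, date(2020, 4, 1), date(2020, 4, 20)),  # death on day 19
    Hospitalization(2, date(2020, 4, 3), None),               # no recorded death
    Hospitalization(3, date(2020, 4, 5), date(2020, 6, 1)),   # death on day 57
]

# Map each person to a binary outcome label for the 0-30 day window.
labels = {h.person_id: died_in_window(h) for h in cohort}
print(labels)
```

Only the first patient is labeled as having the outcome: the second has no recorded death, and the third died after the window closed.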

Results: Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.
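Discrimination and calibration, the two performance aspects reported above, can be illustrated with a short Python sketch. It assumes discrimination is summarized by the area under the ROC curve (AUC) and uses an observed-to-expected ratio as one simple calibration summary; the abstract does not name the exact metrics, the function names are hypothetical, and the numbers are invented rather than taken from the study.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the probability that a random case is scored above a random
    non-case, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def observed_to_expected(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Observed outcome count divided by the sum of predicted risks.

    A value near 1 suggests the average predicted risk matches the
    observed outcome rate; it does not assess calibration across risk strata.
    """
    expected = sum(y_score)
    if expected == 0:
        raise ValueError("sum of predicted risks is zero")
    return sum(y_true) / expected


# Invented labels and predicted risks for four patients.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(round(auc(y_true, y_score), 3))                   # 0.75
print(round(observed_to_expected(y_true, y_score), 3))  # 1.212
```

The pairwise AUC is quadratic in the number of patients, which is fine for a sketch but would be replaced by a rank-based computation on cohorts of the size described here.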

Conclusion: Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable the rapid development of reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world.

Keywords: COVID-19; Data harmonization; Data quality control; Distributed data network; Machine learning; Risk prediction.

Conflict of interest statement

Declaration of Competing Interest: CB, MJS, AGS, and JMR are employees of Janssen Research & Development and shareholders of Johnson & Johnson.

Figures

**Fig. 1**
The OHDSI distributed data network. As of November 2020, it includes 22 sites spread across North America, Europe, and Asia that have COVID-19 patient data mapped to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM).

**Fig. 2**
An overview of the OHDSI analytics pipeline for patient-level prediction modelling. Orange boxes represent study-specific input or output, blue boxes represent non-study-specific input, output, or OHDSI software tools.

**Fig. 3**
The step-by-step process for mapping data sources to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM), using OHDSI software tools. ETL: Extraction, Transformation and Load; DQD: Data Quality Dashboard.

**Fig. 4**
Prediction problem specification in OHDSI.

**Fig. 5**
A snapshot of the CohortDiagnostics tool for assessing phenotypes. Here, Optum SES refers to the Optum Claims database.

**Fig. 6**
A snapshot of the ATLAS tool for prediction model development.

**Fig. 7**
Calibration performance for internal validation of the L1-regularized logistic regression model for predicting 30-day death outcome in patients hospitalized with COVID-19 on Optum Claims data, overall (left panels) and by age and gender (right panels).

**Fig. 8**
Calibration performance for external validation of the L1-regularized logistic regression model for predicting 30-day death outcome in patients hospitalized with COVID-19 on SIDIAP data, overall (left panels) and by age and gender (right panels).

**Fig. 9**
A snapshot of the viewer dashboard. It contains the model summary, model performance, and all model settings.

**Fig. 10**
A snapshot of a model table in the viewer dashboard. It contains the complete model specification, including the intercept term and the coefficient value for each covariate included in the final model.
