Sci Data. 2018 Nov 27:5:180273.
doi: 10.1038/sdata.2018.273.

Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records

Casey N Ta et al. Sci Data. 2018.

Abstract

Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center's Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013–2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency and the observed-expected frequency ratio is informative of associations between clinical concepts and is useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
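
The three quantities named in the abstract (EHR prevalence, relative frequency, and the observed-expected frequency ratio) can be illustrated with a minimal sketch. The formulas below are assumptions based on the usual definitions, not taken from this record: prevalence as concept count over patient count, relative frequency as pair count over the count of one base concept, and the observed-expected ratio as the natural log of the pair count over the count expected if the two concepts were independent. All function names and the example numbers are hypothetical.

```python
import math


def prevalence(concept_count, n_patients):
    """Fraction of patients with at least one record of the concept."""
    return concept_count / n_patients


def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the other one."""
    return pair_count / base_count


def expected_pair_count(count_a, count_b, n_patients):
    """Pair count expected if concepts A and B occurred independently."""
    return count_a * count_b / n_patients


def ln_observed_expected_ratio(pair_count, count_a, count_b, n_patients):
    """Natural log of observed over expected pair count (assumed definition).

    Values above 0 suggest the pair co-occurs more often than chance.
    """
    expected = expected_pair_count(count_a, count_b, n_patients)
    return math.log(pair_count / expected)


# Hypothetical counts, chosen only for illustration.
N_PATIENTS = 1_000_000
COUNT_A = 10_000      # patients with concept A
COUNT_B = 20_000      # patients with concept B
PAIR_COUNT = 1_000    # patients with both A and B

print(prevalence(COUNT_A, N_PATIENTS))
print(relative_frequency(PAIR_COUNT, COUNT_A))
print(ln_observed_expected_ratio(PAIR_COUNT, COUNT_A, COUNT_B, N_PATIENTS))
```

Note that relative frequency is asymmetric: dividing by the count of A answers a different question than dividing by the count of B, whereas the observed-expected ratio treats both concepts symmetrically.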

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. Columbia Open Health Data (COHD) workflow.
Overall workflow of COHD analysis and application-programming interface (API). We analyzed an Observational Medical Outcomes Partnership (OMOP) database created from Columbia University Irving Medical Center (CUIMC) and New York Presbyterian’s (NYP) clinical data warehouse. We extracted conditions, drugs, procedures, and demographics to calculate prevalence and co-occurrence frequencies. The lifetime dataset used all data while the 5-year dataset only used data from 2013–2017. For patient protection, we excluded concepts with counts ≤ 10 and perturbed the remaining counts using Poisson randomization. The resulting data are stored in a MySQL database and served publicly via the COHD Representational State Transfer (REST) API.
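
The privacy step in this workflow (drop concepts with counts ≤ 10, then perturb the rest) might be sketched as follows. This is not the authors' code: it assumes each released count is drawn from a Poisson distribution whose mean is the true count, and it uses a plain standard-library sampler (Knuth's multiplication method) instead of whatever the original pipeline used. The names `poisson_sample`, `protect_counts`, and the example concepts are hypothetical.

```python
import math
import random

MIN_COUNT = 10  # concepts with count <= 10 are excluded, per the caption


def poisson_sample(lam, rng):
    """Draw one value from Poisson(lam) using Knuth's multiplication method.

    Large rates are split into chunks of at most 500 so that exp(-chunk)
    does not underflow; a sum of independent Poisson draws is itself Poisson.
    """
    total = 0
    remaining = float(lam)
    while remaining > 0:
        chunk = min(remaining, 500.0)
        threshold = math.exp(-chunk)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                break
            k += 1
        total += k
        remaining -= chunk
    return total


def protect_counts(true_counts, seed=None):
    """Exclude rare concepts, then replace each count with a Poisson draw.

    Assumption: the Poisson mean equals the true count.
    """
    rng = random.Random(seed)
    released = {}
    for concept, count in true_counts.items():
        if count <= MIN_COUNT:
            continue  # rare concept: never released
        released[concept] = poisson_sample(count, rng)
    return released


# Hypothetical true counts for three concepts.
example = {"concept_a": 2500, "concept_b": 42, "concept_c": 7}
print(protect_counts(example, seed=0))
```

In this sketch exclusion is applied to the true counts before perturbation, matching the order given in the caption; `concept_c` therefore never appears in the output, whatever the random draw.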
Figure 2. Annual total counts and counts per capita per domain.
Total counts (blue) and counts per capita (orange) of a) condition occurrences, b) drug exposures, c) procedure occurrences, and d) people per year.
Figure 3. Annual demographics prevalence rates.
EHR prevalence per year of a) sex, b) ethnicity, and c) race. For visual clarity, panel c excludes races with EHR prevalence < 0.001.
Figure 4. Effect of Poisson randomization.
Absolute percentage difference between Poisson randomized and true counts vs true counts for single concept counts in the lifetime dataset.
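
The metric plotted in Figure 4 can be written out as a short sketch. The formula is the standard absolute percentage difference, which is assumed here to match the figure. The second function relies on a general property of the Poisson distribution (standard deviation equal to the square root of the mean) and again assumes the mean equals the true count; it indicates why the relative perturbation shrinks as counts grow. Function names are hypothetical.

```python
import math


def absolute_percentage_difference(randomized_count, true_count):
    """|randomized - true| / true, expressed as a percentage."""
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    return abs(randomized_count - true_count) / true_count * 100.0


def expected_relative_sd_percent(true_count):
    """Poisson standard deviation relative to the mean, as a percentage.

    For Poisson(lam), sd = sqrt(lam), so sd / lam = 1 / sqrt(lam).
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    return 100.0 / math.sqrt(true_count)


# Typical relative noise at a few count sizes (hypothetical counts).
for count in (11, 100, 10_000, 1_000_000):
    print(count, round(expected_relative_sd_percent(count), 3))
```

Under these assumptions a count near the exclusion threshold is perturbed by roughly 30% on average scale, while a count of one million moves by about 0.1%.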

References

Data Citations

    1. Ta C. N., Dumontier M., Hripcsak G., Tatonetti N. P., Weng C. 2018. figshare. https://doi.org/10.6084/m9.figshare.c.4151252
