NPJ Digit Med. 2021 Jul 23;4(1):116.
doi: 10.1038/s41746-021-00488-3.

Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies

Danqing Xu et al. NPJ Digit Med.

Abstract

Labeling clinical data from electronic health records (EHR) in health systems requires extensive expert knowledge and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that, relative to existing approaches, the proposed methods have higher prediction accuracy, better identify phenotypic features relevant to the disease under consideration, perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel, which includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases in providing a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. ROC curves for cases vs. controls for six phenotypes.
ROC curves of six quantitative disease risk scores along with their AUROCs for a CKD (cases including G1, G2, G3a/b, and G4 stages), b CAD, c T2D, d HF, e Dementia, and f GERD. Quantitative disease risk scores are derived based on all phecodes (PheRS, LPC, and PheNorm), or pre-selected feature phecodes (PheRS.SEL, LPC.SEL, and PheNorm.SEL).
Fig. 2. Quantitative disease risk scores vs. CKD G-staging.
Boxplots of quantitative disease risk scores a PheRS, b PheRS.SEL, c LPC, d LPC.SEL, e PheNorm, and f PheNorm.SEL. Quantitative disease risk scores are derived based on all phecodes (PheRS, LPC, PheNorm), or 110 pre-selected CKD feature phecodes (PheRS.SEL, LPC.SEL, and PheNorm.SEL). The center line, lower and upper bounds of the box represent the median, first quartile (Q1, or 25th percentile), and third quartile (Q3, or 75th percentile) of the data, respectively. The whisker is drawn up (down) to the largest (smallest) observed point from the data that falls within 1.5 times the interquartile range (= Q3 − Q1) above (below) the Q3 (Q1).
Fig. 3. Distribution of final LPC scores for six phenotypes in the test set.
LPC risk scores are derived based on all phecodes. Estimated density and distribution of LPC risk scores in cases vs. controls for a CKD (cases including G1, G2, G3a/b, and G4 stages), b CAD, c T2D, d HF, e Dementia, and f GERD. For each phenotype: left, distribution of LPC risk scores in the test set; middle, LPC risk score percentiles among cases vs. controls; right, case prevalence in 60 bins according to the percentiles of LPC risk scores. The center line, lower and upper bounds of the box represent the median, first quartile (Q1, or 25th percentile), and third quartile (Q3, or 75th percentile) of the data, respectively. The whisker is drawn up (down) to the largest (smallest) observed point from the data that falls within 1.5 times the interquartile range (= Q3 − Q1) above (below) the Q3 (Q1).
Fig. 4. Distribution of CKD LPC risk scores in the test set vs. test set + individuals with unknown status.
Estimated density and distribution of LPC risk scores in cases vs. controls in a the CKD test set and b the CKD test set with unknown-status individuals added. LPC risk scores are derived based on all phecodes. Left, distribution of LPC risk scores; middle, LPC risk score percentiles among cases vs. controls; right, the prevalence of the phenotype in 60 bins according to the percentiles of LPC risk scores. The center line, lower and upper bounds of the box represent the median, first quartile (Q1, or 25th percentile), and third quartile (Q3, or 75th percentile) of the data, respectively. The whisker is drawn up (down) to the largest (smallest) observed point from the data that falls within 1.5 times the interquartile range (= Q3 − Q1) above (below) the Q3 (Q1).
Fig. 5. Weights for all phecodes used to build PheRS and LPC for six phenotypes.
Scatter plots of weights for a CKD (cases including G1, G2, G3a/b, and G4 stages), b CAD, c T2D, d HF, e Dementia, and f GERD. The weights for the case-defining and pre-selected phecodes are highlighted.
Fig. 6. Exome-wide significant gene-based test results for 107 autosomal genes on the eMERGE-seq panel using LPC with pre-selected phecodes for six diseases.
Results are shown for those phenotypes and ethnic groups with at least one exome-wide significant result: a CKD and European, b CAD and European, c HF and European, d HF and African American, e CKD and Asian. The horizontal line corresponds to the exome-wide significance level.
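
The captions of Figs. 2 to 4 describe two plotting conventions in words: boxplot whiskers that stop at the most extreme observed point within 1.5 times the interquartile range of the box, and case prevalence computed in 60 bins ordered by risk score percentile. The following Python sketch illustrates both under stated assumptions; it is not the authors' code. The function names `box_stats` and `prevalence_by_percentile_bin` are hypothetical, the quartile interpolation rule is an assumption (the captions do not specify one), and bins are assumed to be formed by rank so that they are as close to equal in size as possible.

```python
from statistics import quantiles


def box_stats(values):
    """Median, quartiles, and whisker ends under the 1.5 * IQR rule."""
    data = sorted(values)
    # Assumption: linear interpolation between order statistics ("inclusive").
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    lower_fence = q1 - 1.5 * iqr
    # Whiskers end at the most extreme observed points inside the fences,
    # not at the fences themselves.
    whisker_high = max(v for v in data if v <= upper_fence)
    whisker_low = min(v for v in data if v >= lower_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": whisker_low,
        "whisker_high": whisker_high,
    }


def prevalence_by_percentile_bin(scores, is_case, n_bins=60):
    """Fraction of cases in each of n_bins rank-based score bins."""
    n = len(scores)
    # Sort individuals by score; ties keep their input order (an assumption).
    order = sorted(range(n), key=lambda i: scores[i])
    totals = [0] * n_bins
    cases = [0] * n_bins
    for rank, i in enumerate(order):
        b = rank * n_bins // n  # bin index from the individual's rank
        totals[b] += 1
        cases[b] += int(is_case[i])
    # Empty bins (possible when n < n_bins) are reported as None.
    return [c / t if t else None for c, t in zip(cases, totals)]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
    toy_scores = list(range(120))
    toy_cases = [s >= 60 for s in toy_scores]
    print(prevalence_by_percentile_bin(toy_scores, toy_cases)[28:32])
```

In the toy boxplot example the value 30 lies beyond the upper fence, so the upper whisker stops at 9 and 30 would be drawn as an outlier. In the toy prevalence example the per-bin case fraction jumps from 0 to 1 at the middle bin, which is the idealized version of the monotone prevalence trend the right-hand panels are meant to show.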

