2022 Jan 29;29(3):546-552.
doi: 10.1093/jamia/ocab260.

Curating a longitudinal research resource using linked primary care EHR data: a UK Biobank case study

Philip Darke et al. J Am Med Inform Assoc.

Abstract

Primary care EHR data are often of clinical importance to cohort studies; however, they require careful handling. Challenges include determining the periods during which EHR data were collected. Participants are typically censored when they deregister from a medical practice, but cohort studies aim to follow participants longitudinally, including those who change practice. Using UK Biobank as an exemplar, we developed methodology to infer continuous periods of data collection and maximize follow-up in longitudinal studies. This resulted in longer follow-up for around 40% of participants with multiple registration records (mean increase of 3.8 years from the first study visit). The approach did not sacrifice phenotyping accuracy when comparing agreement between self-reported and EHR data. A diabetes mellitus case study illustrates how the algorithm supports longitudinal study design and provides further validation. We use UK Biobank data; however, the tools provided can be used for other conditions and studies with minimal alteration.

Keywords: diabetes mellitus; electronic health records; longitudinal studies; medical record linkage; phenotype.

Figures

Figure 1.
Common issues in EHR data collection illustrated with synthetic participant data. These resemble realistic participant types, for example, around 70% of UK Biobank participants have data outside of periods of practice registration. Example 1—individual registered with a practice at birth that subsequently adopted an EHR system in the 1990s (prior records are paper-based). Example 2—individual registered with a practice in 1999 but records are also held from a previous period of registration with another practice. Example 3—multiple periods of registration are available from different practices and/or data providers. Example 4—a combination of the above issues. The boxed areas illustrate the inferred periods of data collection using our algorithm.
Figure 2.
Application of our algorithm to determine periods of complete EHR data collection. The example participant has multiple periods of registration and data outside of registration periods. The boxed areas are the inferred periods of data collection. Further details are included in the Supplementary Materials (Algorithm A1 and Supplementary Figure S1).
Figure 3.
Example output from the longitudinal phenotyping tool for a synthetic participant. Our algorithm was used to identify periods of complete data collection (top panel). Periods of nondiabetic hyperglycemia (prediabetes), type 2 diabetes, and remission were identified. Periods of medication and biomarkers are also shown. We phenotyped periods of complete data collection to reduce the risk of inaccurately identifying the date of incidence of diabetes. Similar phenotyping approaches using linked EHR data can be used to enforce study criteria or identify more complex endpoints.
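
The inferred periods of data collection shown in Figures 1 and 2 combine several registration records into continuous spans. The authors' method is given in their Supplementary Materials (Algorithm A1) and is not reproduced here. The following is only a minimal Python sketch of the general idea, assuming that registration periods which overlap, or are separated by no more than a chosen gap, can be treated as one continuous period. The function names, the `max_gap_days` parameter, and the example dates are hypothetical and do not come from the paper.

```python
from datetime import date
from typing import List, Optional, Tuple

# A period is an inclusive (start, end) pair of dates.
Period = Tuple[date, date]


def merge_periods(periods: List[Period], max_gap_days: int = 0) -> List[Period]:
    """Merge periods that overlap or lie within max_gap_days of each other.

    Assumption for illustration only: a short gap between two registrations
    is treated as continuous data collection.
    """
    merged: List[Period] = []
    for start, end in sorted(periods):
        if merged and (start - merged[-1][1]).days <= max_gap_days:
            # Extend the previous period rather than starting a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def follow_up_end(periods: List[Period], visit: date,
                  max_gap_days: int = 0) -> Optional[date]:
    """Return the end of the merged period containing the visit date, if any."""
    for start, end in merge_periods(periods, max_gap_days):
        if start <= visit <= end:
            return end
    return None


# Hypothetical participant who changed practice in mid-2005.
registrations = [
    (date(1999, 1, 1), date(2005, 6, 30)),
    (date(2005, 7, 1), date(2012, 3, 1)),
    (date(2015, 1, 1), date(2017, 1, 1)),
]
continuous = merge_periods(registrations, max_gap_days=1)
```

In this sketch, censoring at the first deregistration would end follow-up for a 2004 study visit in June 2005, whereas the merged period extends it to March 2012. The later 2015 registration stays separate because the gap before it exceeds the tolerance.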

