Stat Med. 2020 Mar 15;39(6):773-800.
doi: 10.1002/sim.8445. Epub 2019 Dec 20.

The emerging landscape of health research based on biobanks linked to electronic health records: Existing resources, statistical challenges, and potential opportunities

Lauren J Beesley et al. Stat Med.

Abstract

Biobanks linked to electronic health records provide rich resources for health-related research. With improvements in administrative and informatics infrastructure, the availability and utility of data from biobanks have dramatically increased. In this paper, we first aim to characterize the current landscape of available biobanks and to describe specific biobanks, including their place of origin, size, and data types. The development and accessibility of large-scale biorepositories provide the opportunity to accelerate agnostic searches, expedite discoveries, and conduct hypothesis-generating studies of disease-treatment, disease-exposure, and disease-gene associations. Rather than designing and implementing a single study focused on a few targeted hypotheses, researchers can potentially use biobanks' existing resources to answer an expanded selection of exploratory questions as quickly as they can analyze them. However, there are many obvious and subtle challenges with the design and analysis of biobank-based studies. Our second aim is to discuss statistical issues related to biobank research such as study design, sampling strategy, phenotype identification, and missing data. We focus our discussion on biobanks that are linked to electronic health records. Some of the analytic issues are illustrated using data from the Michigan Genomics Initiative and UK Biobank, two biobanks with two different recruitment mechanisms. We summarize the current body of literature for addressing these challenges and discuss some standing open problems. This work complements and extends recent reviews about biobank-based research and serves as a resource catalog with analytical and practical guidance for statisticians, epidemiologists, and other medical researchers pursuing research using biobanks.

Keywords: Michigan Genomics Initiative; UK Biobank; biobanks; electronic health records; selection bias.

Figures

Figure 1: Flowchart of Study Planning, Design and Analysis
Figure 2: Boxplots of Ratio of PheWAS Code Prevalences in MGI vs. UK Biobank Across Phenome
Figure 3: Relationship between (a) Anxiety or (b) Heart Attack Diagnosis and Length of Follow-up within Age Strata in MGI. Plotted intervals indicate 95% confidence intervals for each proportion.
Figure 4: Impact of Selection Mechanism and Phenotype Misclassification on Estimated Association between Gender and Cancer Diagnosis in MGI. Plotted intervals indicate 95% confidence intervals.
Figure 5: Comparison of GWAS Results in MGI and UK Biobank for Selected Cancer Phenotypes. Each point represents a SNP identified as being related to the corresponding phenotype in the NHGRI-EBI GWAS catalog. The point location corresponds to the log-odds ratio association between the SNP and the phenotype of interest in MGI and UK Biobank. The two lines correspond to equality of the estimates and a fitted line to the points (excluding any outlying points with absolute log-OR greater than 0.6). “Spearman” indicates the Spearman correlation and “CCC” indicates Lin’s concordance correlation coefficient, which is a measure of agreement (with 1 being perfect agreement).
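
The Figure 5 caption reports Lin's concordance correlation coefficient (CCC) as a measure of agreement between the two sets of log-odds ratios. The following is a minimal sketch of the standard CCC formula, not the authors' code: the function name `lins_ccc` and the toy inputs are assumptions made for illustration, and the 1/n (population) moments follow Lin's original definition. Unlike a correlation, the CCC is penalized when one set of estimates is systematically shifted relative to the other.

```python
from statistics import fmean


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired values.

    Illustrative helper (hypothetical name). Uses 1/n moments:
    CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    mean_x, mean_y = fmean(x), fmean(y)
    # Population (1/n) variance of each series and their covariance.
    var_x = fmean([(a - mean_x) ** 2 for a in x])
    var_y = fmean([(b - mean_y) ** 2 for b in y])
    cov_xy = fmean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both series are the same constant")
    return 2 * cov_xy / denom


# Toy log-odds ratios (made-up numbers, not data from the paper).
identical = lins_ccc([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # perfect agreement
shifted = lins_ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])    # same trend, offset by 1
```

In the shifted example the two series are perfectly correlated, yet the CCC falls to 4/7 because the estimates disagree by a constant offset. This is why the caption describes CCC as a measure of agreement rather than association.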
