Bioinformatics. 2015 Jun 15;31(12):1981-7.
doi: 10.1093/bioinformatics/btv076. Epub 2015 Feb 4.

Application of clinical text data for phenome-wide association studies (PheWASs)

Scott J Hebbring et al. Bioinformatics.

Abstract

Motivation: Genome-wide association studies (GWASs) are effective for describing genetic complexities of common diseases. Phenome-wide association studies (PheWASs) offer an alternative and complementary approach to GWAS using data embedded in the electronic health record (EHR) to define the phenome. International Classification of Diseases, version 9 (ICD9) codes are used frequently to define the phenome, but using ICD9 codes alone misses other clinically relevant information from the EHR that can be used for PheWAS analyses and discovery.

Results: As an alternative to ICD9 coding, a text-based phenome was defined by 23 384 clinically relevant terms extracted from Marshfield Clinic's EHR. Five single nucleotide polymorphisms (SNPs) with known phenotypic associations were genotyped in 4235 individuals and associated across the text-based phenome. All five SNPs genotyped were associated with expected terms (P<0.02), most at or near the top of their respective PheWAS ranking. Raw association results indicate that text data performed equivalently to ICD9 coding and demonstrate the utility of information beyond ICD9 coding for application in PheWAS.
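
As an illustration of how word strings like those described in Fig. 1 might be enumerated, the minimal sketch below splits a phrase on whitespace and lists its unigrams through 4-grams. The function name `extract_ngrams` and the whitespace tokenization are assumptions made for this example; the abstract does not describe the authors' actual extraction or filtering pipeline.

```python
def extract_ngrams(phrase, max_n=4):
    """Return a dict mapping n (1..max_n) to the list of word n-grams in phrase.

    Hypothetical helper for illustration only: tokens are produced by a plain
    whitespace split, with no stemming, negation handling, or term filtering.
    """
    tokens = phrase.split()
    return {
        n: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        for n in range(1, max_n + 1)
    }


# The clinical phrase used as the example in Fig. 1.
ngrams = extract_ngrams("Patient has evidence of macular degeneration")

for n, strings in ngrams.items():
    print(n, strings)
```

For this six-word phrase the sketch yields six unigrams, five bigrams, four trigrams and three 4-grams, including the bigram "macular degeneration".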

Figures

Fig. 1. An example of unigrams, bigrams, trigrams and 4-grams extracted from the clinical phrase ‘Patient has evidence of macular degeneration’
Fig. 2. Flow diagram of the process taken to identify the 23 384 word strings used to define the text-based phenome
Fig. 3. Chart describing the number of unigrams, bigrams, trigrams and 4-grams in the text-based phenome separated by disease terms (white bars) and drug terms (black bars). Indicated are the numbers in each category
Fig. 4. Manhattan plot graphing −log(P-values) across the text-based phenome for the SNPs known to be associated with (A) ankylosing spondylitis, (B) MS, (C) atrial fibrillation, (D) triglyceride metabolism and (E) age-related macular degeneration. Highlighted are the relevant word strings associated with SNP genotype. See Supplementary Table S2 for all phenotypes with P-values less than 0.001
