Genet Med. 2019 Jul;21(7):1585-1593.

doi: 10.1038/s41436-018-0381-1. Epub 2018 Dec 5.

ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis

Cole A Deisseroth¹, Johannes Birgmeier¹, Ethan E Bodle², Jennefer N Kohler³, Dena R Matalon², Yelena Nazarenko⁴, Casie A Genetti⁵, Catherine A Brownstein⁵, Klaus Schmitz-Abe⁵, Kelly Schoch⁶, Heidi Cope⁶, Rebecca Signer⁷; Undiagnosed Diseases Network; Julian A Martinez-Agosto⁷ ⁸ ⁹, Vandana Shashi⁶, Alan H Beggs⁵, Matthew T Wheeler³ ¹⁰, Jonathan A Bernstein¹¹, Gill Bejerano¹² ¹³ ¹⁴ ¹⁵

Affiliations

¹ Department of Computer Science, Stanford University, Stanford, CA, USA.
² Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA.
³ Stanford Center for Undiagnosed Diseases, Stanford, CA, USA.
⁴ Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
⁵ The Manton Center for Orphan Disease Research, Division of Genetics and Genomics, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA.
⁶ Department of Pediatrics, Duke University School of Medicine, Durham, NC, USA.
⁷ Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA.
⁸ Department of Pediatrics, Division of Medical Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA.
⁹ Department of Psychiatry, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA.
¹⁰ Department of Medicine, Stanford School of Medicine, Stanford, CA, USA.
¹¹ Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA. Jon.Bernstein@stanford.edu.
¹² Department of Computer Science, Stanford University, Stanford, CA, USA. bejerano@stanford.edu.
¹³ Department of Pediatrics, Stanford School of Medicine, Stanford, CA, USA. bejerano@stanford.edu.
¹⁴ Department of Biomedical Data Science, Stanford University, Stanford, CA, USA. bejerano@stanford.edu.
¹⁵ Department of Developmental Biology, Stanford University, Stanford, CA, USA. bejerano@stanford.edu.

PMID: 30514889
PMCID: PMC6551315
DOI: 10.1038/s41436-018-0381-1

Cole A Deisseroth et al. Genet Med. 2019 Jul.

Abstract

Purpose: Diagnosing monogenic diseases facilitates optimal care, but can involve the manual evaluation of hundreds of genetic variants per case. Computational tools like Phrank expedite this process by ranking all candidate genes by their ability to explain the patient's phenotypes. To use these tools, busy clinicians must manually encode patient phenotypes from lengthy clinical notes. With 100 million human genomes estimated to be sequenced by 2025, a fast alternative to manual phenotype extraction from clinical notes will become necessary.

Methods: We introduce ClinPhen, a fast, high-accuracy tool that automatically converts clinical notes into a prioritized list of patient phenotypes using Human Phenotype Ontology (HPO) terms.

Results: ClinPhen shows superior accuracy and 20× speedup over existing phenotype extractors, and its novel phenotype prioritization scheme improves the performance of gene-ranking tools.

Conclusion: While a dedicated clinician can process 200 patient records in a 40-hour workweek, ClinPhen does the same in 10 minutes. Compared with manual phenotype extraction, ClinPhen saves an additional 3-5 hours per Mendelian disease diagnosis. Providers can now add ClinPhen's output to each summary note attached to a filled testing laboratory request form. ClinPhen makes a substantial contribution to improvements in efficiency critically needed to meet the surging demand for clinical diagnostic sequencing.
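The throughput comparison in the Conclusion can be made explicit with simple division. The short sketch below only rearranges the figures quoted above (200 records, a 40-hour workweek, 10 minutes); the per-record times and the ratio are derived here and are not stated in the abstract, and the variable names are illustrative.

```python
# Figures quoted in the Conclusion.
records = 200
manual_minutes = 40 * 60      # one 40-hour workweek, in minutes
clinphen_minutes = 10

# Derived values (not stated in the abstract).
manual_min_per_record = manual_minutes / records            # 12.0 minutes
clinphen_sec_per_record = clinphen_minutes * 60 / records   # 3.0 seconds
ratio = manual_minutes / clinphen_minutes                   # 240.0

print(manual_min_per_record, clinphen_sec_per_record, ratio)
# → 12.0 3.0 240.0
```

This ratio compares ClinPhen with manual extraction; the 20× speedup quoted in the Results is a separate comparison against other automatic extractors.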

Keywords: Mendelian disease diagnosis; medical genetics; natural language processing; prioritized disease phenotypes.

Conflict of interest statement

DISCLOSURE

The authors declare no conflicts of interest.

Figures

**Fig. 1. Steps to diagnose a patient with a Mendelian disease using automated gene-ranking algorithms.**
The patient’s genotypic information is encoded using standard formats (variant call format [VCF] file, candidate causative gene list) and a list of patient phenotypes encoded as ontology terms. Extensive tool support exists for obtaining candidate causative variants and genes from an exome sequence. Tool support for obtaining an appropriate list of encoded patient phenotypes from the patient’s clinical notes is limited. Encoded phenotypes are currently acquired by manually reading through the patient’s clinical notes and recording the phenotypes found as their IDs in a phenotype ontology. We introduce ClinPhen, a tool that automates phenotype extraction from clinical notes, optimized to accelerate patient diagnosis. *SNV* single-nucleotide variant.

**Fig. 2. ClinPhen sentence analysis process.**
ClinPhen splits the clinical notes into sentences, and those sentences into subsentences. It then finds phenotypes whose synonyms appear in the subsentences. A high-precision, high-sensitivity, rule-based natural language processing system decides which phenotypes correspond to true mentions and which are false positives. Because the third sentence contains the flag word “father,” for instance, it is assumed that this sentence does not refer to the patient, and any phenotype synonyms found in the sentence will not be associated with the patient. ClinPhen sorts the identified phenotypes first by how many times they appeared in the set of notes (descending), then by the index of the first subsentence in which they were found (ascending), and then by Human Phenotype Ontology (HPO) ID (ascending).
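The sort order described in this caption can be sketched as a three-key sort. This is a minimal illustration under stated assumptions, not ClinPhen's actual code: the `Mention` record, the `prioritize` function, and the sample data are hypothetical, and the HPO IDs serve only as sample strings.

```python
from collections import namedtuple

# Hypothetical record: one phenotype mention found in one subsentence.
Mention = namedtuple("Mention", ["hpo_id", "subsentence_index"])

def prioritize(mentions):
    """Order HPO IDs by mention count (descending), then by the index of the
    first subsentence containing them (ascending), then by HPO ID (ascending)."""
    counts = {}
    first_seen = {}
    for m in mentions:
        counts[m.hpo_id] = counts.get(m.hpo_id, 0) + 1
        previous = first_seen.get(m.hpo_id, m.subsentence_index)
        first_seen[m.hpo_id] = min(previous, m.subsentence_index)
    # HPO IDs are zero-padded, so string order matches numeric order.
    return sorted(counts, key=lambda h: (-counts[h], first_seen[h], h))

# Illustrative mentions (assumed data, not taken from the paper).
mentions = [
    Mention("HP:0001250", 4),
    Mention("HP:0001263", 1),
    Mention("HP:0001250", 7),
    Mention("HP:0000252", 1),
]
ranked = prioritize(mentions)
print(ranked)
# → ['HP:0001250', 'HP:0000252', 'HP:0001263']
```

The twice-mentioned phenotype comes first even though it appears later in the notes; the remaining two tie on count and first position, so the HPO ID breaks the tie. Taking `ranked[:n]` would give the n highest-priority phenotypes.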

**Fig. 3. Performance of all extraction methods.**
(a) Comparison of the extractors’ precision and phenotype sensitivity (higher bars mean higher accuracy). We compared the average precision and sensitivity values of ClinPhen, cTAKES, and MetaMap, using patients from the Stanford test set as subjects, and the All set (all of the phenotypes found manually and confirmed by a physician to apply to the patient) as the correct phenotypes. The average (column) and 95% confidence interval (calculated using bootstrapping with 1000 trials) of the precision and sensitivity values across all patients are displayed for each extractor. ClinPhen achieves the highest average precision and sensitivity. (b) Causative gene-ranking performance of each gene-ranking tool when run with different numbers of phenotypes returned by ClinPhen (lower number means better causative gene rankings). ClinPhen was run on the clinical notes of the Stanford test set, and the gene-ranking tools were called with the patient’s genetic information and the n highest-priority (most-mentioned, first-occurring) extracted phenotypes, with n running from 1 to 100 inclusive. The average causative gene rank across all patients was taken for each phenotype count limit (n)/gene-ranking tool pairing. The better-performing gene-ranking algorithms rank the causative gene higher when run with a few (around 3) high-priority phenotypes than with all extracted phenotypes. (c) Phrank’s causative gene-ranking performance across all extraction methods (lower numbers mean better causative gene rankings). We compared the causative gene ranks obtained by running Phrank on the Stanford test set with various extracted sets of phenotypes (all manually found, physician-verified phenotypes [*All*] versus a subset of phenotypes considered by a physician to be useful for diagnosis [*Clinician*] versus automatically extracted phenotypes using various methods). Phrank ranks are sorted lowest to highest for each extractor. Phrank performs better when run with ClinPhen’s 3 highest-priority phenotypes (the most-mentioned, earliest-occurring phenotypes in a patient’s clinical notes) than when run with other phenotype sets, manually or automatically extracted. (d) Extractor runtime comparison on each patient (lower number means faster runtime). We measured the runtime of each extractor (ClinPhen, cTAKES, and MetaMap) on each patient’s clinical notes, in seconds. For each patient, we also measured the time three clinicians took to manually scan through the same notes read by the automatic extractors, and encode the phenotypes considered useful for diagnosis. Each data point is one patient whose clinical notes were scanned by one of the extractors (or clinicians). The horizontal position is the total number of words in the patient’s clinical notes. The vertical position is the time taken for the extractor to run on the notes (logarithmically scaled). While MetaMap’s runtime scales linearly and cTAKES’ runtime scales exponentially with the total length of the clinical notes, ClinPhen runs in near-constant time, and is 15–20× faster than the next fastest tool. All automatic extraction tools are much faster than manual extraction.
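The per-patient metrics and confidence intervals described for panel (a) can be sketched as follows. This is an illustration under assumptions rather than the authors' evaluation code: phenotypes are matched by exact HPO ID, the interval uses the percentile bootstrap (the caption states only that bootstrapping with 1000 trials was used), and the function names and sample values are hypothetical.

```python
import random

def precision_sensitivity(extracted, truth):
    """Precision and sensitivity of one patient's extracted phenotype set,
    assuming exact-match comparison of HPO IDs."""
    extracted, truth = set(extracted), set(truth)
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity

def bootstrap_ci(values, trials=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-patient
    values: resample patients with replacement, then take percentiles."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(trials))
    lower = means[int((1 - level) / 2 * trials)]
    upper = means[int((1 + level) / 2 * trials) - 1]
    return lower, upper

# Illustrative data only (not results from the paper).
p, s = precision_sensitivity(
    ["HP:0001250", "HP:0001263", "HP:0000252", "HP:0001252"],
    ["HP:0001250", "HP:0001263", "HP:0000486"],
)
per_patient_precision = [0.5, 0.75, 1.0, 0.6]
ci_low, ci_high = bootstrap_ci(per_patient_precision)
print(p, round(s, 3), ci_low, ci_high)
```

Fixing the seed makes the interval reproducible; a real evaluation would resample over all patients in the test set.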

**Fig. 4. Replication with patient data from three additional centers.**
The same test used to generate Fig. 3c (running Phrank on each patient’s data, given each extracted set of phenotypes, and then sorting the causative gene ranks) was performed using (a) Manton Center patients, (b) Duke Undiagnosed Disease Network (UDN) patients (optical character recognized [OCRed] without manual correction from PDF), and (c) University of California-Los Angeles (UCLA) UDN patients (which had a single consult clinical note per patient) to evaluate the performance of the automatic extractors (ClinPhen, cTAKES, MetaMap). ClinPhen (red line) outperforms other automatic phenotype extractors when its phenotypes are used as input to automatic gene-ranking algorithms (as it did with the Stanford test set).
