HGG Adv. 2021 Jul;2(3):100035.
doi: 10.1016/j.xhgg.2021.100035. Epub 2021 May 11.

A data-driven architecture using natural language processing to improve phenotyping efficiency and accelerate genetic diagnoses of rare disorders

Jignesh R Parikh et al. HGG Adv. 2021 Jul.

Abstract

Effective genetic diagnosis requires the correlation of genetic variant data with detailed phenotypic information. However, manual encoding of clinical data into machine-readable forms is laborious and subject to observer bias. Natural language processing (NLP) of electronic health records has great potential to enhance reproducibility at scale but suffers from idiosyncrasies in physician notes and other medical records. We developed methods to optimize NLP outputs for automated diagnosis. We filtered NLP-extracted Human Phenotype Ontology (HPO) terms to more closely resemble manually extracted terms and identified filter parameters across a three-dimensional space for optimal gene prioritization. We then developed a tiered pipeline that reduces manual effort by prioritizing smaller subsets of genes to consider for genetic diagnosis. Our filtering pipeline enabled NLP-based extraction of HPO terms to serve as a sufficient replacement for manual extraction in 92% of prospectively evaluated cases. In 75% of cases, the correct causal gene was ranked higher with our applied filters than without any filters. We describe a framework that can maximize the utility of NLP-based phenotype extraction for gene prioritization and diagnosis. The framework is implemented within a cloud-based modular architecture that can be deployed across health and research institutions.
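The abstract does not spell out how the tiers are arranged, so the following is only a minimal sketch of one way a tiered review could work, assuming each tier pairs a term filter with a short list size and that later tiers are consulted only when earlier ones yield no diagnosis. The names `tiered_review`, `toy_ranker`, `GENE_TERMS`, and the gene labels are hypothetical, the overlap-based ranker is a toy stand-in for a real gene prioritization tool, and the HPO-style identifiers are illustrative only.

```python
from typing import Callable, List, Optional, Tuple

TermFilter = Callable[[List[str]], List[str]]
GeneRanker = Callable[[List[str]], List[str]]


def tiered_review(
    terms: List[str],
    tiers: List[Tuple[TermFilter, int]],
    rank_genes: GeneRanker,
    is_diagnostic: Callable[[str], bool],
) -> Optional[Tuple[int, str]]:
    """Walk the tiers in order; return (tier number, gene) for the first hit."""
    for tier_number, (term_filter, top_k) in enumerate(tiers, start=1):
        # Each tier exposes only its top_k genes for manual consideration.
        shortlist = rank_genes(term_filter(terms))[:top_k]
        for gene in shortlist:
            if is_diagnostic(gene):
                return tier_number, gene
    return None  # no tier surfaced a diagnostic gene


# Toy stand-ins (all hypothetical): genes are scored by term overlap.
GENE_TERMS = {
    "GENE_A": {"HP:0001250", "HP:0001263"},
    "GENE_B": {"HP:0000118"},
    "GENE_C": {"HP:0001250"},
}


def toy_ranker(terms: List[str]) -> List[str]:
    scores = {gene: len(known & set(terms)) for gene, known in GENE_TERMS.items()}
    # Highest overlap first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda gene: (-scores[gene], gene))


def drop_generic(terms: List[str]) -> List[str]:
    # Assumed filter: discard one very generic term.
    return [t for t in terms if t != "HP:0000118"]


def keep_all(terms: List[str]) -> List[str]:
    return list(terms)


patient_terms = ["HP:0000118", "HP:0001250", "HP:0001263"]
tiers = [(drop_generic, 1), (keep_all, 3)]

hit_a = tiered_review(patient_terms, tiers, toy_ranker, lambda g: g == "GENE_A")
hit_b = tiered_review(patient_terms, tiers, toy_ranker, lambda g: g == "GENE_B")
print(hit_a, hit_b)
```

In this toy setup a reviewer inspects a single gene in the first tier and widens the list only if that gene is not diagnostic, which is the kind of effort reduction the abstract describes.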

Conflict of interest statement

Declaration of interests: J.R.P. is the owner and founder of J Square Labs LLC. J.R.P., T.D., P.M., J.R., and S.L. are current or former employees or consultants of Alexion Pharmaceuticals, Inc., and C.Y., R.G., T.W., M.M., A.B., and A.F. are current or former employees of Clinithink Ltd. J.R.P. has consulted for and received compensation from GNS Healthcare and TCB Analytics. J.R. is the owner and founder of Latent Strategies, LLC. A.H.B. has received funding from the NIH, MDA (USA), AFM Telethon, Alexion Pharmaceuticals, Inc., Audentes Therapeutics Inc., Dynacure SAS, and Pfizer Inc. He has consulted for and received compensation or honoraria from Asklepios BioPharmaceutical, Inc., Audentes Therapeutics, Biogen, F. Hoffmann-La Roche AG, GLG, Inc., Guidepoint Global, and Kate Therapeutics and holds equity in Ballard Biologics and Kate Therapeutics. P.B.A. is on the Clinical Advisory Board of Illumina Inc. and GeneDx. C.A.B. has consulted for, and received compensation or honoraria from, Q State Biosciences. All other authors declare no competing interests.

Figures

Figure 1
Study flow diagram. Schematic of the overall study design and analysis plan. For each patient in a training set of 52 patients, we employed uniform processes to collect Human Phenotype Ontology (HPO) terms extracted by natural language processing (NLP), manually extracted HPO terms, and exome sequencing (ES) data in the form of variant call files (VCFs). Manually extracted HPO terms were compared to NLP-extracted HPO terms per patient in the training set with respect to (1) frequency of use, (2) HPO term depth within the ontology, and (3) diversity of phenotypic abnormality classes captured, confirming significant differences across all three dimensions. Next, we established thresholds per dimension that were used to create filtered lists of NLP terms per patient. Exomiser was run on each of the filtered NLP term lists (in addition to the manual and unfiltered NLP lists for comparison) per patient, and performance per filter was evaluated using metrics such as area under the receiver operating characteristic curve (AUC) and sensitivity. Top-performing NLP filters were combined into a tiered pipeline, which was finally applied to and evaluated on a subsequently ascertained test set of 12 patients, whose data were collected using the same uniform processes described above.
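The per-dimension thresholding described in this caption can be illustrated with a short sketch. It assumes each NLP term already carries a usage-frequency percentile, an ontology depth, and a top-level phenotypic abnormality class; the `NlpTerm` record, the threshold values, and the direction of each cutoff (dropping very common and very shallow terms) are assumptions for illustration, not values reported in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NlpTerm:
    term_id: str                 # placeholder identifier, not a real HPO ID
    frequency_percentile: float  # assumed precomputed, 0 to 100
    depth: int                   # assumed precomputed distance from the ontology root
    top_level_class: str         # assumed precomputed phenotypic abnormality class


def filter_terms(
    terms: List[NlpTerm], max_frequency_percentile: float, min_depth: int
) -> List[NlpTerm]:
    """Keep terms that are neither too commonly used nor too shallow."""
    return [
        t
        for t in terms
        if t.frequency_percentile <= max_frequency_percentile and t.depth >= min_depth
    ]


def diversity(terms: List[NlpTerm]) -> int:
    """Count the distinct phenotypic abnormality classes covered by the terms."""
    return len({t.top_level_class for t in terms})


# Hypothetical patient term list.
terms = [
    NlpTerm("TERM_1", 95.0, 2, "Nervous system"),
    NlpTerm("TERM_2", 40.0, 6, "Nervous system"),
    NlpTerm("TERM_3", 10.0, 8, "Musculature"),
    NlpTerm("TERM_4", 60.0, 3, "Eye"),
]

kept = filter_terms(terms, max_frequency_percentile=80.0, min_depth=4)
print([t.term_id for t in kept], diversity(terms), diversity(kept))
```

Each filtered list produced this way would then be passed to the gene prioritization step in place of the unfiltered NLP output, so that filters can be compared against one another on the same patients.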
Figure 2
Performance of Exomiser using phenotypes extracted by manual curation versus natural language processing (NLP) among training set cases. (A) Receiver operating characteristic curves, with sensitivities noted at the specificities corresponding to the top 5, 10, and 20 ranked genes. (B) Box-and-whisker plots of the distribution of the ranks of the correct genes. Each data point in a distribution corresponds to a specific patient, with lines connecting the ranks of each patient across the two phenotype extraction methods to indicate whether the rank increased or decreased. The median and maximum (worst) ranks are noted adjacent to the corresponding values in the distributions. (C) Box-and-whisker plots of the distribution of the combined Exomiser scores for the correct gene per patient. Each data point in a distribution corresponds to a specific patient, with lines connecting the scores of each patient across the two phenotype extraction methods to indicate whether the score increased or decreased. The median scores are noted adjacent to the median values in the distributions.
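As a rough illustration of the rank-based summaries in this caption, the sketch below computes sensitivity at a top-k cutoff (the fraction of patients whose correct gene ranks within the top k) along with the median and worst rank. The rank values are invented for the example, and treating a missing rank as a miss is an assumption.

```python
from statistics import median
from typing import List, Optional


def top_k_sensitivity(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of patients whose correct gene ranks within the top k.

    A rank of None (correct gene not ranked at all) counts as a miss.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks of the correct gene, one entry per patient.
ranks = [1, 3, 7, 12, 25, None]

sensitivities = {k: top_k_sensitivity(ranks, k) for k in (5, 10, 20)}

# Median and worst rank over the patients whose correct gene was ranked.
ranked = [r for r in ranks if r is not None]
median_rank = median(ranked)
worst_rank = max(ranked)

print(sensitivities, median_rank, worst_rank)
```

Computing the same summaries for the manual and NLP term lists on the same patients gives the paired comparison that the connecting lines in panels B and C depict.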
Figure 3
Comparing features of HPO terms identified by NLP alone versus terms identified by both manual- and NLP-based extraction. Box-and-whisker plots of (A) distribution of mean frequency percentiles of HPO terms, (B) distribution of mean depth of HPO terms, and (C) distribution of diversity of HPO terms. Each data point in a distribution corresponds to a specific patient in the training set, with lines connecting values of the respective summary feature per patient across the two NLP term subsets to indicate whether the value increased or decreased. Mean values per distribution, the difference in means, and associated p-values, calculated using a Wilcoxon signed-rank test, are noted above each plot.
