Review
NPJ Digit Med. 2022 Dec 21;5(1):186.
doi: 10.1038/s41746-022-00730-6.

A survey on clinical natural language processing in the United Kingdom from 2007 to 2022

Honghan Wu et al. NPJ Digit Med.

Abstract

Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show that, not surprisingly, clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, effort is required to deepen areas such as disease (sub-)phenotyping and to broaden application domains. There is also a need to improve links between academia and industry and to enable deployments in real-world settings so that clinical NLP's great potential in care delivery can be realised. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.

Conflict of interest statement

R.S. declares research support received in the last 3 years, from Janssen, GSK and Takeda. The remaining authors declare no competing interests.

Figures

Fig. 1. The scope of this study is composed of two main parts.
(a) A UK community survey (the lower oval); and (b) a literature review of the community’s research outputs (the upper oval). *NHS: National Health Service in the UK; RL/ML/LLM: NLP technologies of rule-based, machine learning and large language models.
Fig. 2. Snapshots (force-directed visualisations) of the community from 2012 to 2022.
The graphs contain four types of entities: projects, persons, organisations and funders. Each graph is constructed using data from projects with a start date earlier than or in the given year. Graph data is cumulative, meaning a later year’s data is a superset of the data for all previous years. The size of an organisation node indicates the total funding, in pounds sterling, that the organisation received.
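The cumulative construction described in this caption can be sketched as follows. This is a minimal illustration only: the `Project` record, its field names and the toy projects are assumptions made for the example, not the survey's actual data model or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    # Hypothetical record: a project, its start year, and the entities
    # (persons, organisations, funders) linked to it.
    name: str
    start_year: int
    members: tuple


def snapshot_edges(projects, year):
    """Return (project, entity) edges for projects that started in or before `year`."""
    return {
        (p.name, member)
        for p in projects
        if p.start_year <= year
        for member in p.members
    }


# Toy data, invented purely for illustration.
projects = [
    Project("P1", 2012, ("Alice", "Org A", "Funder X")),
    Project("P2", 2017, ("Alice", "Bob", "Org B", "Funder X")),
    Project("P3", 2021, ("Carol", "Org A", "Funder Y")),
]

edges_2012 = snapshot_edges(projects, 2012)
edges_2017 = snapshot_edges(projects, 2017)
edges_2022 = snapshot_edges(projects, 2022)
```

Because the filter is `start_year <= year`, each later snapshot contains every edge of the earlier ones, which is the superset property the caption states.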
Fig. 3. Histogram of the eigenvector centrality scores of person nodes.
The x-axis is the eigenvector centrality score and the y-axis (log scale) is the number of people with a given score.
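For readers unfamiliar with the measure, eigenvector centrality scores a node highly when its neighbours are themselves highly scored. The sketch below computes it by power iteration on a small undirected graph and bins the scores as a histogram would. The graph, the function names and the bin width are assumptions for illustration; the page does not say which tool or settings the authors used.

```python
import math
from collections import Counter


def eigenvector_centrality(adj, max_iter=500, tol=1e-12):
    """Power iteration for an undirected graph given as {node: set(neighbours)}.

    Iterating with (A + I) instead of A keeps the same leading eigenvector
    but avoids oscillation on bipartite graphs.
    """
    nodes = sorted(adj)
    start = 1.0 / math.sqrt(len(nodes))
    score = {n: start for n in nodes}
    for _ in range(max_iter):
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score


def histogram(scores, bin_width=0.1):
    """Count nodes per score bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    return Counter(int(s // bin_width) for s in scores.values())


# Toy collaboration graph, invented purely for illustration.
graph = {
    "Alice": {"Bob", "Carol", "Dave"},
    "Bob": {"Alice"},
    "Carol": {"Alice"},
    "Dave": {"Alice", "Erin"},
    "Erin": {"Dave"},
}

scores = eigenvector_centrality(graph)
counts = histogram(scores)
# A log-scale y-axis plots log10 of the per-bin counts.
log_counts = {b: math.log10(c) for b, c in counts.items()}
```

In this toy graph the best-connected person, `Alice`, receives the highest score, and a log-scale y-axis keeps sparsely populated high-score bins visible next to crowded low-score bins.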
Fig. 4. Trends in the last 15 years on budgets of all clinical NLP projects, those involving NHS and those involving industry organisations.
Each tick on the x-axis is a 3-year period. The y-axis shows the total budget. The summed budgets of NHS-involved and industry-involved projects are plotted alongside the budget of all projects across five 3-year periods.
Fig. 5. The development of studentship projects in clinical NLP from 2016 to 2021.
The three panels (from left to right) show the networks of studentship projects and their associated entities (funders, organisations and persons) for 2016, 2017 and 2021 respectively. The entire 2021 network is too big to be shown fully at the same scale. Therefore, a low-resolution overview is shown at the top right and a snapshot of it is displayed at the same scale as the other years.
Fig. 6. NLP tasks versus technical objectives.
The x-axis is the categories of NLP tasks and the y-axis is the technical objectives. The size of the circles denotes the number of publications.
Fig. 7. NLP algorithm type breakdown and their development trends over the last 15 years.
The main bar chart shows the changes of different NLP algorithms in the last 15 years. The pie chart at the top left corner depicts the overall breakdown of algorithms of all research work analysed.
Fig. 8. Knowledge representation and distributed representations.
The pie chart on the left shows the breakdown of representation techniques. For ontologies, the bar chart on the right depicts the five most frequently used ontologies in clinical NLP applications.
Fig. 9. Information collection and data extraction.
Step 1: Data were collected for funded clinical NLP projects by querying three searchable datasets from UK and EU funding bodies and downloading project data from UK charities such as British Heart Foundation and Cancer Research UK. Step 2: Data were extracted to obtain metadata of projects and their associated entities.
Fig. 10. Flow chart describing publication identification for clinical NLP literature review.
We started with 431 extracted publications, of which 361 had sufficient information for screening. The title/abstract screen further removed 202 papers that were deemed irrelevant. This left 159 publications for an eligibility assessment using inclusion/exclusion criteria on their full text. After this final check, 107 publications were included in the final review.
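The counts in this flow chart can be checked with simple arithmetic. The sketch below uses only the numbers given in the caption; the stage labels are paraphrased, and the per-stage exclusions of 70 and 52 are derived by subtraction rather than stated in the source.

```python
# Number of publications remaining after each stage, as given in the caption.
stages = [
    ("extracted", 431),
    ("sufficient information for screening", 361),
    ("after title/abstract screen", 159),
    ("included in final review", 107),
]

# Publications removed at each transition (derived by subtraction).
removed = [
    before - after
    for (_, before), (_, after) in zip(stages, stages[1:])
]

# Share of the originally extracted publications that reached the final review.
inclusion_rate = stages[-1][1] / stages[0][1]

for (label, _), n in zip(stages[1:], removed):
    print(f"{n} publications removed before stage: {label}")
print(f"inclusion rate: {inclusion_rate:.1%}")
```

The middle difference reproduces the 202 papers removed at the title/abstract screen, which confirms that the caption's figures are internally consistent.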
