. 2022 Oct 24:7:1001266.
doi: 10.3389/frma.2022.1001266. eCollection 2022.

Temporal disambiguation of relative temporal expressions in clinical texts



Amy L Olex et al. Front Res Metr Anal.

Abstract

Temporal expression recognition and normalization (TERN) is the foundation for all higher-level temporal reasoning tasks in natural language processing, such as timeline extraction, so it must be performed well to limit error propagation. Advancing the state of the art for TERN in clinical texts requires knowing where current systems struggle. In this work, we summarize the results of a detailed error analysis for three top-performing state-of-the-art TERN systems that participated in the 2012 i2b2 Clinical Temporal Relation Challenge, and compare them with our own home-grown system, Chrono, to identify specific areas in need of improvement. Performance metrics and an error analysis reveal that all systems have reduced performance in the normalization of relative temporal expressions, specifically in disambiguating temporal types and in identifying the correct anchor time. To address temporal disambiguation, we developed and integrated a module into Chrono that uses temporally fine-tuned contextual word embeddings to disambiguate relative temporal expressions. Chrono now achieves state-of-the-art performance for temporal disambiguation of relative temporal expressions in clinical text, and is the only TERN system to output dual annotations into both the TimeML and SCATE schemes.
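The disambiguation step the abstract describes maps a contextual embedding of a relative temporal expression (e.g., "three days" as a DURATION vs. part of a DATE like "three days ago") to a temporal type. The sketch below is an illustrative toy, not Chrono's implementation: the paper uses BERT-derived embeddings and an SVM classifier, whereas here invented 3-dimensional vectors and a nearest-centroid rule stand in for both.

```python
# Toy sketch of temporal-type disambiguation from contextual embeddings.
# The vectors and the nearest-centroid rule are stand-ins for the paper's
# BERT embeddings and SVM classifier.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Invented "embeddings" of training phrases, labeled with temporal type.
train = {
    "DATE":     [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],   # e.g. "two weeks ago"
    "DURATION": [[0.1, 0.9, 0.8], [0.2, 0.8, 0.9]],   # e.g. "for two weeks"
}
centroids = {label: centroid(vs) for label, vs in train.items()}

def classify(embedding):
    """Assign the temporal type whose class centroid is nearest."""
    return min(centroids, key=lambda label: sq_dist(embedding, centroids[label]))
```

In the real system, the embedding for each expression would come from a fine-tuned BERT model (the paper compares several fine-tuning strategies), and a trained SVM would replace the centroid rule.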

Keywords: BERT; clinical text; contextual word embeddings; error analysis; natural language processing; relative temporal expression; temporal expression recognition and normalization; temporal reasoning.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Overview of the fine-tuning, embedding extraction, and classification strategies examined in this work. Baseline BERT models are either the BertBase or ClinBioBert models referenced in the text. (A) No fine-tuning; (B) binary fine-tuning; (C) sequential Binary-Seq2Seq fine-tuning; (D) Seq2Seq fine-tuning.
Figure 2
Chrono's performance on the i2b2 training and evaluation data sets after conversion changes and algorithm improvements using span-based P, R, and F1 metrics.
Figure 3
Performance of top systems from the 2012 i2b2 Temporal Challenge and Chrono on (A) the full evaluation data set, and (B) the subset of poor performing files using span-based P, R, and F1 metrics. RB, Rule-Based; H, Hybrid.
Figure 4
Temporal phrases that were hard to correctly classify as a DURATION or DATE temporal type. Red text indicates an incorrect classification.
Figure 5
Temporal phrases for which it was hard to correctly identify the Anchor Time and/or Delta Value. Red text indicates an incorrect date.
Figure 6
ClinBioBert SVM performance using the Gold Standard RelIV-TIMEX Evaluation data set using class-based P, R, and F1 metrics. Scores are weighted averages across DATE and DURATION. Bold, best performance across all SVM models; orange, high; white, median; blue, low scores relative to all scores in the table.
Figure 7
BertBase SVM performance using the Gold Standard RelIV-TIMEX Evaluation data set using class-based P, R, and F1 metrics. Scores are weighted averages across DATE and DURATION. Bold, best performance across all SVM models; orange, high; white, median; blue, low scores relative to all scores in the table.
Figure 8
System performance on the RelIV-TIMEX Evaluation data set of Chrono before and after the TTD model integration, and of the three i2b2 state-of-the-art systems, using class-based P, R, and F1 metrics. Values are the weighted average bootstrap estimates across individual DATE and DURATION performance (Supplementary Table 5 contains the 95% confidence intervals). Bold, best performance; orange, high; white, median; blue, low scores, with the maximum and minimum relative to each column instead of the entire table.
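The "weighted averages across DATE and DURATION" reported in Figures 6-8 combine per-class precision, recall, and F1 weighted by class support. A minimal sketch of that computation (the per-class supports and scores below are invented for illustration):

```python
# Support-weighted averaging of per-class P, R, and F1, as used when a
# single score is reported across the DATE and DURATION classes.

def weighted_average(per_class):
    """per_class: {label: (support, precision, recall, f1)}.
    Returns (precision, recall, f1) weighted by class support."""
    total = sum(s for s, _, _, _ in per_class.values())
    p = sum(s * prec for s, prec, _, _ in per_class.values()) / total
    r = sum(s * rec for s, _, rec, _ in per_class.values()) / total
    f = sum(s * f1 for s, _, _, f1 in per_class.values()) / total
    return p, r, f

# Invented example: DATE is three times as frequent as DURATION,
# so it dominates the weighted scores.
scores = {
    "DATE":     (30, 0.90, 0.80, 0.85),
    "DURATION": (10, 0.70, 0.60, 0.65),
}
```

This is the same convention as scikit-learn's `average='weighted'` option; the paper's actual supports are in its supplementary tables.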


