J Biomed Inform. 2015 Dec;58 Suppl(Suppl):S20-S29.
doi: 10.1016/j.jbi.2015.07.020. Epub 2015 Aug 28.

Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus

Amber Stubbs et al. J Biomed Inform. 2015 Dec.

Abstract

The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. This corpus was de-identified under a broad interpretation of the HIPAA guidelines using double-annotation followed by arbitration, rounds of sanity checking, and proofreading. The average token-based F1 measure for the annotators compared to the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information was automatically replaced with realistic surrogates, which were then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. This corpus was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations.
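As an illustration of what an entity-based micro-averaged F-measure involves, the following minimal Python sketch pools true positives, false positives, and false negatives across all records before computing precision, recall, and F1. It is not the shared task's evaluation script: the function name `micro_f1`, the `(start, end, category)` entity representation, and the exact-match criterion are all assumptions made for this example, and the sample annotations are invented.

```python
from typing import Dict, Set, Tuple

# Hypothetical entity representation: (start offset, end offset, PHI category).
Entity = Tuple[int, int, str]


def micro_f1(
    gold: Dict[str, Set[Entity]],
    system: Dict[str, Set[Entity]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), micro-averaged over all records.

    Assumption: an entity counts as correct only if its span and category
    both match a gold entity exactly.
    """
    tp = fp = fn = 0
    # Pool the counts over every record seen in either annotation set.
    for record_id in gold.keys() | system.keys():
        g = gold.get(record_id, set())
        s = system.get(record_id, set())
        tp += len(g & s)  # entities found in both gold and system output
        fp += len(s - g)  # system entities with no exact gold match
        fn += len(g - s)  # gold entities the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented toy data: two records, three gold entities, four system entities.
gold = {
    "record-1": {(0, 5, "NAME"), (10, 20, "DATE")},
    "record-2": {(3, 8, "LOCATION")},
}
system = {
    "record-1": {(0, 5, "NAME"), (10, 18, "DATE")},  # DATE span is too short
    "record-2": {(3, 8, "LOCATION"), (30, 35, "AGE")},  # AGE is spurious
}
precision, recall, f1 = micro_f1(gold, system)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the counts are pooled before dividing, records with many entities weigh more heavily than records with few, which is what distinguishes micro-averaging from averaging per-record scores.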

Keywords: Annotation; De-identification; HIPAA; Natural language processing.

Figures

Figure 1. Categories and sub-categories in the i2b2 de-identification annotation (a version of this figure also appears in Stubbs and Uzuner, 2014a)
Figure 2. Annotation pipeline for de-identification
Figure 3. Sample of clinical text before and after surrogate generation using simplified XML representation
