Methods Inf Med. 2016 Aug 5;55(4):356-64.
doi: 10.3414/ME15-01-0122. Epub 2016 Jul 13.

Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification

David S Carrell et al. Methods Inf Med.

Abstract

Background: Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet the accuracy of annotations may vary considerably across individual annotators, and annotation is costly. The marginal benefit of incorporating additional annotators has not been well characterized.

Objectives: This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size.

Methods: Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation.
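The team evaluation described here can be illustrated with a minimal sketch, under stated assumptions: a simulated team's annotations are taken to be the union of its members' annotations, and an annotation counts as correct only if its span matches a gold standard span exactly. The data, the `team_recall` and `recall_by_team_size` helpers, and the exact-match criterion are all hypothetical and are not taken from the study.

```python
from itertools import combinations
from statistics import median

# Hypothetical toy data: each PII annotation is (doc_id, start, end).
gold = {("d1", 0, 5), ("d1", 10, 18), ("d2", 3, 9), ("d2", 20, 27), ("d3", 1, 4)}
annotators = {
    "A": {("d1", 0, 5), ("d1", 10, 18), ("d2", 3, 9), ("d2", 20, 27)},
    "B": {("d1", 0, 5), ("d2", 3, 9), ("d2", 20, 27), ("d3", 1, 4)},
    "C": {("d1", 0, 5), ("d1", 10, 18), ("d2", 20, 27), ("d3", 1, 4)},
    "D": {("d1", 10, 18), ("d2", 3, 9), ("d2", 20, 27), ("d3", 1, 4)},
}

def team_recall(team, annotations, gold):
    """Recall of a team whose pooled output is the union of its members' spans."""
    found = set().union(*(annotations[name] for name in team))
    return len(found & gold) / len(gold)

def recall_by_team_size(annotations, gold):
    """Map team size -> (min, median, max) recall over all teams of that size."""
    result = {}
    for size in range(1, len(annotations) + 1):
        recalls = [
            team_recall(team, annotations, gold)
            for team in combinations(sorted(annotations), size)
        ]
        result[size] = (min(recalls), median(recalls), max(recalls))
    return result

summary = recall_by_team_size(annotators, gold)
for size, (low, mid, high) in summary.items():
    print(f"team size {size}: recall min={low:.3f} median={mid:.3f} max={high:.3f}")
```

In this toy corpus each annotator misses a different instance, so every pair already recovers the full gold standard. Real annotations would also need a rule for spans that are too long or too short.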

Results: Recall for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three, when measured against the gold standard. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for instances discovered only by a fourth annotator.
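The cost figures rest on a marginal calculation: the cost of adding an annotator divided by the number of PII instances that only that annotator contributes. A minimal sketch of this arithmetic follows, assuming a flat, hypothetical cost per annotator and made-up instance IDs. The `marginal_costs` helper is illustrative, and the sketch reproduces neither the study's numbers nor its median over teams.

```python
def marginal_costs(ordered_found, cost_per_annotator):
    """For each annotator, in order, return (new_instances, cost_per_new_instance).

    ordered_found: list of sets of PII instance IDs, one set per annotator.
    cost_per_annotator: assumed flat salary cost of one annotator's pass.
    """
    seen = set()
    rows = []
    for found in ordered_found:
        new = found - seen          # instances no earlier annotator found
        seen |= found
        cost = cost_per_annotator / len(new) if new else float("inf")
        rows.append((len(new), cost))
    return rows

# Hypothetical example: 100 PII instances with IDs 0..99.
found_sets = [
    set(range(0, 90)),    # annotator 1 finds 90 instances
    set(range(5, 98)),    # annotator 2 adds IDs 90..97
    set(range(10, 100)),  # annotator 3 adds IDs 98 and 99
    set(range(0, 95)),    # annotator 4 adds nothing new
]
rows = marginal_costs(found_sets, cost_per_annotator=180.0)
for rank, (n_new, cost) in enumerate(rows, start=1):
    print(f"annotator {rank}: {n_new} new instances, ${cost:.2f} per new instance")
```

Because each later annotator is paid for a full pass but contributes only the few instances others missed, the cost per new instance rises steeply with team size.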

Conclusions: Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.

Keywords: Patient data privacy; cost analysis; data sharing; natural language processing.

Figures

Figure 1
Comparison of a clinical note (A) as displayed in the EHR with actual PII underlined, and (B) as extracted from the EHR with color highlighting representing annotations that are correct (green), too long (blue), too short (yellow), or missing (underlined without highlighting).
Figure 2
Median number of new PII instances discovered by annotator teams of increasing size (individuals, simulated teams of 2, 3, or 4) and salary cost per additional PII instance discovered, in 2014 dollars.
