Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification
- PMID: 27405787
- PMCID: PMC5194214
- DOI: 10.3414/ME15-01-0122
Abstract
Background: Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet annotation is costly, and its accuracy may vary considerably across individual annotators. The marginal benefit of incorporating additional annotators has not been well characterized.
Objectives: This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold standard annotations to evaluate the performance of teams of annotators of increasing size.
Methods: Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold standard corpus for evaluation.
Results: When measured against the gold standard, recall rates for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator.
Conclusions: Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.
Keywords: Patient data privacy; cost analysis; data sharing; natural language processing.
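The evaluation described in the Methods and Results can be illustrated with a minimal sketch. It assumes that a team's annotations are pooled as the union of its members' annotations, that PII instances are represented by simple IDs, and that each annotator pass has a flat cost; the toy data, the cost figure, and all function names are hypothetical and are not taken from the study.

```python
from itertools import combinations

def pooled(annotations, team):
    """Union of the PII instances marked by every annotator in `team`."""
    return set().union(*(annotations[a] for a in team))

def team_recall(gold, annotations, team):
    """Fraction of gold-standard instances found by the pooled team."""
    return len(pooled(annotations, team) & gold) / len(gold)

def recall_range(gold, annotations, size):
    """(min, max) recall over all possible teams of the given size."""
    recalls = [team_recall(gold, annotations, t)
               for t in combinations(sorted(annotations), size)]
    return min(recalls), max(recalls)

def marginal_cost(gold, annotations, team, newcomer, cost_per_pass):
    """Cost of adding `newcomer` per gold instance that only they contribute."""
    new = (annotations[newcomer] & gold) - pooled(annotations, team)
    return cost_per_pass / len(new) if new else float("inf")

# Hypothetical toy data: 10 gold PII instances, four annotators.
gold = set(range(1, 11))
annotations = {
    "A": {1, 2, 3, 4, 5, 6, 7, 8},
    "B": {1, 2, 3, 4, 5, 6, 7, 8, 10},
    "C": {2, 3, 4, 5, 6, 7, 8, 9, 10},
    "D": {1, 2, 3, 4, 6, 7, 8, 9, 10},
}
COST_PER_PASS = 90.0  # assumed flat cost of one annotator's pass over the corpus

for size in (1, 2, 3):
    print(size, recall_range(gold, annotations, size))

# Marginal cost per newly discovered instance as annotators are added in order.
team = ()
for name in ("A", "B", "C", "D"):
    print(name, marginal_cost(gold, annotations, team, name, COST_PER_PASS))
    team += (name,)
```

In this toy setup the cost per newly found instance rises as each annotator is added, because later annotators mostly rediscover instances that are already pooled. This is the same qualitative pattern the abstract reports, though the numbers here are purely illustrative.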
Similar articles
- De-identification of clinical notes in French: towards a protocol for reference corpus development. J Biomed Inform. 2014 Aug;50:151-61. doi: 10.1016/j.jbi.2013.12.014. Epub 2013 Dec 29. PMID: 24380818
- RysannMD: A biomedical semantic annotator balancing speed and accuracy. J Biomed Inform. 2017 Jul;71:91-109. doi: 10.1016/j.jbi.2017.05.016. Epub 2017 May 26. PMID: 28552401
- Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text. J Biomed Inform. 2014 Aug;50:162-72. doi: 10.1016/j.jbi.2014.05.002. Epub 2014 May 20. PMID: 24859155 Free PMC article.
- A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. J Am Med Inform Assoc. 2015 Sep;22(5):948-56. doi: 10.1093/jamia/ocv037. Epub 2015 May 6. PMID: 25948699 Free PMC article.
- Human-annotated rationales and explainable text classification: a survey. Front Artif Intell. 2024 May 24;7:1260952. doi: 10.3389/frai.2024.1260952. eCollection 2024. PMID: 38854843 Free PMC article. Review.
Cited by
- The OpenDeID corpus for patient de-identification. Sci Rep. 2021 Oct 7;11(1):19973. doi: 10.1038/s41598-021-99554-9. PMID: 34620985 Free PMC article.
- Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng. 2017 Mar 1;29(3):698-711. doi: 10.1109/TKDE.2016.2628180. Epub 2016 Nov 11. PMID: 28943741 Free PMC article.
- Learning to identify Protected Health Information by integrating knowledge- and data-driven algorithms: A case study on psychiatric evaluation notes. J Biomed Inform. 2017 Nov;75S:S28-S33. doi: 10.1016/j.jbi.2017.06.005. Epub 2017 Jun 7. PMID: 28602908 Free PMC article.
- Evaluating the re-identification risk of a clinical study report anonymized under EMA Policy 0070 and Health Canada Regulations. Trials. 2020 Feb 18;21(1):200. doi: 10.1186/s13063-020-4120-y. PMID: 32070405 Free PMC article.
- Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers. J Am Med Inform Assoc. 2020 Jul 1;27(9):1374-1382. doi: 10.1093/jamia/ocaa095. PMID: 32930712 Free PMC article.