J Biomed Inform. 2014 Aug:50:173-183.

doi: 10.1016/j.jbi.2014.01.014. Epub 2014 Feb 17.

Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research

Affiliation

¹ Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA.

^# Contributed equally.

PMID: 24556292
PMCID: PMC4125487
DOI: 10.1016/j.jbi.2014.01.014

Louise Deleger et al. J Biomed Inform. 2014 Aug.

Abstract

Objective: The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus under a Data Use Agreement, we would like to encourage other computational linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; and (3) the tests to show the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development.

Materials and methods: In a previous study we built an original de-identification gold standard corpus annotated with true PHI from 3503 randomly selected clinical notes covering the 22 most frequent clinical note types of our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. We then evaluated the research value of this new dataset by comparing the performance of an existing, published in-house de-identification system when trained on the new de-identification gold standard corpus with the performance of the same system when trained on the original corpus. We also assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets that were released by other groups for de-identification research. Finally, we measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes.
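
The replacement step described above can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not say how surrogate PHI was generated, so the surrogate pools, the `replace_phi` function, the PHI type labels and the character-offset annotation format below are all assumptions made for illustration. The sketch swaps each annotated span for a same-type surrogate and re-aligns the annotations so that the modified note remains usable as gold standard training data.

```python
import random

# Hypothetical surrogate pools (assumption: the paper's actual lists are not given).
SURROGATES = {
    "NAME": ["Jane Miller", "Robert Chen", "Maria Lopez"],
    "DATE": ["03/14/2009", "11/02/2011", "07/21/2010"],
}


def replace_phi(text, annotations, rng):
    """Replace each annotated PHI span with a surrogate of the same type.

    annotations: non-overlapping (start, end, phi_type) character offsets.
    Returns the modified text and the annotations re-aligned to it.
    """
    pieces = []
    new_annotations = []
    cursor = 0      # position in the original text
    new_length = 0  # length of the modified text built so far
    for start, end, phi_type in sorted(annotations):
        # Copy the untouched text between the previous span and this one.
        pieces.append(text[cursor:start])
        new_length += start - cursor
        # Substitute a surrogate and record its new offsets.
        surrogate = rng.choice(SURROGATES[phi_type])
        pieces.append(surrogate)
        new_annotations.append((new_length, new_length + len(surrogate), phi_type))
        new_length += len(surrogate)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_annotations


note = "Seen by Dr. John Smith on 01/02/2013."
gold = [(12, 22, "NAME"), (26, 36, "DATE")]
new_note, new_gold = replace_phi(note, gold, random.Random(0))
```

Keeping the surrogate in the same position and of the same PHI type is what lets the surrounding context, on which a machine learning model relies, stay intact.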

Results: Performance of the de-identification system using the new gold standard corpus as a training set was very close to that obtained when training on the original corpus (overall F-measure 92.56 vs. 93.48). The best i2b2/PhysioNet/CCHMC cross-training performance was obtained when training on the new shared CCHMC gold standard corpus, although it was still lower than the performance of corpus-specific training.
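
For readers unfamiliar with the metric, the sketch below shows how precision, recall and F-measure can be computed from gold and predicted PHI annotations. It assumes exact span-and-type matching and the balanced F-measure (F1); the abstract does not state the matching criterion used in the study, and the function name `precision_recall_f` and the sample spans are hypothetical.

```python
def precision_recall_f(gold, predicted):
    """Compute precision, recall and balanced F-measure over annotation sets.

    gold, predicted: iterables of (start, end, phi_type) tuples.
    Assumption: a prediction counts as correct only on an exact match.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


gold_spans = {(0, 4, "NAME"), (10, 20, "DATE"), (30, 38, "PHONE")}
system_spans = {(0, 4, "NAME"), (10, 20, "DATE"), (40, 45, "NAME")}
p, r, f = precision_recall_f(gold_spans, system_spans)
```

In this toy case two of three predictions are correct and two of three gold spans are found, so precision, recall and F-measure all equal 2/3.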

Discussion and conclusion: We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus, with only a limited drop in machine learning de-identification performance.

Keywords: Automated de-identification; De-identification gold standard; Health Insurance Portability and Accountability Act; Natural Language Processing; Privacy of patient data; Protected Health Information.

Figures

**Figure 1**
Example of PHI replacement in a clinical note

**Figure 2**
De-identification performance (F-measure, precision, recall) on the i2b2 test corpus and number of training instances for models trained on various combinations of i2b2, new CCHMC and PhysioNet corpora, by percentage (horizontal axis) of the i2b2 corpus included in the training corpus

**Figure 3**
De-identification performance (F-measure, precision, recall) on the PhysioNet corpus and number of training instances for models trained on various combinations of PhysioNet, new CCHMC and i2b2 corpora, by percentage (horizontal axis) of the PhysioNet corpus included in the training corpus

**Figure 4**
De-identification performance (F-measure, precision, recall) on the original CCHMC corpus (orig. CCHMC = original CCHMC) and number of training instances for models trained on various combinations of original CCHMC, i2b2 and PhysioNet corpora, by percentage (horizontal axis) of the original CCHMC corpus included in the training corpus

Grants and funding

5R00LM010227-05/LM/NLM NIH HHS/United States
