Automated redaction of names in adverse event reports using transformer-based neural networks

Eva-Lisa Meldau¹, Shachi Bista², Carlos Melgarejo-González², G Niklas Norén²

PMID: 39716217
PMCID: PMC11668006
DOI: 10.1186/s12911-024-02785-9

Eva-Lisa Meldau et al. BMC Med Inform Decis Mak. 2024 Dec 23;24(1):401. doi: 10.1186/s12911-024-02785-9.

Affiliations

¹ Uppsala Monitoring Centre, Uppsala, Sweden. eva-lisa.meldau@who-umc.org.
² Uppsala Monitoring Centre, Uppsala, Sweden.

Abstract

Background: Automated recognition and redaction of personal identifiers in free text can enable organisations to share data while protecting privacy. This is important in the context of pharmacovigilance since relevant detailed information on the clinical course of events, differential diagnosis, and patient-reported reflections may often only be conveyed in narrative form. The aim of this study was to develop and evaluate a method for automated redaction of person names in English narrative text on adverse event reports. The target domain for this study was case narratives from the United Kingdom's Yellow Card scheme, which collects and monitors information on suspected side effects of medicines and vaccines.

Methods: We fine-tuned BERT, a transformer-based neural network, to recognise names in case narratives. Training data consisted of newly annotated records from the Yellow Card data and of data from the i2b2 2014 de-identification challenge. Because the Yellow Card data contained few names, we used predictive models to select narratives for training. Performance was evaluated on a separate set of annotated narratives from the Yellow Card scheme. In-depth review determined whether person names (or parts of names) missed by the de-identification method could enable re-identification of the individual, and whether de-identification reduced the clinical utility of narratives by collaterally masking relevant information.
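To make the redaction step concrete, here is a minimal sketch of how flagged tokens could be masked in a narrative. It does not reproduce the study's BERT model: the regular expression is only an assumed reading of a "simple alphanumeric tokeniser", and `toy_name_classifier`, `redact`, and the `[NAME]` mask are hypothetical names introduced for illustration.

```python
import re
from typing import Callable, List, Tuple

# Assumption: an alphanumeric tokeniser treats each run of letters/digits
# as one token; the exact rules used in the study are not stated here.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenise(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, token) for each alphanumeric run in the text."""
    return [(m.start(), m.end(), m.group()) for m in TOKEN_PATTERN.finditer(text)]


def redact(text: str, is_name: Callable[[str], bool], mask: str = "[NAME]") -> str:
    """Replace every token flagged by is_name with the mask string."""
    pieces = []
    cursor = 0
    for start, end, token in tokenise(text):
        if is_name(token):
            pieces.append(text[cursor:start])  # keep text before the token
            pieces.append(mask)                # mask the flagged token
            cursor = end
    pieces.append(text[cursor:])               # keep the remaining text
    return "".join(pieces)


def toy_name_classifier(token: str) -> bool:
    """Hypothetical stand-in for a fine-tuned token classifier."""
    return token.lower() in {"jane", "doe"}


example = "Patient Jane Doe reported a rash after the second dose."
redacted = redact(example, toy_name_classifier)
print(redacted)
```

In a real pipeline the lookup-based classifier would be replaced by per-token predictions from the trained model; the masking logic itself would stay the same.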

Results: Recall on held-out Yellow Card data was 87% (155/179) at a precision of 55% (155/282) and a false-positive rate of 0.05% (127/263,451). Considering tokens longer than three characters separately, recall was 94% (102/108) and precision 58% (102/175). For 13 of the 5,042 narratives in Yellow Card test data (71 with person names), the method failed to flag at least one name token. According to in-depth review, the leaked information could enable direct identification for one narrative and indirect identification for two narratives. Clinically relevant information was removed in fewer than 1% of the 5,042 processed narratives; 97% of the narratives were completely untouched.
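The reported percentages follow directly from the token counts given in parentheses. The short sketch below recomputes them; the variable names are illustrative, and the counts are taken from the Results paragraph as stated.

```python
# Token-level counts as reported in the Results paragraph.
true_positives = 155          # name tokens correctly flagged
annotated_name_tokens = 179   # all annotated name tokens in the test data
flagged_tokens = 282          # all tokens flagged by the method
non_name_tokens = 263_451     # denominator of the false-positive rate

# Flagged tokens that were not annotated as names.
false_positives = flagged_tokens - true_positives

recall = true_positives / annotated_name_tokens
precision = true_positives / flagged_tokens
false_positive_rate = false_positives / non_name_tokens

# Subset of tokens longer than three characters.
long_recall = 102 / 108
long_precision = 102 / 175

print(f"recall: {recall:.1%}, precision: {precision:.1%}")
print(f"false-positive rate: {false_positive_rate:.3%}")
print(f"long tokens: recall {long_recall:.1%}, precision {long_precision:.1%}")
```

The gap between 87% and 94% recall shows that most missed tokens were short ones, such as initials.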

Conclusions: Automated redaction of names in free-text narratives of adverse event reports can achieve sufficient recall, including for shorter tokens such as patient initials. In-depth review shows that the rare leaks that occur tend not to compromise patient confidentiality. Precision and false-positive rates are acceptable, with almost all clinically relevant information retained.

Keywords: Adverse drug reaction reporting systems; Data anonymization; De-identification; Domain adaptation; Medical language processing; Pharmacovigilance.

Conflict of interest statement

Declarations.

Ethics approval and consent to participate: This study has received ethical approval from Etikprövningsmyndigheten, the Swedish Ethical Review Authority, Uppsala, Sweden (Dnr 2019–05722). This research falls within 3 § 1 p. etikprövningslagen (special categories of personal data), the national ethical review act. The research is not of the kind outlined in 4 § etikprövningslagen, which means that consent is not required for this research. All methods were performed in accordance with relevant guidelines and regulations.

Consent for publication: Not applicable.

Competing interests: The authors declare no competing interests.

Authors’ information: Uppsala Monitoring Centre is a non-profit foundation established in 1978 through an agreement between the World Health Organisation (WHO) and the Government of Sweden. It is the designated WHO Collaborating Centre for International Drug Monitoring and custodian and manager of VigiBase, the WHO global database of adverse event reports for medicines and vaccines. VigiBase brings together adverse event reports from members of the WHO Programme for International Drug Monitoring including the UK MHRA.

Figures

**Fig. 1**
Overview of the de-identification method and dataset preparation

**Fig. 2**
Tokens delimited by the simple alphanumeric tokeniser

**Fig. 3**
Venn diagram showing the cross-classification of tokens in the Yellow Card test data

**Fig. 4**
Example narratives where the NAME tokens were correctly flagged by the de-identification method (black background indicating NAME tokens flagged by the model, underline indicating NAME token annotations). N.B. All NAME tokens and personal identifiers are surrogates for similar entities in the original narratives. Drug names and dates have been replaced with placeholders

**Fig. 5**
Narratives with NAME tokens not flagged by our method (black background indicating NAME tokens flagged by the model, underline indicating NAME token annotations). ‘Ramesh Patel’ was classified as a direct identifier, ‘Kaveson’ was classified as an indirect identifier, and ‘SB’ and ‘deirdre’ were classified as not enabling re-identification in the context of the narratives. N.B. All NAME tokens and personal identifiers are surrogates for similar entities in the original narratives. Drug names and medical facility names have been replaced with placeholders

**Fig. 6**
Narratives with NON-NAME tokens incorrectly flagged by the method (black background indicating NAME tokens flagged by the model, underline indicating NAME token annotations). For the two narratives to the left, the redacted text was suspected to be clinically relevant; for the two narratives on top, the redacted text was classified as clinically relevant once revealed. N.B. All NAME tokens and personal identifiers are surrogates for similar entities in the original narratives. Drug names have been replaced with a placeholder

