BMC Bioinformatics. 2011 Jun 9;12 Suppl 3(Suppl 3):S2.
doi: 10.1186/1471-2105-12-S3-S2.

A system for de-identifying medical message board text

Adrian Benton et al. BMC Bioinformatics. 2011.

Abstract

There are millions of public posts to medical message boards by users seeking support and information on a wide range of medical conditions. It has been shown that these posts can be used to gain a greater understanding of patients' experiences and concerns. As investigators continue to explore large corpora of medical discussion board data for research purposes, protecting the privacy of the members of these online communities becomes an important challenge. Extant entity recognition methods used for more structured text are not sufficient because message posts present additional challenges: the posts contain many typographical errors, a larger variety of possible names, terms and abbreviations specific to Internet posts or to a particular message board, and mentions of the authors' personal lives. The main contribution of this paper is a system that automatically de-identifies the authors of message board posts, taking these challenges into account. We demonstrate our system on two different message board corpora, one on breast cancer and another on arthritis. We show that our approach significantly outperforms other publicly available named entity recognition and de-identification systems, which have been tuned for more structured text such as operative reports, pathology reports, discharge summaries, or newswire.

Figures

Figure 1
De-identification process

Figure 2
Performance of our system over development and test sets, varying the likelihood threshold. The blue curve displays the precision and recall of our system over the breast cancer (BC) development set as the likelihood threshold varies. The threshold values range from 0.5 to 0.005 and are marked at major intervals on the curve. A threshold of 0.05 was chosen because it appeared to yield the highest recall without unnecessarily sacrificing precision over the development set. The isolated red point shows the performance of our system at the chosen threshold of 0.05 over the arthritis corpus test set, while the blue point shows its performance over the development set. The red curve shows our system's performance over the arthritis test set. Note that this curve follows a trajectory similar to that of the BC development curve, and that the 0.05 threshold point on it corresponds to a similar precision/recall trade-off.
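The threshold sweep described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each token receives a likelihood score of being identifying and is flagged when that score meets or exceeds the threshold, so that lowering the threshold raises recall. The function names, the scores, and the gold labels below are all hypothetical toy values chosen only to show the mechanics.

```python
from typing import List, Sequence, Tuple


def precision_recall(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when tokens with score >= threshold are flagged.

    Assumption: a higher score means the token is more likely identifying.
    """
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    # By convention, report 1.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


def sweep(
    scores: Sequence[float], labels: Sequence[bool], thresholds: Sequence[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each candidate threshold."""
    return [(t, *precision_recall(scores, labels, t)) for t in thresholds]


# Hypothetical per-token scores and gold labels (True = identifying token).
TOY_SCORES = [0.9, 0.6, 0.3, 0.04, 0.01, 0.2, 0.07, 0.002]
TOY_LABELS = [True, True, True, True, False, False, True, False]

if __name__ == "__main__":
    for t, p, r in sweep(TOY_SCORES, TOY_LABELS, [0.5, 0.05, 0.005]):
        print(f"threshold={t:<6} precision={p:.2f} recall={r:.2f}")
```

On a development set, the resulting (precision, recall) pairs trace a curve like the one in Figure 2, and a threshold is then picked by inspecting where further gains in recall start to cost disproportionate precision.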

