Annotation and initial evaluation of a large annotated German oncological corpus

Madeleine Kittner¹, Mario Lamping²,³, Damian T Rieke²,³,⁴, Julian Götze⁵, Bariya Bajwa⁵, Ivan Jelas³, Gina Rüter³, Hanjo Hautow¹, Mario Sänger¹, Maryam Habibi¹, Marit Zettwitz³, Till de Bortoli³, Leonie Ostermann⁵, Jurica Ševa¹, Johannes Starlinger¹, Oliver Kohlbacher⁶,⁷,⁸,⁹, Nisar P Malek⁵, Ulrich Keilholz³, Ulf Leser¹

Affiliations

¹ Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany.
² Department of Hematology, Oncology and Cancer Immunology, Campus Benjamin Franklin, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
³ Charité Comprehensive Cancer Center, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
⁴ Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
⁵ Innere Medizin I, Universitätsklinikum Tübingen, Tübingen, Germany.
⁶ Institut für Translationale Bioinformatik, Universitätsklinikum Tübingen, Tübingen, Germany.
⁷ Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany.
⁸ Department of Computer Science, University of Tübingen, Tübingen, Germany.
⁹ Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany.

PMID: 33898938
PMCID: PMC8054032
DOI: 10.1093/jamiaopen/ooab025

Madeleine Kittner et al. JAMIA Open. 2021 Apr 19;4(2):ooab025. eCollection 2021 Apr.

Abstract

Objective: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnoses, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on information extraction (IE) from German medical texts.

Materials and methods: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research.

Results: The annotated corpus consists of 11 434 sentences and 89 942 tokens, with 11 124 annotations of medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences and keep 25% as a held-out dataset for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach, depending on the entity type, F1-scores of 0.72-0.90 for named entity recognition, 0.10-0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection.
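
As an illustration of how entity-level F1-scores like those reported above are commonly computed, the following Python sketch assumes exact matching of sentence, character span, and entity type. The `Entity` type, the label names, and all offsets are hypothetical; they are not taken from BRONCO or from the authors' evaluation code, which may use a different matching criterion.

```python
from typing import Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    """A hypothetical entity annotation: sentence index, character span, type."""
    sentence_id: int
    start: int
    end: int
    label: str


def precision_recall_f1(
    gold: Iterable[Entity], predicted: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Entity-level scores under exact match (same sentence, span, and label)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy data (invented for illustration only).
gold = [
    Entity(0, 0, 12, "DIAGNOSIS"),
    Entity(0, 20, 31, "MEDICATION"),
    Entity(1, 5, 25, "TREATMENT"),
    Entity(2, 0, 9, "DIAGNOSIS"),
]
predicted = [
    Entity(0, 0, 12, "DIAGNOSIS"),
    Entity(0, 20, 31, "MEDICATION"),
    Entity(1, 5, 20, "TREATMENT"),  # span boundary differs, so no exact match
]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching, a prediction with a slightly different span boundary counts as both a false positive and a false negative, which is one reason scores can differ noticeably between entity types.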

Discussion: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important.

Conclusion: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.

Keywords: German language; corpus annotation; medical information extraction.

Figures

**Figure 1.**
Exemplary excerpts from original discharge summaries and annotations, shown in BRAT visualization. Attributes in brackets have the following meaning: laterality right (R), negated entity (negative), speculative entity (speculative), and entity possible in the future (possibleFuture). Additionally, codes resulting from entity normalization are given in brackets.

**Figure 2.**
Annotation procedure including deidentification, annotation of section titles, and annotation of medical entities with attributes. Altogether, 1 annotation leader and 9 medical annotators were involved in different parts of the process.

**Figure 3.**
Visualization of mismatches between annotations of 2 annotators, shown in BRAT visualization. (A) One of the annotations misses Laterality R. (B) “Oberbauchsonographie” (sonography of the upper abdomen) is annotated by only 1 annotator, and “Ausschluss von Leberfiliae” (exclusion of liver metastasis) is annotated with different text spans and only once with the attribute possibleFuture.

**Figure 4.**
Distribution of documents per cluster after hierarchical clustering of sentences in BRONCO150.
