Multicenter Study

J Biomed Semantics. 2022 May 8;13(1):13.

doi: 10.1186/s13326-022-00269-1.

SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

Lucas Emanuel Silva E Oliveira¹, Ana Carolina Peters², Adalniza Moura Pucca da Silva², Caroline Pilatti Gebeluca², Yohan Bonescki Gumiel², Lilian Mie Mukai Cintho², Deborah Ribeiro Carvalho², Sadid Al Hasan³, Claudia Maria Cabral Moro²

Affiliations

¹ Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil. kunkaweb@gmail.com.
² Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901, Brazil.
³ AI Lab, Philips Research North America, Cambridge, MA, USA.

PMID: 35527259
PMCID: PMC9080187
DOI: 10.1186/s13326-022-00269-1

Lucas Emanuel Silva E Oliveira et al. J Biomed Semantics. 2022.

Abstract

Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research; (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives; (3) a web-based annotation tool focusing on an annotation suggestion feature; and (4) both intrinsic and extrinsic evaluations of the annotations.

Results: This study resulted in SemClinBr, a corpus of 1000 clinical notes labeled with 65,117 entities and 11,263 relations. In addition, dictionaries of negation cues and medical abbreviations were generated from the annotations. The average inter-annotator agreement score ranged from 0.71 under strict matching to 0.92 under relaxed matching, which accepts partial overlaps and hierarchically related semantic types. The extrinsic evaluation, which applied the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results consistent with the agreement scores.
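
The difference between strict and relaxed matching can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the exact agreement formula, so the pairwise score below (the share of annotations that find a counterpart in the other annotator's set) is an assumption, as are the `Span` type, the function names, the toy annotations, and the small `PARENTS` table standing in for the semantic type hierarchy.

```python
from typing import Callable, NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str  # semantic type assigned by the annotator


# Hypothetical child -> parent table standing in for the type hierarchy.
PARENTS = {"Disease or Syndrome": "Pathologic Function"}


def strict_match(a: Span, b: Span) -> bool:
    # Strict: identical boundaries and identical semantic type.
    return a == b


def relaxed_match(a: Span, b: Span) -> bool:
    # Relaxed: any character overlap, and the types are equal or
    # directly related (one is the parent of the other).
    overlaps = a.start < b.end and b.start < a.end
    related = (
        a.label == b.label
        or PARENTS.get(a.label) == b.label
        or PARENTS.get(b.label) == a.label
    )
    return overlaps and related


def agreement(ann_a: list[Span], ann_b: list[Span],
              match: Callable[[Span, Span], bool]) -> float:
    # Assumed pairwise score: fraction of all annotations (from both
    # annotators) that have at least one match on the other side.
    total = len(ann_a) + len(ann_b)
    if total == 0:
        return 1.0
    hits_a = sum(any(match(a, b) for b in ann_b) for a in ann_a)
    hits_b = sum(any(match(b, a) for a in ann_a) for b in ann_b)
    return (hits_a + hits_b) / total


# Toy annotations of the same note by two annotators.
annotator_1 = [Span(0, 8, "Sign or Symptom"), Span(12, 20, "Disease or Syndrome")]
annotator_2 = [Span(0, 8, "Sign or Symptom"), Span(10, 20, "Pathologic Function")]

strict_score = agreement(annotator_1, annotator_2, strict_match)
relaxed_score = agreement(annotator_1, annotator_2, relaxed_match)
print(strict_score, relaxed_score)
```

In this toy case the second pair of annotations disagrees on both the boundary and the type, so it counts only under the relaxed criterion, which is why a relaxed score is never lower than the strict one.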

Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.

Keywords: Clinical narratives; Corpora; Gold standard; Natural language processing; Semantic annotation.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
A broad view of SemClinBr corpus development. The diagram is an overview of the SemClinBr corpus development, which shows the selection of thousands of clinical notes from multiple hospitals and medical specialties. A multidisciplinary team developed the elements in orange, representing (i) the fine-grained annotation schema following the UMLS semantic types and (ii) the web-based annotation tool featuring the UMLS REST API. These resources supported the generation of the ground truth (i.e., gold standard), which was evaluated intrinsically (i.e., inter-annotator agreement) and extrinsically in two different NLP tasks (i.e., named entity recognition and negation detection).

**Fig. 2**
Revision and quality verification process of the annotation guidelines. The iterative process started with the first guideline draft; then, a small number of documents were double-annotated, and their inter-annotator agreement was calculated. If the agreement remained stable, then the guideline was considered good enough to proceed with the gold standard production. Otherwise, the annotation differences were discussed; the guidelines were updated; and the process was reinitiated

**Fig. 3**
Annotation process overview. The annotation process was divided into ground-truth phases 1 and 2, which are located above and below the dashed line, respectively. Elements in green represent the annotators, and elements in orange represent the adjudicators.

**Fig. 4**
Average IAA values for the most frequent STYs. The average IAA scores for the most frequent semantic types and their corresponding semantic groups (in parentheses). The heat map indicates the highest values in blue and the lowest values in red

Cited by

Disambiguation of acronyms in clinical narratives with large language models.
Kugic A, Schulz S, Kreuzthaler M. J Am Med Inform Assoc. 2024 Sep 1;31(9):2040-2046. doi: 10.1093/jamia/ocae157. PMID: 38917444.
Cross-lingual Natural Language Processing on Limited Annotated Case/Radiology Reports in English and Japanese: Insights from the Real-MedNLP Workshop.
Yada S, Nakamura Y, Wakamiya S, Aramaki E. Methods Inf Med. 2024 Dec;63(5-06):145-163. doi: 10.1055/a-2405-2489. Epub 2024 Aug 29. PMID: 39209296.
Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey.
Shaitarova A, Zaghir J, Lavelli A, Krauthammer M, Rinaldi F. Yearb Med Inform. 2023 Aug;32(1):230-243. doi: 10.1055/s-0043-1768726. Epub 2023 Dec 26. PMID: 38147865.
Year 2022 in Medical Natural Language Processing: Availability of Language Models as a Step in the Democratization of NLP in the Biomedical Area.
Grouin C, Grabar N; Section Editors for the IMIA Yearbook Section on Natural Language Processing. Yearb Med Inform. 2023 Aug;32(1):244-252. doi: 10.1055/s-0043-1768752. Epub 2023 Dec 26. PMID: 38147866.

Grants and funding

Finance Code 001/Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
