Multicenter Study
J Biomed Semantics. 2022 May 8;13(1):13.
doi: 10.1186/s13326-022-00269-1.

SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks

Lucas Emanuel Silva E Oliveira et al. J Biomed Semantics.

Abstract

Background: The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field.

Methods: In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations.

Results: This study resulted in SemClinBr, a corpus of 1000 clinical notes labeled with 65,117 entities and 11,263 relations. In addition, dictionaries of negation cues and of medical abbreviations were generated from the annotations. The average inter-annotator agreement score varied from 0.71 (applying a strict match) to 0.92 (applying a relaxed match that accepts partial overlaps and hierarchically related semantic types). The extrinsic evaluation, which applied the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of the annotations, with the systems achieving results that were consistent with the agreement scores.
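
The difference between the strict and relaxed agreement scores can be illustrated with a small sketch. This is not the authors' evaluation code: the pairwise F1-style agreement formula, the `(start, end, type)` entity layout, the function names, and the tiny `PARENT` table are all assumptions made for illustration, and the character offsets are invented. A strict match requires identical span and semantic type; the relaxed match here accepts overlapping spans whose types are equal or directly parent/child related.

```python
from typing import Dict, List, Tuple

# (start offset, end offset exclusive, semantic type); layout is an assumption
Entity = Tuple[int, int, str]

# Hypothetical child -> parent links standing in for a semantic type hierarchy.
PARENT: Dict[str, str] = {"Sign or Symptom": "Finding"}


def related(t1: str, t2: str) -> bool:
    """Types agree if equal or directly parent/child in PARENT."""
    return t1 == t2 or PARENT.get(t1) == t2 or PARENT.get(t2) == t1


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two character spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def is_match(a: Entity, b: Entity, relaxed: bool) -> bool:
    if relaxed:
        return overlaps(a, b) and related(a[2], b[2])
    return a == b  # strict: same span and same type


def agreement(ann_a: List[Entity], ann_b: List[Entity], relaxed: bool = False) -> float:
    """Pairwise F1-style agreement: 2 * matches / (|A| + |B|)."""
    unmatched = list(ann_b)
    matches = 0
    for a in ann_a:
        for b in unmatched:
            if is_match(a, b, relaxed):
                unmatched.remove(b)  # each annotation is matched at most once
                matches += 1
                break
    total = len(ann_a) + len(ann_b)
    return 2 * matches / total if total else 1.0


# Two annotators labeling the same (invented) note.
ann_a = [
    (0, 8, "Sign or Symptom"),
    (15, 24, "Pharmacologic Substance"),
    (30, 40, "Body Part, Organ, or Organ Component"),
]
ann_b = [
    (0, 8, "Finding"),
    (15, 24, "Pharmacologic Substance"),
    (32, 40, "Body Part, Organ, or Organ Component"),
]

strict_score = agreement(ann_a, ann_b)
relaxed_score = agreement(ann_a, ann_b, relaxed=True)
print(round(strict_score, 2), round(relaxed_score, 2))
```

In this toy case only one of the three annotation pairs survives the strict criterion, while all three count under the relaxed one, which mirrors why a relaxed score is expected to be higher than a strict score on the same annotations.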

Conclusion: The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.

Keywords: Clinical narratives; Corpora; Gold standard; Natural language processing; Semantic annotation.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
A broad view of SemClinBr corpus development. The diagram shows the selection of thousands of clinical notes from multiple hospitals and medical specialties. A multidisciplinary team developed the elements in orange, representing (i) the fine-grained annotation schema following the UMLS semantic types and (ii) the web-based annotation tool featuring the UMLS REST API. These resources supported the generation of the ground truth (i.e., gold standard), which was evaluated intrinsically (i.e., inter-annotator agreement) and extrinsically in two different NLP tasks (i.e., named entity recognition and negation detection).
Fig. 2
Revision and quality verification process of the annotation guidelines. The iterative process started with the first guideline draft; then, a small number of documents were double-annotated, and their inter-annotator agreement was calculated. If the agreement remained stable, then the guideline was considered good enough to proceed with the gold standard production. Otherwise, the annotation differences were discussed; the guidelines were updated; and the process was reinitiated
Fig. 3
Annotation process overview. The annotation process was divided into ground-truth phases 1 and 2, which are located above and below the dashed line, respectively. Elements in green represent the annotators, and elements in orange represent the adjudicators.
Fig. 4
Average inter-annotator agreement (IAA) values for the most frequent semantic types (STYs). The average IAA scores are shown for the most frequent semantic types and their corresponding semantic groups (in parentheses). The heat map indicates the highest values in blue and the lowest values in red.
