NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Kanix Wang¹,², Robert Stevens³, Halima Alachram⁴, Yu Li⁵, Larisa Soldatova⁶, Ross King⁷,⁸,⁹, Sophia Ananiadou³,¹⁰, Annika M Schoene³,¹⁰, Maolin Li³,¹⁰, Fenia Christopoulou³,¹⁰, José Luis Ambite¹¹, Joel Matthew¹¹, Sahil Garg¹¹, Ulf Hermjakob¹¹, Daniel Marcu¹¹, Emily Sheng¹¹, Tim Beißbarth⁴, Edgar Wingender¹², Aram Galstyan¹¹, Xin Gao⁵, Brendan Chambers¹³, Weidi Pan¹⁴, Bohdan B Khomtchouk¹⁵,¹⁶, James A Evans¹⁷, Andrey Rzhetsky¹⁸,¹⁹,²⁰,²¹

Affiliations

¹ The Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, 60637, US.
² The Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL, 60637, US.
³ Department of Computer Science, University of Manchester, M13 9PL, Manchester, UK.
⁴ Institute of Medical Bioinformatics, University of Göttingen, Goldschmidtstrasse 1, 37077, Göttingen, Germany.
⁵ Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division King Abdullah University of Science and Technology (KAUST) Thuwal, Thuwal, 23955, Saudi Arabia.
⁶ Goldsmiths, University of London, 8 Lewisham Way, New Cross, London, SE14 6NW, UK.
⁷ Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Dr, Cambridge, CB3 0AS, United Kingdom.
⁸ Alan Turing Institute, 96 Euston Rd, Somers Town, London, NW1 2DB, United Kingdom.
⁹ Department of Biology and Biological Engineering, Chalmers University of Technology, SE-412 96, Göteborg, Sweden.
¹⁰ National Centre for Text Mining, University of Manchester, M1 7DN, Manchester, UK.
¹¹ The Information Sciences Institute, University of Southern California, Marina del Rey, CA, 90089, US.
¹² geneXplain GmbH, Am Exer19b, 38302, Wolfenbüttel, Germany.
¹³ Knowledge Lab, Department of Sociology, University of Chicago, Chicago, IL, 60637, US.
¹⁴ Master of Science in Statistics Program, University of Chicago, Chicago, IL, 60637, US.
¹⁵ The Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL, 60637, US. bohdan@uchicago.edu.
¹⁶ Department of Medicine, University of Chicago, Chicago, IL, 60637, US. bohdan@uchicago.edu.
¹⁷ Knowledge Lab, Department of Sociology, University of Chicago, Chicago, IL, 60637, US. jevans@uchicago.edu.
¹⁸ The Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.
¹⁹ The Institute of Genomics and Systems Biology, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.
²⁰ Department of Medicine, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.
²¹ Department of Human Genetics, University of Chicago, Chicago, IL, 60637, US. andrey.rzhetsky@uchicago.edu.

PMID: 34671039
PMCID: PMC8528865
DOI: 10.1038/s41540-021-00200-x

Kanix Wang et al. NPJ Syst Biol Appl. 2021 Oct 20;7(1):38. doi: 10.1038/s41540-021-00200-x.

Abstract

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades¹,², the most dramatic advances in MR have followed in the wake of critical corpus development³. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems, in the same way that ImageNet⁴ was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, automated named entity recognition (NER) extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Conflict of interest statement

B.B.K. is a founder of Dock Therapeutics, Inc. The remaining authors declare no competing interests.

Figures

**Fig. 1. Named Entity Recognition Ontology (NERO).**
The ontology is shown here as a multifurcating tree, with taxonomy nodes corresponding to ontology classes. The class name and the number of class mentions in the corpus are shown in parentheses next to each named entity class. Each taxonomy class is provided with a unique pictogram (black and red shapes on yellow background) intended to simplify expert manual annotation of the corpora. In total, we annotated 35,865 sentences. These sentences encapsulated 190,679 named entities and 43,438 events connecting two or more entities. In addition to the almost two dozen more sparsely used branches (such as *ExperimentalFactor* and *GeographicalLocation*) under the *NamedEntity* cluster, there are three heavily represented branches in our corpus: *AnatomicalPart*, *Chemical*, and *Process*. Slightly more than half (51.6%) of all entities are from these three classes, with 26.6% of all entities originating from *Process* alone. We designed our ontology and its annotations to capture the named entities associated with research activities and facilities; these types of entities can be important for encoding methods used in scientific experiments or patient treatment. The semantic classes *ResearchActivity* and *MedicalProcedures* turn out to be the ninth and the tenth most frequent, respectively. Other top concepts related to the research include *Measurement*, *IntellectualProducts*, *PublishedSourceofInformation*, *Facility*, and *MentalProcess*.

**Fig. 2. The relative abundance of annotated named entity classes in our corpus.**
As is typically the case with human languages, semantic classes are represented unevenly in free texts, following a heavy-tailed (Zipf’s) distribution. (a) In biomedical corpora, unsurprisingly, named entities associated with *genes* and *proteins* are the most prevalent (15%), followed by *processes* (9%), *medical findings* (8.8%), and *chemicals* (6.7%). At the low-frequency end of the named entity spectrum, we find *journal names*, *units*, *citations*, and *languages*. (b) Events connecting two or more entities are also approximately Zipf-law distributed. Event frequencies closely track the corresponding named entity classes. For example, the most frequent event, *bind*, is associated with the most frequent named entity, *GeneOrProtein*. We tried fitting the rank-ordered frequency distribution of annotated named entities with a Discrete Generalized Beta Distribution (DGBD). The result showed a significant deviation from Zipf’s law: the observed distribution’s tail was not heavy enough to match Zipf’s distribution, most likely due to the relatively small number of classes in our ontology. In other words, we expect that frequencies of semantic classes in a very large corpus, annotated with classes from a hypothetical perfect named entity ontology, would follow a Zipfian (discrete Pareto) distribution of named entity classes. Our action annotations have moved beyond interactions between proteins and genes (e.g., *bind*, *inhibit*, *phosphorylate*, *encode*), into interactions involving genetic variants and environmental factors (e.g., *associated with*, *occur in presence of*, *trigger*, *lack*). Ambiguity levels varied broadly across the named entities captured in our corpus. For example, in the class *AnatomicalPart*, almost all entities (99.3%) are annotated at the most specific levels, with the majority of entities belonging to *BodyPart*, *CellularComponent*, and *Cell*.
In contrast, the general (most vague) concept, *Chemical*, turns out to be the most annotated within its cluster, although more specific subclasses, such as *Protein*, *NucleicAcid*, and *Drug*, are also well represented in the corpus. In the *Process* concept cluster, about a third of all concept instances are annotated at the more general *Process* level, and the rest are specific concepts, such as *MedicalProcedure*, *MolecularProcess*, *ResearchActivity*, and *BiologicalProcess*. In addition to these major clusters of concepts, several individual concepts are well represented in the corpus. For example, *MedicalFinding* represents 7.3% of all entities. Other well-represented concepts include *Duration*, *IntellectualProduct*, *Measurement*, *Organism*, *PersonGroup*, *PublishedSourceOfInformation*, and *Quantity*. In total, about 70.4% of all entities are annotated at the most specific ontology level. There are five concepts in the NERO ontology that allow the semantic flexibility needed to avoid arbitrary concept assignment. Entities annotated as *AminoAcidOrPeptide*, *QuantityOrMeasurement*, *PublicationOrCitation*, *MedicalProcedureOrDevice*, and *GeneOrProtein* account for 17.8% of all entities, while less than a quarter (23%) of entities representing either genes or proteins are cleanly annotated with class *Gene* or class *Protein*. The remainder are annotated with class *GeneOrProtein*. In addition to the action *bind*, actions indicating entities’ attributes are the next most frequent. Other biological relationships are also well represented in this annotation, such as *inhibit*, *activate*, *mediate*, *interact*, *contain*, and *regulate*. The top 30 action categories account for 64.4% of all actions annotated, with the top ten action categories accounting for 52.2%. Interestingly, negations of actions were also quite abundant in our annotated corpus. For example, *do not bind* was the sixth most frequent normalized action. Other well-represented negations of actions include *do not affect* and *do not inhibit* (see Supplementary Figs. 1–3).
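
The DGBD comparison described in this caption can be illustrated with a small sketch. It assumes the commonly used rank-frequency form f(r) = A (N + 1 - r)^b / r^a, where r is the rank and N the number of classes; pure Zipf behaviour corresponds to b = 0, and b > 0 bends the tail downward. The log-space least-squares fit and the function names `dgbd`, `solve3`, and `fit_dgbd` are illustrative assumptions, not the authors' fitting procedure.

```python
import math


def dgbd(rank, n_classes, A, a, b):
    """Assumed DGBD rank-frequency form: f(r) = A * (N + 1 - r)**b / r**a."""
    return A * (n_classes + 1 - rank) ** b / rank ** a


def solve3(M, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [rhs] for row, rhs in zip(M, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def fit_dgbd(freqs):
    """Fit (A, a, b) by ordinary least squares in log space.

    Uses log f = log A + b * log(N + 1 - r) - a * log r.
    All frequencies must be strictly positive.
    """
    freqs = sorted(freqs, reverse=True)  # rank 1 = most frequent class
    n = len(freqs)
    rows = [(1.0, math.log(n + 1 - r), -math.log(r)) for r in range(1, n + 1)]
    y = [math.log(f) for f in freqs]
    # Normal equations: (X^T X) beta = X^T y
    M = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    v = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(3)]
    log_A, b, a = solve3(M, v)
    return math.exp(log_A), a, b
```

Calling `fit_dgbd` on a list of per-class mention counts returns the three parameters; a fitted b clearly above zero would be one way to express a tail that is lighter than Zipf's law predicts.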

**Fig. 3. Projection of text embedding into three-dimensional space.**
Properties of diseases and drugs are visible in the first three principal components of our multi-dimensional text embedding. The figure shows a projection of text embedding into three-dimensional space, with named entities corresponding to diseases and drugs shown with prisms and spheres, respectively. The figure represents several projections of the same embedding, preserving spatial layout and projection, with distinct elements of the embedding indicated by shape color. The central image shows all disease systems and their corresponding medications together. More specifically, the additional projections show: (a) Zollinger–Ellison syndrome and associated medications; (b) cancers and associated therapies; (c) central nervous system diseases and corresponding medications; (d, e) viral and bacterial infectious diseases, respectively, together with corresponding antiviral and antibiotic agents; and (f) a three-dimensional projection of the embedding of drug- and disease-related named entities corresponding to CNS/psychiatric (red), digestive (yellow), infectious/immune (green), neoplastic (cyan), and other diseases (grey). Another view of the same dataset is presented in Fig. 4.
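
As a rough illustration of projecting an embedding onto its first three principal components, the sketch below uses plain power iteration with deflation on the covariance matrix. It is a minimal stand-in under stated assumptions: the function names (`center`, `covariance`, `top_component`, `pca_project`) are hypothetical, the input is any list of equal-length vectors, and nothing here reproduces the authors' embedding model or plotting pipeline.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every vector."""
    dim = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered vectors."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def top_component(cov, iters=500, seed=0):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in cov]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iters):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(c * x for c, x in zip(row, v))
                     for vi, row in zip(v, cov))
    return v, eigenvalue


def pca_project(rows, k=3):
    """Return the top-k principal directions and the k-dimensional coordinates."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(k):
        v, lam = top_component(cov)
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    coords = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, coords
```

Given a list of entity vectors, `pca_project(vectors, k=3)` yields one three-coordinate point per entity, which could then be drawn with any 3D plotting tool. For embeddings of realistic size, a numerical library would be a more practical choice than this pure-Python loop.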

**Fig. 4. Two-dimensional projections of diseases and medications.**
Left: We projected diseases into two dimensions: female-male (X-axis) and severe-mild (Y-axis). We defined the “male–female” axis using the following pairs of terms: (“male,” “female”), (“prostate,” “ovary”), (“penile,” “uterine”), (“penis,” “uterus”), (“man,” “woman”), (“men,” “women”), (“masculine,” “feminine”), (“he,” “she”), (“him,” “her”), (“his,” “hers”), (“boy,” “girl”), and (“boys,” “girls”). We defined the severe-mild axis with the following term pairs: (“harmful,” “beneficial”), (“serious,” “benign”), (“life-altering,” “common”), (“disruptive,” “undisruptive”), (“dying,” “recovering”), (“dangerous,” “safe”), (“threatening,” “low-priority”), (“high mortality,” “harmless”), (“costly,” “cheap”), (“hospitalized,” “self-administered”), (“hospital,” “work”), (“debt,” “savings”), (“low quality of life,” “undisruptive”), and (“hazard,” “routine”). Right: We projected medications into “benign-toxic” (X-axis) and “cheap-costly” (Y-axis). For the “benign-toxic” axis, we used the following pairs of antonym words: (“harmful,” “beneficial”), (“toxic,” “nontoxic”), and (“noxious,” “benign”). We defined the “expensive–inexpensive” dimension using the following pairs of terms: (“expensive,” “inexpensive”), (“costly,” “cheap”), (“brand,” “generic”), and (“patented,” “off-patent”).
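
The caption lists the term pairs but not the exact arithmetic used to turn them into an axis. One common construction, assumed here purely for illustration, averages the difference vectors of each pair and scores an entity by its cosine similarity with that averaged direction. The names `build_axis` and `project`, and the tiny two-dimensional embedding table in the usage note, are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def build_axis(embeddings, pairs):
    """Average the (first - second) difference vectors of the term pairs.

    Pairs with a term missing from the embedding table are skipped.
    Returns a unit-length axis vector.
    """
    diffs = [[a - b for a, b in zip(embeddings[p], embeddings[q])]
             for p, q in pairs if p in embeddings and q in embeddings]
    if not diffs:
        raise ValueError("no term pair is covered by the embedding table")
    dim = len(diffs[0])
    mean = [sum(d[j] for d in diffs) / len(diffs) for j in range(dim)]
    length = norm(mean)
    return [x / length for x in mean]


def project(embeddings, term, axis):
    """Cosine similarity between a term vector and the axis (range -1 to 1)."""
    vec = embeddings[term]
    return dot(vec, axis) / norm(vec)
```

With a toy table such as `{"male": [1.0, 0.0], "female": [-1.0, 0.0], "disease_a": [0.5, 0.5]}`, `build_axis(table, [("male", "female")])` points along the first coordinate, and `project(table, "disease_a", axis)` is positive, placing that entity toward the first term of each pair. Two such axes give the X and Y coordinates of a two-dimensional plot.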

References

1. Banko, M. & Brill, E. Scaling to very very large corpora for natural language disambiguation. In Proc. 39th Annual Meeting on Association for Computational Linguistics 26–33 (Association for Computational Linguistics, 2001).
2. Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intell. Syst. 2009;24:8–12. doi: 10.1109/MIS.2009.36.
3. Dogan RI, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inf. 2014;47:1–10. doi: 10.1016/j.jbi.2013.12.006.
4. Deng, J. et al. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
5. Wijffels, J. & Okazaki, N. crfsuite: Conditional Random Fields for Labelling Sequential Data in Natural Language Processing based on CRFsuite: a fast implementation of Conditional Random Fields (CRFs) https://github.com/bnosac/crfsuite (2007–2018).
