NPJ Syst Biol Appl. 2021 Oct 20;7(1):38.
doi: 10.1038/s41540-021-00200-x.

NERO: a biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Kanix Wang et al. NPJ Syst Biol Appl.

Abstract

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades [1,2], the most dramatic advances in MR have followed in the wake of critical corpus development [3]. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet [4] was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to ease the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, automated named entity recognition (NER) extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Conflict of interest statement

B.B.K. is a founder of Dock Therapeutics, Inc. The remaining authors declare no competing interests.

Figures

Fig. 1. Named Entity Recognition Ontology (NERO).
The ontology is shown here as a multifurcating tree, with taxonomy nodes corresponding to ontology classes. The class name and the number of class mentions in the corpus are shown in parentheses next to each named entity class. Each taxonomy class is provided with a unique pictogram (black and red shapes on a yellow background) intended to simplify expert manual annotation of the corpora. In total, we annotated 35,865 sentences. These sentences encapsulated 190,679 named entities and 43,438 events connecting two or more entities. In addition to the almost two dozen more sparsely used branches (such as ExperimentalFactor and GeographicalLocation) under the NamedEntity cluster, there are three heavily represented branches in our corpus: AnatomicalPart, Chemical, and Process. Slightly more than half (51.6%) of all entities are from these three classes, with 26.6% of all entities originating from Process alone. We designed our ontology and its annotations to capture the named entities associated with research activities and facilities; these types of entities can be important for encoding methods used in scientific experiments or patient treatment. The semantic classes ResearchActivity and MedicalProcedure turn out to be the ninth and the tenth most frequent, respectively. Other top concepts related to research include Measurement, IntellectualProduct, PublishedSourceOfInformation, Facility, and MentalProcess.
Fig. 2. The relative abundance of annotated named entity classes in our corpus.
As is typically the case with human languages, semantic classes are represented unevenly in free texts, following a heavy-tail (Zipf’s) distribution. (a) In biomedical corpora, unsurprisingly, named entities associated with genes and proteins are the most prevalent (15%), followed by processes (9%), medical findings (8.8%), and chemicals (6.7%). At the low-frequency end of the named entity spectrum, we find journal names, units, citations, and languages. (b) Events connecting two or more entities are also approximately Zipf-law distributed. Event frequencies closely track the corresponding named entity classes. For example, the most frequent event, bind, is associated with the most frequent named entity, GeneOrProtein. We tried fitting the rank-ordered frequency distribution of annotated named entities with a Discrete Generalized Beta Distribution (DGBD). The result showed a significant deviation from Zipf’s law: the tail of the observed distribution was not heavy enough to match Zipf’s distribution, most likely due to the relatively small number of classes in our ontology. In other words, we expect that frequencies of semantic classes in a very large corpus, annotated with classes from a hypothetical perfect named entity ontology, would follow a Zipfian (discrete Pareto) distribution of named entity classes. Our action annotations have moved beyond interactions between proteins and genes (e.g., bind, inhibit, phosphorylate, encode) into interactions involving genetic variants and environmental factors (e.g., associated with, occur in presence of, trigger, lack).

Ambiguity levels varied broadly across the named entities captured in our corpus. For example, in the class AnatomicalPart, almost all entities (99.3%) are annotated at the most specific levels, with the majority belonging to BodyPart, CellularComponent, and Cell. In contrast, the general (most vague) concept, Chemical, turns out to be the most annotated within its cluster, although more specific subclasses, such as Protein, NucleicAcid, and Drug, are also well represented in the corpus. In the Process concept cluster, about a third of all concept instances are annotated at the more general Process level, and the rest are specific concepts, such as MedicalProcedure, MolecularProcess, ResearchActivity, and BiologicalProcess. In addition to these major clusters of concepts, several individual concepts are well represented in the corpus. For example, MedicalFinding represents 7.3% of all entities. Other well-represented concepts include Duration, IntellectualProduct, Measurement, Organism, PersonGroup, PublishedSourceOfInformation, and Quantity. In total, about 70.4% of all entities are annotated at the most specific ontology level. Five concepts in the NERO ontology allow the semantic flexibility needed to avoid arbitrary concept assignment. Entities annotated as AminoAcidOrPeptide, QuantityOrMeasurement, PublicationOrCitation, MedicalProcedureOrDevice, and GeneOrProtein account for 17.8% of all entities, while less than a quarter (23%) of entities representing either genes or proteins are cleanly annotated with class Gene or class Protein. The remainder are annotated with class GeneOrProtein.

After the action bind, actions indicating entities’ attributes are the next most frequent. Other biological relationships are also well represented in this annotation, such as inhibit, activate, mediate, interact, contain, and regulate. The top 30 action categories account for 64.4% of all actions annotated, with the top ten action categories accounting for 52.2%. Interestingly, negations of actions were also quite abundant in our annotated corpus. For example, do not bind was the sixth most frequent normalized action. Other well-represented negations of actions include do not affect and do not inhibit (see Supplementary Figs. 1–3).
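The DGBD fit mentioned in this caption can be illustrated with a minimal sketch. It assumes the commonly used two-exponent form f(r) = A (N + 1 - r)^b / r^a, where r is the rank and N the number of classes; with b = 0 this reduces to a pure power law, of which Zipf's law is the case a ≈ 1. The fitting procedure (ordinary least squares on log-frequencies), the helper names `fit_dgbd` and `solve_linear`, and the synthetic frequencies are all assumptions made for illustration. They are not the authors' code or the corpus counts.

```python
import math


def solve_linear(matrix, vector):
    """Solve a small dense linear system by Gauss-Jordan elimination with partial pivoting."""
    size = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def fit_dgbd(freqs):
    """Fit f(r) = A * (N + 1 - r)**b / r**a by least squares in log space.

    Frequencies must be positive; they are sorted in descending order so
    that rank 1 is the most frequent class. Returns (A, a, b).
    """
    ordered = sorted(freqs, reverse=True)
    n = len(ordered)
    # log f(r) = log A - a * log r + b * log(N + 1 - r)
    rows = [[1.0, -math.log(r), math.log(n + 1 - r)] for r in range(1, n + 1)]
    targets = [math.log(f) for f in ordered]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(rows, targets)) for i in range(3)]
    log_scale, a, b = solve_linear(xtx, xty)
    return math.exp(log_scale), a, b


# Synthetic rank-frequency data (NOT the corpus counts), generated from known parameters.
synthetic_freqs = [1000.0 * (30 + 1 - r) ** 0.5 / r ** 0.8 for r in range(1, 31)]
scale, a_hat, b_hat = fit_dgbd(synthetic_freqs)
print(f"A={scale:.2f}, a={a_hat:.3f}, b={b_hat:.3f}")
```

In this parameterization, a fitted b clearly above zero indicates a tail that falls off faster than a pure power law, which is the kind of deviation the caption describes.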
Fig. 3. Projection of text embedding into three-dimensional space.
Properties of diseases and drugs are visible in the first three principal components of our multi-dimensional text embedding. The figure shows a projection of the text embedding into three-dimensional space, with named entities corresponding to diseases and drugs shown as prisms and spheres, respectively. The figure presents several views of the same embedding, preserving spatial layout and projection, with distinct elements of the embedding indicated by shape color. The central image shows all disease systems and their corresponding medications together. The additional views show: (a) Zollinger–Ellison syndrome and associated medications; (b) cancers and associated therapies; (c) central nervous system diseases and corresponding medications; (d, e) viral and bacterial infectious diseases, respectively, together with corresponding antiviral and antibiotic agents; and (f) a three-dimensional projection of the embedding of drug- and disease-related named entities corresponding to CNS/psychiatric (red), digestive (yellow), infectious/immune (green), neoplastic (cyan), and other diseases (grey). Another view of the same dataset is presented in Fig. 4.
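As a rough illustration of projecting an embedding onto its first three principal components, the sketch below centers a set of vectors, forms the sample covariance matrix, and extracts the leading directions by power iteration with deflation. The function name `top_principal_components` and the toy vectors are hypothetical. The caption does not say which PCA implementation the authors used, so this should be read as one possible approach under those assumptions.

```python
import math
import random


def top_principal_components(vectors, k=3, iterations=500, seed=0):
    """Return (components, projections) for the top-k principal components.

    Uses power iteration on the sample covariance matrix, removing the
    directions already found (deflation) at every step.
    """
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - means[j] for j in range(dim)] for v in vectors]
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iterations):
            # Multiply by the covariance matrix.
            vec = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
            # Keep the vector orthogonal to components found so far.
            for comp in components:
                dot = sum(x * y for x, y in zip(vec, comp))
                vec = [x - dot * y for x, y in zip(vec, comp)]
            norm = math.sqrt(sum(x * x for x in vec))
            vec = [x / norm for x in vec]
        components.append(vec)
    projections = [[sum(x * y for x, y in zip(row, comp)) for comp in components]
                   for row in centered]
    return components, projections


# Toy 4-dimensional "embedding" (hypothetical values, not from the paper).
toy_vectors = [
    [4.0, 0.0, 0.0, 0.0], [-4.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0], [0.0, -2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 0.5], [0.0, 0.0, 0.0, -0.5],
]
components, projections = top_principal_components(toy_vectors, k=3)
for point in projections[:2]:
    print([round(x, 3) for x in point])
```

The sign of each component is arbitrary, so coordinates may be mirrored between runs with different seeds. The sketch also assumes the data has at least k directions with non-zero variance.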
Fig. 4. Two-dimensional projections of diseases and medications.
Left: We projected diseases into two dimensions: female-male (X-axis) and severe-mild (Y-axis). We defined the “male–female” axis using the following pairs of terms: (“male,” “female”), (“prostate,” “ovary”), (“penile,” “uterine”), (“penis,” “uterus”), (“man,” “woman”), (“men,” “women”), (“masculine,” “feminine”), (“he,” “she”), (“him,” “her”), (“his,” “hers”), (“boy,” “girl”), and (“boys,” “girls”). We defined the severe-mild axis with the following term pairs: (“harmful,” “beneficial”), (“serious,” “benign”), (“life-altering,” “common”), (“disruptive,” “undisruptive”), (“dying,” “recovering”), (“dangerous,” “safe”), (“threatening,” “low-priority”), (“high mortality,” “harmless”), (“costly,” “cheap”), (“hospitalized,” “self-administered”), (“hospital,” “work”), (“debt,” “savings”), (“low quality of life,” “undisruptive”), and (“hazard,” “routine”). Right: We projected medications into “benign-toxic” (X-axis) and “cheap-costly” (Y-axis) dimensions. For the “benign-toxic” axis, we used the following antonym pairs: (“harmful,” “beneficial”), (“toxic,” “nontoxic”), and (“noxious,” “benign”). We defined the “cheap-costly” (expensive–inexpensive) axis using the following pairs of terms: (“expensive,” “inexpensive”), (“costly,” “cheap”), (“brand,” “generic”), and (“patented,” “off-patent”).
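One common way to turn antonym pairs like those above into a semantic axis is to average the difference vectors of each pair and then take the dot product of any term's embedding with the normalized result. The sketch below follows that recipe. The caption does not state the exact construction the authors used, so the averaging step, the function names `semantic_axis` and `project`, and the two-dimensional toy vectors (including the placeholder terms `drug_a` and `drug_b`) are assumptions for illustration only.

```python
import math


def semantic_axis(vectors, pairs):
    """Average the difference vectors (first - second) of each pair, then normalize.

    Pairs with a term missing from the vocabulary are skipped.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    used = 0
    for first, second in pairs:
        if first in vectors and second in vectors:
            total = [t + a - b
                     for t, a, b in zip(total, vectors[first], vectors[second])]
            used += 1
    if used == 0:
        raise ValueError("no pair has both terms in the vocabulary")
    mean = [t / used for t in total]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


def project(vectors, term, axis):
    """Coordinate of a term along the axis (dot product with the unit axis vector)."""
    return sum(a * b for a, b in zip(vectors[term], axis))


# Hypothetical 2-dimensional embeddings, chosen only to make the arithmetic easy to follow.
toy_embedding = {
    "expensive": [1.0, 0.2], "inexpensive": [-1.0, 0.2],
    "costly": [0.8, -0.1], "cheap": [-0.8, -0.1],
    "drug_a": [0.5, 0.3], "drug_b": [-0.4, 0.1],
}
cost_pairs = [("expensive", "inexpensive"), ("costly", "cheap"),
              ("brand", "generic")]  # the last pair is absent from the toy vocabulary
cost_axis = semantic_axis(toy_embedding, cost_pairs)
for name in ("drug_a", "drug_b"):
    print(name, round(project(toy_embedding, name, cost_axis), 3))
```

With this orientation, positive coordinates lie toward the first term of each pair and negative coordinates toward the second. Repeating the procedure with a second set of pairs gives the other axis of a two-dimensional plot.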

References

    1. Banko, M. & Brill, E. Scaling to very very large corpora for natural language disambiguation. In Proc. 39th Annual Meeting on Association for Computational Linguistics 26–33 (Association for Computational Linguistics, 2001).
    2. Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. IEEE Intell. Syst. 24, 8–12 (2009). doi: 10.1109/MIS.2009.36.
    3. Dogan, R. I., Leaman, R. & Lu, Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. Biomed. Inf. 47, 1–10 (2014). doi: 10.1016/j.jbi.2013.12.006.
    4. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
    5. Wijffels, J. & Okazaki, N. crfsuite: Conditional Random Fields for Labelling Sequential Data in Natural Language Processing based on CRFsuite: a fast implementation of Conditional Random Fields (CRFs). https://github.com/bnosac/crfsuite (2007–2018).
