BMC Bioinformatics. 2018 Jun 28;19(1):248.

doi: 10.1186/s12859-018-2211-5.

FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining

John A Bachman¹, Benjamin M Gyori¹, Peter K Sorger²

Affiliations

¹ Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, 02115, USA.
² Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Ave, Boston, MA, 02115, USA. peter_sorger@hms.harvard.edu.

PMID: 29954318
PMCID: PMC6022344
DOI: 10.1186/s12859-018-2211-5


John A Bachman et al. BMC Bioinformatics. 2018.


Abstract

Background: For automated reading of scientific publications to extract useful information about molecular mechanisms, it is critical that genes, proteins and other entities be correctly associated with uniform identifiers, a process known as named entity linking or "grounding." Correct grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. The accuracy of this process depends largely on the availability of machine-readable resources that associate synonyms and abbreviations commonly found in biomedical literature with uniform identifiers.

Results: In a task involving automated reading of ∼215,000 articles using the REACH event extraction software, we found that grounding was disproportionately inaccurate for multi-protein families (e.g., "AKT") and complexes with multiple subunits (e.g., "NF-κB"). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15% to 71%). The hierarchical organization of entities in FamPlex also made it possible to integrate otherwise unconnected mechanistic information across families, subfamilies, and individual proteins. Applications of FamPlex to the TRIPS/DRUM reading system and the BioCreative VI Bioentity Normalization Task dataset demonstrated the utility of FamPlex in other settings.

Conclusion: FamPlex is an effective resource for improving named entity recognition, grounding, and relationship resolution in automated reading of biomedical text. The content in FamPlex is available in both tabular and Open Biomedical Ontology formats at https://github.com/sorgerlab/famplex under the Creative Commons CC0 license and has been integrated into the TRIPS/DRUM and REACH reading systems.

Keywords: Biocuration; Event extraction; Grounding; Named entity linking; Named entity recognition; Natural language processing; Protein families; Text mining.
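The grounding step described in the abstract can be pictured as a lookup from normalized text strings to identifiers. The following is a minimal sketch under stated assumptions, not the REACH or FamPlex implementation: the map entries, the `normalize` rules, and the function names are hypothetical and chosen only for illustration.

```python
# Toy grounding map: normalized text string -> (namespace, identifier).
# Entries are illustrative assumptions, not copied from grounding_map.csv.
TOY_GROUNDING_MAP = {
    "akt": ("FPLX", "AKT"),
    "akt1": ("HGNC", "AKT1"),
    "nf-kappab": ("FPLX", "NFkappaB"),
}


def normalize(text):
    """Lowercase, collapse whitespace, spell out kappa, and close up 'NF- kB' style gaps."""
    cleaned = " ".join(text.split()).lower()
    return cleaned.replace("κ", "kappa").replace("- ", "-")


def ground(text, grounding_map=TOY_GROUNDING_MAP):
    """Return (namespace, identifier) for a text string, or None if it is ungrounded."""
    return grounding_map.get(normalize(text))


if __name__ == "__main__":
    for mention in ["AKT", "NF-κB", "NF- κB", "mystery protein"]:
        print(mention, "->", ground(mention))
```

A family-level string such as "AKT" resolves to a family identifier rather than being forced onto a single gene, which is the kind of error the abstract reports for families and complexes.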


Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
FamPlex links named entities to protein families and complexes and their constituents. (a) Structure of FamPlex content. The affixes in gene_prefixes.csv can be used to improve recognition of molecular entity names, which can be linked to database identifiers using the lexical synonyms in grounding_map.csv. FamPlex itself contains identifiers representing families and complexes, which are mapped to corresponding identifiers in other databases in equivalences.csv. Hierarchical relationships among families, complexes, and genes are listed in relations.csv. (b) Workflow for curation and evaluation. A gene list was used to define a corpus of articles that was divided into two subsets, “training” and “test”. The “training” corpus was processed with REACH and results were evaluated and used to guide curation. The “test” corpus was processed after incorporation of FamPlex and results were compared against the baseline from the training corpus.

**Fig. 2**
FamPlex links identifiers for families and complexes to members, other databases, and lexical synonyms. (a) FamPlex uses *isa* and *partof* predicates to represent the hierarchical relationships between specific genes, families and complexes. Lexical synonyms can be associated with entities at each level. (b) Mappings of FamPlex identifiers to outside databases. (c) Number of lexical synonyms curated for FamPlex identifiers in the grounding map.

**Fig. 3**
FamPlex improves grounding accuracy. (a) Cumulative occurrences of ungrounded entities by frequency of the entity text. Deviation from the dotted gray line, representing a uniform frequency distribution, indicates the extent to which a small number of frequently occurring entities account for a disproportionate share of missed groundings. (b) Improvements in grounding accuracy for proteins/genes and families/complexes, with and without the use of FamPlex. (c) Reduction in the proportion of extracted events containing ungrounded entities, with and without FamPlex. (d) Number of groundings to FamPlex identifiers in the test corpus. The 15 most frequent identifiers account for 50% of all groundings and are shown in blue.

**Fig. 4**
FamPlex facilitates hierarchical resolution of extracted information. (a) Hierarchical organization of the phospholipase C protein family (FamPlex identifier PLC) along with the proportion of occurrences of each member in the test corpus and examples of sentences yielding information at the different levels. Pink nodes indicate FamPlex families; gray nodes indicate genes. (b) Proportion of groundings in the test corpus to gene-level, intermediate-level, or top-level entities for five multi-level families/complexes in FamPlex.
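The *isa* and *partof* predicates mentioned in the Fig. 2 caption lend themselves to a simple graph traversal. The sketch below assumes a hypothetical in-memory list of (child, predicate, parent) triples; the triples and function names are illustrative and are not read from relations.csv, whose exact column layout is not described in this record.

```python
from collections import defaultdict

# Toy (child, predicate, parent) triples; an assumed layout for illustration only.
TOY_RELATIONS = [
    ("HGNC:AKT1", "isa", "FPLX:AKT"),
    ("HGNC:AKT2", "isa", "FPLX:AKT"),
    ("HGNC:AKT3", "isa", "FPLX:AKT"),
    ("HGNC:PRKAA1", "isa", "FPLX:AMPK_alpha"),
    ("HGNC:PRKAA2", "isa", "FPLX:AMPK_alpha"),
    ("FPLX:AMPK_alpha", "partof", "FPLX:AMPK"),
]


def build_children(relations):
    """Index the triples so each parent maps to its direct children."""
    children = defaultdict(set)
    for child, predicate, parent in relations:
        if predicate in ("isa", "partof"):
            children[parent].add(child)
    return children


def gene_members(entity, children):
    """Return the leaf-level descendants of an entity (assumes an acyclic hierarchy)."""
    direct = children.get(entity, set())
    if not direct:
        return {entity}  # No children: treat the entity itself as a leaf.
    leaves = set()
    for child in direct:
        leaves |= gene_members(child, children)
    return leaves


if __name__ == "__main__":
    index = build_children(TOY_RELATIONS)
    print(sorted(gene_members("FPLX:AMPK", index)))
```

Walking the hierarchy this way lets information extracted about a family or complex be related to statements about its individual genes, which is the multi-level resolution the Fig. 4 caption describes.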


