BMC Med Inform Decis Mak. 2020 Dec 14;20(Suppl 4):314.
doi: 10.1186/s12911-020-01341-5.

KGen: a knowledge graph generator from biomedical scientific literature

Anderson Rossanez et al. BMC Med Inform Decis Mak. 2020.

Abstract

Background: Knowledge is often produced from data generated in scientific investigations. An ever-growing number of scientific studies in several domains result in a massive amount of data, from which obtaining new knowledge requires computational help. Alzheimer's Disease, for example, is a life-threatening degenerative disease that is not yet curable. As the scientific community strives to better understand it and find a cure, large amounts of data have been generated, from which new knowledge can be produced. A proper representation of such knowledge brings great benefits to researchers, to the scientific community, and consequently, to society.

Methods: In this article, we study and evaluate a semi-automatic method that generates knowledge graphs (KGs) from biomedical texts in the scientific literature. Our solution explores natural language processing techniques to extract knowledge from the scientific literature and represent it as KGs. Our method links entities and relations represented in KGs to concepts from existing biomedical ontologies available on the Web. We demonstrate the effectiveness of our method by generating KGs from unstructured texts obtained from a set of abstracts taken from scientific papers on Alzheimer's Disease. We involved physicians to compare the triples our method extracted with those they extracted manually by analyzing the abstracts. The evaluation also included a qualitative analysis, by the physicians, of the KGs generated with our software tool.
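To make the notion of an ontology-linked KG concrete, the following is a minimal sketch in Python, not the KGen implementation. The names `Triple` and `LinkedGraph`, the hand-written triple, and the `example.org` concept IRI are all hypothetical and chosen only for illustration; the sketch simply stores subject-predicate-object triples alongside a mapping from terms to ontology concepts and reports which terms still lack a link.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    """One subject-predicate-object statement extracted from text."""
    subject: str
    predicate: str
    obj: str


@dataclass
class LinkedGraph:
    """A set of triples plus links from terms to ontology concept IRIs."""
    triples: set = field(default_factory=set)
    links: dict = field(default_factory=dict)  # term -> concept IRI

    def add_triple(self, subject, predicate, obj):
        self.triples.add(Triple(subject, predicate, obj))

    def add_link(self, term, concept_iri):
        self.links[term] = concept_iri

    def terms(self):
        """Every entity and relation label that appears in some triple."""
        found = set()
        for t in self.triples:
            found.update((t.subject, t.predicate, t.obj))
        return found

    def unlinked_terms(self):
        """Terms with no ontology link yet, in alphabetical order."""
        return sorted(self.terms() - set(self.links))


kg = LinkedGraph()
# Hand-written triple for illustration only; not actual KGen output.
kg.add_triple("this study", "confirms", "poststroke cognitive impairment")
# Placeholder IRI (assumption); a real link would point to a biomedical ontology.
kg.add_link("poststroke cognitive impairment",
            "http://example.org/ontology/CONCEPT_0001")
print(kg.unlinked_terms())
```

Keeping the links separate from the triples mirrors the description above, in which triples and ontology links are produced independently and only combined when the graph is generated.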

Results: The experimental results indicate the quality of the generated KGs. The proposed method extracts a large number of triples, showing the effectiveness of the rule-based method employed to identify relations in texts. In addition, ontology links are successfully obtained, which demonstrates the effectiveness of the ontology linking method proposed in this investigation.

Conclusions: We demonstrate that our proposal is effective in building ontology-linked KGs representing the knowledge obtained from biomedical scientific texts. Such a representation can add value to research in various domains, enabling researchers to compare the occurrence of concepts across different studies. The generated KGs may pave the way for researchers to propose new theories based on data analysis and advance the state of the art in their research domains.

Keywords: Information Extraction; Knowledge Graphs; Ontologies; RDF Triples.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
KGen (knowledge graph generation) pipeline. The unstructured text (input) goes through four key steps. An ontology-linked knowledge graph is generated at the end
Fig. 2
The first key step: preprocessing. The unstructured text (input) goes through four sub-steps. A preprocessed text is generated as output
Fig. 3
A parse tree. Tokens are seen at the bottom (leaves), with their corresponding parts of speech right above. The root level denotes the sentence, and the intermediary levels denote the phrases
Fig. 4
Preprocessing step’s input and output
Fig. 5
The second key step: triples extraction. The preprocessed text (input) goes through two sub-steps, generating a set of triples as output
Fig. 6
Algorithm for extracting the main triples
Fig. 7
Dependency parsing output. At the bottom are the sentence tokens, with their corresponding parts of speech on top. The arrows show the labeled dependencies between the tokens
Fig. 8
Algorithm for extracting the secondary triples
Fig. 9
Triples extraction step’s input and output
Fig. 10
The third key step: ontology linking. The preprocessed text (input) goes through three sub-steps. A set of ontology links is generated as output
Fig. 11
SPARQL query example for mapping UMLS CUIs to the final ontology
Fig. 12
Algorithm for ontology linking
Fig. 13
Ontology linking step’s output
Fig. 14
The final key step: graph generation. The sets of triples and links (inputs) go through two sub-steps before generating an ontology-linked knowledge graph as output
Fig. 15
Graphical representation. Ontology-linked knowledge graph generated from the following sentence: This study confirms the high prevalence of poststroke cognitive impairment in diverse populations.
Fig. 16
Implemented tool architecture. The four key KGen steps are implemented in four components, seen in the central portion. In the lower portion there are third-party components. In the upper portion, there are wrappers for external services
Fig. 17
Reduced knowledge graph example. Knowledge graph generated for the triples extracted from the following sentence: This study highlights common risk factors, in particular diabetes mellitus.
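
As an illustration of how extracted triples can be written down as RDF, the sketch below serializes two triples for the Fig. 17 sentence in the N-Triples line format. It is a sketch under stated assumptions rather than the output of the KGen tool: the two triples are hand-written, and the `http://example.org/kg/` namespace and the helper names `mint_iri` and `to_ntriples` are hypothetical.

```python
from urllib.parse import quote

BASE = "http://example.org/kg/"  # hypothetical namespace (assumption)


def mint_iri(label: str) -> str:
    """Build an IRI under BASE from a free-text label.

    Spaces become underscores and any remaining unsafe characters are
    percent-encoded.
    """
    return BASE + quote(label.strip().lower().replace(" ", "_"))


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) labels as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        lines.append(f"<{mint_iri(s)}> <{mint_iri(p)}> <{mint_iri(o)}> .")
    return "\n".join(lines)


# Hand-written triples for the Fig. 17 sentence; not actual KGen output.
triples = [
    ("this study", "highlights", "common risk factors"),
    ("common risk factors", "include", "diabetes mellitus"),
]
print(to_ntriples(triples))
```

Each output line has the form `<subject> <predicate> <object> .`, so the result can be loaded by any RDF tool that reads N-Triples.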
