BMC Bioinformatics. 2004 Oct 8;5:147.
doi: 10.1186/1471-2105-5-147.

Content-rich biological network constructed by mining PubMed abstracts

Hao Chen et al. BMC Bioinformatics. 2004.

Abstract

Background: The integration of the rapidly expanding corpus of information about the genome, transcriptome, and proteome, engendered by powerful technological advances, such as microarrays, and the availability of genomic sequence from multiple species, challenges the grasp and comprehension of the scientific community. Despite the existence of text-mining methods that identify biological relationships based on the textual co-occurrence of gene/protein terms or similarities in abstract texts, knowledge of the underlying molecular connections on a large scale, which is a prerequisite to understanding novel biological processes, lags far behind the accumulation of data. While computationally efficient, the co-occurrence-based approaches fail to characterize biological interactions (e.g., inhibition or stimulation, directionality). Programs with natural language processing (NLP) capability have been created to address these limitations; however, they are generally not readily accessible to the public.

Results: We present an NLP-based text-mining approach, Chilibot, which constructs content-rich relationship networks among biological concepts, genes, proteins, or drugs. Among its features, Chilibot can generate suggestions for new hypotheses. Lastly, we provide evidence that the connectivity of molecular networks extracted from the biological literature follows the power-law distribution, indicating scale-free topologies consistent with the results of previous experimental analyses.

Conclusions: Chilibot distills scientific relationships from knowledge available throughout a wide range of biological domains and presents these in a content-rich graphical format, thus integrating general biomedical knowledge with the specialized knowledge and interests of the user. Chilibot (http://www.chilibot.net) is available free of charge to academic users.

Figures

Figure 1
The network map of a biological network constructed by Chilibot. Chilibot queried the entire PubMed abstract database to identify a network of relationships amongst a set of genes reported to be regulated by cocaine [44], a biological concept ("plasticity"), and a drug ("cocaine"). Lines connecting rectangular nodes indicate relationships between the genes shown, and each icon in the middle of a line represents the character of the relationship. Interactive relationships (circles) are neutral (gray), stimulatory (green), inhibitory (red) or both stimulatory/inhibitory (yellow). The number within each icon indicates the quantity of abstracts retrieved for documenting that relationship. Icons containing the plus sign ("+") represent "parallel relationships". Gray rhomboidal icons indicate that only co-occurrence was detected. All arrowheads indicate the direction of the interaction, and some are bi-directional. The green or pink colors of rectangular nodes represent up- or down-regulation of the genes identified therein, respectively, based on experimental data provided by the user. More saturated colors are associated with larger changes. Nodes with no expression values (e.g., "cocaine") are in cyan. The terms and icons are linked to documentation when viewed in a web-browser. See supplementary information for subnetwork maps generated by Chilibot.
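The icon legend in this caption can be sketched as a small data structure. The following is a minimal illustration only, not Chilibot's implementation: the class name `Relationship`, the function `icon_for`, the kind labels, and the placeholder node names are all assumptions introduced here. The shape and color mapping follows the caption; the caption does not state a color for "parallel relationship" icons, so none is assigned.

```python
from dataclasses import dataclass

# Colors of the circular "interactive relationship" icons, as listed in the caption.
# The dictionary keys are hypothetical labels chosen for this sketch.
INTERACTIVE_COLORS = {
    "neutral": "gray",
    "stimulatory": "green",
    "inhibitory": "red",
    "both": "yellow",
}


@dataclass(frozen=True)
class Relationship:
    """One edge of the network map (hypothetical structure)."""
    source: str
    target: str
    kind: str                    # a key of INTERACTIVE_COLORS, "parallel", or "cooccurrence"
    abstracts: int               # number of abstracts documenting the relationship
    bidirectional: bool = False  # some arrowheads point both ways


def icon_for(rel):
    """Return (shape, color) for the icon drawn in the middle of an edge."""
    if rel.kind == "cooccurrence":
        return ("rhombus", "gray")   # only co-occurrence was detected
    if rel.kind == "parallel":
        return ("plus", None)        # color is not specified in the caption
    if rel.kind in INTERACTIVE_COLORS:
        return ("circle", INTERACTIVE_COLORS[rel.kind])
    raise ValueError(f"unknown relationship kind: {rel.kind!r}")


# Placeholder edges; the node names are made up and do not come from the figure.
edges = [
    Relationship("geneA", "geneB", "inhibitory", abstracts=3),
    Relationship("geneB", "geneC", "cooccurrence", abstracts=1),
]
for edge in edges:
    print(edge.source, edge.target, icon_for(edge), edge.abstracts)
```

Keeping the relationship kind separate from its rendering means the same edge list could drive either a graphical map or a plain table.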
Figure 2
Distribution of the number of synonyms. A synonym dictionary of gene symbols was compiled from 6 databases with a total of 113,503 unique symbols. Analysis of the number of synonyms for each symbol shows that 62,178 (54.8%) had more than one.
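As a rough illustration of the kind of tally described in this caption, the sketch below merges several symbol-to-synonym tables and counts how many symbols end up with more than one synonym. The function names and the toy tables are assumptions made for this example; the six source databases and the actual merging rules are not described in this excerpt.

```python
from collections import defaultdict


def merge_synonyms(sources):
    """Merge several {symbol: set_of_synonyms} tables into one dictionary."""
    merged = defaultdict(set)
    for source in sources:
        for symbol, synonyms in source.items():
            merged[symbol].update(synonyms)
    return dict(merged)


def count_with_multiple(merged):
    """Return (count, fraction) of symbols that have more than one synonym."""
    multi = sum(1 for synonyms in merged.values() if len(synonyms) > 1)
    return multi, multi / len(merged)


# Toy tables with placeholder symbols; these are not real database entries.
db1 = {"SYM1": {"alias_a"}, "SYM2": {"alias_b"}}
db2 = {"SYM1": {"alias_c"}, "SYM3": {"alias_d", "alias_e"}}

merged = merge_synonyms([db1, db2])
multi, fraction = count_with_multiple(merged)
print(multi, round(fraction, 3))

# Consistency check on the figures quoted in the caption.
print(round(62178 / 113503 * 100, 1))
```

The last line reproduces the percentage quoted in the caption from its two counts.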
Figure 3
Effects of the number of abstracts obtained on retrieval, recall, and content of relationships. To measure Chilibot's level of recall, a total of 770 known relationships specified in the Database of Interacting Proteins (DIP) was used as a reference set. A. Distribution of the number of sentences describing relationships when a maximum of 5–50 abstracts were selected for retrieval. For each group, the average number of sentences documenting a relationship is reported. Of the 770 known relationships, the histograms show that an increasing number of relationships are documented by a larger number of sentences when a greater number of abstracts are specified for retrieval. B. Increasing the specified number of abstracts for retrieval from 5 to 50 had no effect on the recall of total relationships, although there were changes within relationship categories (e.g., stimulatory/inhibitory).
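Recall against a reference set, as used in this figure, is the fraction of known relationships that the system recovers. The sketch below shows that calculation under two assumptions that are not stated in the caption: relationships are treated as unordered pairs, and a relationship counts as recovered if the pair appears at all in the retrieved set. The function name and the placeholder pairs are illustrative.

```python
def recall(reference_pairs, retrieved_pairs):
    """Fraction of reference relationships that also appear among the retrieved ones.

    Pairs are treated as unordered, so ("A", "B") matches ("B", "A").
    """
    reference = {frozenset(pair) for pair in reference_pairs}
    retrieved = {frozenset(pair) for pair in retrieved_pairs}
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & retrieved) / len(reference)


# Placeholder pairs, not taken from DIP.
known = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
found = [("B", "A"), ("C", "D"), ("X", "Y")]
print(recall(known, found))
```

Retrieved pairs that are absent from the reference set, such as ("X", "Y") here, do not affect recall; they would only matter for precision.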
Figure 4
Scale-free topology of a relationship network derived from the biological literature. Chilibot was used to retrieve the relationships within 3 sets of randomly selected genes (300 genes per group). The resulting networks contain 224, 116, and 138 nodes and 3018, 962, and 1912 relationships, respectively. The distribution of the average connectivity of the 3 groups follows the power-law ($P(k) \sim k^{-n}$, $n = 1.21$).
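One simple way to estimate the exponent n in P(k) ~ k^(-n) is a least-squares line fit on log-log axes, since log P(k) = -n log k + c. The sketch below uses that technique as an illustration; the caption does not say how the authors obtained n = 1.21, so this should not be read as their method. The function name and the synthetic input are assumptions.

```python
import math


def fit_power_law_exponent(distribution):
    """Estimate n in P(k) ~ k^(-n) by ordinary least squares on log-log values.

    `distribution` maps connectivity k to its frequency or count.
    Entries with k <= 0 or a non-positive frequency are skipped.
    """
    points = [
        (math.log(k), math.log(p))
        for k, p in distribution.items()
        if k > 0 and p > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two usable points")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # slope of the fitted line is -n


# Synthetic, noise-free distribution generated with n = 1.21 as a sanity check;
# these values are not the data behind the figure.
synthetic = {k: 1000 * k ** -1.21 for k in range(1, 50)}
print(round(fit_power_law_exponent(synthetic), 2))
```

On real, noisy connectivity counts a log-log fit can be biased by sparse high-k bins, so binning or a maximum-likelihood estimator may be preferable.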

References

    1. Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003;31:248–250. doi: 10.1093/nar/gkg056.
    2. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000;28:289–291. doi: 10.1093/nar/28.1.289.
    3. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28:21–28. doi: 10.1038/88213.
    4. Stapley BJ, Benoit G. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in Medline abstracts. Pac Symp Biocomput. 2000:529–540.
    5. Donaldson I, Martin J, de Bruijn B, Wolting C, Lay V, Tuekam B, Zhang S, Baskin B, Bader GD, Michalickova K. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics. 2003;4:11. doi: 10.1186/1471-2105-4-11.
