BMC Bioinformatics. 2005 Mar 11;6:51.

doi: 10.1186/1471-2105-6-51.

CoPub Mapper: mining MEDLINE based on search term co-publication

Blaise T F Alako¹, Antoine Veldhoven, Sjozef van Baal, Rob Jelier, Stefan Verhoeven, Ton Rullmann, Jan Polman, Guido Jenster

PMID: 15760478
PMCID: PMC1274248
DOI: 10.1186/1471-2105-6-51

Blaise T F Alako et al. BMC Bioinformatics. 2005.

Affiliation

¹ Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands. blaise.alako@wur.nl

Abstract

Background: High-throughput microarray analyses yield many differentially expressed genes that are potentially responsible for the biological process of interest. To identify biological similarities between genes, publications were identified in MEDLINE in which pairs of gene names, or combinations of a gene name with specific keywords, were co-mentioned.

Results: MEDLINE search strings for 15,621 known genes and 3,731 keywords were generated and validated. PubMed IDs were retrieved from MEDLINE, and the relative probability of co-occurrence was determined for all gene-gene and gene-keyword pairs. To assess gene clustering according to literature co-publication, 150 genes consisting of 8 sets with known connections (same pathway, same protein complex, same cellular localization, etc.) were run through the program. Receiver operating characteristic (ROC) analyses showed that most gene sets clustered much better than expected by random chance. To test the grouping of genes from real microarray data, 221 differentially expressed genes from a microarray experiment were analyzed with CoPub Mapper, which resulted in several relevant clusters of genes with biological process and disease keywords. In addition, all genes versus keywords were hierarchically clustered to reveal a complete grouping of published genes based on co-occurrence.

Conclusion: The CoPub Mapper program allows for quick and versatile querying of co-published genes and keywords and can be successfully used to cluster predefined groups of genes and microarray data.
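The abstract does not give the formula behind the "relative probability of co-occurrences", so the following is only a minimal sketch of one plausible reading: the log ratio of the observed co-publication frequency to the frequency expected if the two terms were independent. The function name `relative_score`, the toy PubMed ID sets, and the corpus size are all hypothetical; the `min_articles` cutoff mirrors the "at least two articles" criterion mentioned in the figure legends.

```python
import math

def relative_score(pmids_a, pmids_b, total_articles, min_articles=2):
    """Log10 of observed vs. expected co-occurrence (an assumed scoring scheme).

    Returns None when fewer than `min_articles` articles mention both terms.
    """
    shared = len(pmids_a & pmids_b)
    if shared < min_articles:
        return None
    p_a = len(pmids_a) / total_articles    # fraction of articles mentioning term A
    p_b = len(pmids_b) / total_articles    # fraction of articles mentioning term B
    p_ab = shared / total_articles         # fraction mentioning both terms
    return math.log10(p_ab / (p_a * p_b))

# Hypothetical toy data: sets of PubMed IDs retrieved per search term.
term_pmids = {
    "GENE_A": {1, 2, 3, 4},
    "GENE_B": {3, 4, 5},
    "GENE_C": {6},
}
total = 100  # assumed corpus size, for illustration only

score_ab = relative_score(term_pmids["GENE_A"], term_pmids["GENE_B"], total)
score_ac = relative_score(term_pmids["GENE_A"], term_pmids["GENE_C"], total)
print("GENE_A / GENE_B:", score_ab)
print("GENE_A / GENE_C:", score_ac)
```

A positive score means the pair is co-mentioned more often than independence would predict. Representing each term as a set of PubMed IDs keeps the pairwise step to a single set intersection.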

Figures

**Figure 1**
Flow diagram of the processing and curation of the gene names, symbols and aliases. Gene names, symbols and aliases were retrieved from Affymetrix HG_U95 / HG_U133 and the HUGO databases.

**Figure 2**
Clustered view of gene co-occurrences among a collection of 8 groups of selected genes. For the 150 genes, the relative scores of co-occurrences were calculated and clustered using hierarchical clustering. A co-occurrence was only taken into account when at least two articles mentioned the gene-gene pair. Using this criterion, 45 genes did not co-publish with any of the other 149 genes. The group (Table 2) to which each gene belongs is indicated on the right of the figure. Image contrast in TreeView was set at 50. Scaled (1–100) relative scores are represented in a red spectrum, with bright red being the highest score. A relative score of zero, or no score, is shown in black.

**Figure 3**
Receiver operating characteristics (ROC) of the 8 selected groups of genes, used to quantify their coherence upon clustering of literature co-occurrences. Co-occurrences of the 150 genes were determined with the genes themselves, or with the 5 different keyword thesauri. A co-occurrence was only taken into account when at least two articles mentioned the gene-gene or gene-keyword pair. The co-occurrence matrices were clustered by Pearson correlation and the distances between genes were determined. For each gene, it was determined whether the next closest clustered gene was a group member. Genes from the same group were scored as true positives and any other gene as a false positive to generate a ROC curve. For each gene, the area under the ROC curve (AUC) was determined, and the median over all members of each group ± SD is depicted. Scaling is from an AUC of 0.3 to 1. An AUC of 0.5, representing a random ordering, is highlighted with a thick line.

**Figure 4**
Hierarchical clustering of literature co-occurrences of 104 genes (rows) versus 761 biological processes and diseases (columns), using CoPub Mapper results for genes differentially expressed in PCOS ovaries. A co-occurrence was only taken into account when at least three articles mentioned the gene-keyword pair. Of the 221 regulated genes, 104 had a gene name, symbol or alias and produced a gene-keyword pair with biological processes or diseases; these 104 modulated genes returned 761 keywords denoting biological processes or diseases. Hierarchical clustering was performed in Spotfire using the Complete Linkage method with Correlation as the similarity measure. Several subclusters were identified, shown here with blue boxes; the number of genes in each cluster is given in parentheses. A: PCOS, Obesity, Insulin Resistance (4); B & D: Gametogenesis (5&8); C: Cell adhesion, Angiogenesis (19); E & H: Immune response, Inflammation (14&11); F: Cancer, Cell growth, Differentiation (32); G: Inflammatory diseases (6).

**Figure 5**
Hierarchical clustering of literature co-occurrences of 5626 genes (rows) versus 1275 diseases (columns). A co-occurrence was only taken into account when at least two articles mentioned the gene-disease pair. Each gene had to reach a relevance score (scaled 1–100) of >46 at least once. A: Overview of all 5626 genes and 1275 diseases. B: Enlargement of a small subsection of genes showing the amount of detail present in the CoPub Mapper analysis.

**Figure 6**
Hierarchical clustering of literature co-occurrences of 1135 genes (rows) versus 177 cellular components (columns). A co-occurrence was only taken into account when at least three articles mentioned the gene-cellular component pair. Each gene had to reach a relevance score (scaled 1–100) of >50 at least twice. Relative scores of less than 50 were masked in the TreeView program. Some of the cellular component concepts responsible for the clustering of genes are indicated.
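The per-gene ROC procedure described for Figure 3 can be illustrated with a small sketch. It assumes that walking down the list of neighbours from closest to farthest and scoring group members as true positives is equivalent to the rank-based (Mann-Whitney) AUC computed below; the legend does not spell out the exact computation, so this is an interpretation rather than the authors' code. The gene names, group labels, and distances are hypothetical toy values.

```python
from statistics import median

def gene_auc(gene, distances, groups):
    """Rank-based AUC for one gene: probability that a same-group gene
    is closer to `gene` than a gene from another group (ties count 0.5)."""
    others = [g for g in distances[gene] if g != gene]
    pos = [distances[gene][g] for g in others if groups[g] == groups[gene]]
    neg = [distances[gene][g] for g in others if groups[g] != groups[gene]]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical group membership and pairwise distances (e.g. 1 - Pearson r).
groups = {"g1": "X", "g2": "X", "g3": "Y", "g4": "Y"}
pairs = {
    ("g1", "g2"): 0.10, ("g1", "g3"): 0.80, ("g1", "g4"): 0.90,
    ("g2", "g3"): 0.70, ("g2", "g4"): 0.05, ("g3", "g4"): 0.20,
}

# Expand the pair list into a symmetric nested dictionary.
distances = {g: {} for g in groups}
for (a, b), d in pairs.items():
    distances[a][b] = d
    distances[b][a] = d

aucs = {g: gene_auc(g, distances, groups) for g in groups}

# Median AUC over the members of each group, as summarised in the figure.
group_median = {
    name: median(aucs[g] for g in groups if groups[g] == name)
    for name in set(groups.values())
}
print(aucs)
print(group_median)
```

An AUC of 0.5 corresponds to a random ordering of neighbours, and values near 1 indicate that a gene's closest neighbours are mostly members of its own group.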

Cited by

Discovery of disease- and drug-specific pathways through community structures of a literature network.
Pham M, Wilson S, Govindarajan H, Lin CH, Lichtarge O. Bioinformatics. 2020 Mar 1;36(6):1881-1888. doi: 10.1093/bioinformatics/btz857. PMID: 31738408. Free PMC article.
Functional variants identify sex-specific genes and pathways in Alzheimer's Disease.
Bourquard T, Lee K, Al-Ramahi I, Pham M, Shapiro D, Lagisetty Y, Soleimani S, Mota S, Wilhelm K, Samieinasab M, Kim YW, Huh E, Asmussen J, Katsonis P, Botas J, Lichtarge O. Nat Commun. 2023 May 13;14(1):2765. doi: 10.1038/s41467-023-38374-z. PMID: 37179358. Free PMC article.
Gene regulatory networks in lactation: identification of global principles using bioinformatics.
Lemay DG, Neville MC, Rudolph MC, Pollard KS, German JB. BMC Syst Biol. 2007 Nov 27;1:56. doi: 10.1186/1752-0509-1-56. PMID: 18039394. Free PMC article.
CoPub: a literature-based keyword enrichment tool for microarray data analysis.
Frijters R, Heupers B, van Beek P, Bouwhuis M, van Schaik R, de Vlieg J, Polman J, Alkema W. Nucleic Acids Res. 2008 Jul 1;36(Web Server issue):W406-10. doi: 10.1093/nar/gkn215. Epub 2008 Apr 28. PMID: 18442992. Free PMC article.
Linking genes to literature: text mining, information extraction, and retrieval applications for biology.
Krallinger M, Valencia A, Hirschman L. Genome Biol. 2008;9 Suppl 2(Suppl 2):S8. doi: 10.1186/gb-2008-9-s2-s8. Epub 2008 Sep 1. PMID: 18834499. Free PMC article. Review.
