Comparative Study
. 2011 Mar 17;6(3):e18029.
doi: 10.1371/journal.pone.0018029.

Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches


Kevin W Boyack et al. PLoS One. 2011.

Abstract

Background: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts, and subject headings. The nine approaches combined five analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were (a) MeSH subject headings and (b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
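Of the five techniques, the tf-idf cosine baseline is the simplest to illustrate. Below is a minimal pure-Python sketch of that one step; the toy tokenization, the plain log idf, and the example documents are assumptions for illustration, and the paper's exact weighting scheme may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus: two related biomedical titles and one unrelated one.
docs = [
    "gene expression in tumor cells".split(),
    "tumor suppressor gene expression".split(),
    "hospital readmission rates".split(),
]
vecs = tfidf_vectors(docs)
# Similarities of document 0 to the others; in the paper these rows would
# then be filtered to the top-n entries per document before clustering.
sims = [(j, cosine(vecs[0], vecs[j])) for j in range(1, len(docs))]
```

At the scale of the study (2.15 million records), this would of course be done with sparse-matrix machinery rather than Python dicts; the sketch only fixes the definitions.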

Conclusions: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.


Conflict of interest statement

Competing Interests: The authors have the following interests to declare: Authors Boyack, Klavans, and Patek are all employed by SciTech Strategies, Inc.; author Schijvenaars is employed by Collexis, Inc. No patent or product based on this work is under development. This commercial affiliation does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials. Data from this study are available for download at http://sci.slis.indiana.edu/sts/.

Figures

Figure 1. Cluster size distributions for the nine similarity approaches.
Figure 2. Textual coherence values by cluster size for the nine similarity approaches.
Coherence is a measure of cluster quality. A higher value of coherence indicates a higher degree of textual similarity between the titles and abstracts within a cluster than does a lower value of coherence. Data are shown for cluster size bins of at least 15 clusters.
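The abstract does not give the exact coherence formula. As a rough sketch of the Jensen-Shannon-based idea, one common form (an assumption here) scores a cluster by one minus the mean JS divergence between each document's word distribution and the cluster's aggregate distribution, so that textually tight clusters score near 1:

```python
import math
from collections import Counter

def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p[t] == 0 contribute 0."""
    return sum(w * math.log2(w / q[t]) for t, w in p.items() if w > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (dicts)."""
    keys = set(p) | set(q)
    p = {t: p.get(t, 0.0) for t in keys}
    q = {t: q.get(t, 0.0) for t in keys}
    m = {t: 0.5 * (p[t] + q[t]) for t in keys}   # mixture distribution
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def coherence(cluster_docs):
    """1 - mean JSD between each document and the cluster's word distribution."""
    dists = []
    for doc in cluster_docs:
        c = Counter(doc)
        total = sum(c.values())
        dists.append({t: n / total for t, n in c.items()})
    cent = Counter()
    for doc in cluster_docs:
        cent.update(doc)
    total = sum(cent.values())
    centroid = {t: n / total for t, n in cent.items()}
    return 1.0 - sum(js_divergence(d, centroid) for d in dists) / len(dists)
```

A cluster whose documents all use the same vocabulary gets coherence 1.0; mixing unrelated documents lowers it.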
Figure 3. Precision-recall curves for each cluster solution based on grant-to-article linkages.
To calculate precision-recall, clusters are first ordered by the fraction of their articles referencing an NIH grant. Precision is the cumulative fraction of articles referencing NIH grants, while recall is the cumulative fraction of articles in the cluster solution.
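The procedure in the caption can be sketched directly. Representing each cluster as an (n_articles, n_grant_linked) pair is an assumption for illustration:

```python
def precision_recall(clusters):
    """Compute a precision-recall curve from grant-to-article linkages.

    clusters: list of (n_articles, n_grant_linked) pairs, one per cluster,
    with n_articles > 0. Clusters are ranked by their grant-linked fraction,
    then precision and recall are accumulated down the ranking.
    """
    ordered = sorted(clusters, key=lambda c: c[1] / c[0], reverse=True)
    total_articles = sum(n for n, _ in ordered)
    curve = []
    cum_articles = cum_grant = 0
    for n, g in ordered:
        cum_articles += n
        cum_grant += g
        precision = cum_grant / cum_articles       # grant-linked share so far
        recall = cum_articles / total_articles     # share of all articles so far
        curve.append((recall, precision))
    return curve
```

A solution that concentrates grant-linked articles into a few clusters keeps precision high as recall grows, which is what the two concentration measures reward.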
Figure 4. Two-dimensional map of the PMRA cluster solution, representing nearly 29,000 clusters and over two million articles.
The map was generated with cluster-to-cluster similarity values using the DrL graph layout routine. Color legend: Chemistry (blue), Engineering (cyan), Biology (green), Biotechnology (teal), Infectious Disease (brick red), Medical Specialties (red), Health Services (peach), Brain (orange), Social Sciences (yellow), Computer Sciences (pink).

