PLoS One. 2016 Apr 28;11(4):e0154404.
doi: 10.1371/journal.pone.0154404. eCollection 2016.

Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods

Lovro Šubelj et al. PLoS One. 2016.

Abstract

Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Pair-wise distances between the clusterings obtained by the considered methods.
Panel A shows the heatmaps of clustering distances for the Scientometrics citation network, where the methods are clustered into 5 and 11 classes (left- and right-hand side, respectively). Note that this merely implies the ordering of the rows/columns. Insets on the right show the method silhouette coefficients. Panel B shows the same for the Library & Information Science citation network. See Methods for the definition of the clustering distance and text for the details of the method clustering procedure.
Fig 2. Size distributions of the clusterings obtained by representative methods.
Panels A and B show cluster size distributions P(s) for the Library & Information Science and Physics citation networks, respectively. Wherever plausible, power laws $s^{-\gamma}$ are fitted to the tails of the distributions by maximum likelihood estimation, $\gamma = 1 + n \left( \sum_i \log (s_i / s_{\min}) \right)^{-1}$ for $s_{\min} > 1$.
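The maximum likelihood fit mentioned in the Fig 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the estimator takes the standard continuous form, $\gamma = 1 + n / \sum_i \ln(s_i / s_{\min})$, with the sum running over the n cluster sizes at or above $s_{\min}$. The function name and the sample sizes are hypothetical.

```python
import math


def powerlaw_exponent_mle(sizes, s_min):
    """Estimate gamma for a tail P(s) ~ s^(-gamma) by maximum likelihood.

    Assumes the continuous-variable estimator
    gamma = 1 + n / sum(ln(s_i / s_min)), using only sizes >= s_min.
    """
    tail = [s for s in sizes if s >= s_min]
    log_sum = sum(math.log(s / s_min) for s in tail)
    if log_sum == 0:
        # Happens when the tail is empty or every size equals s_min.
        raise ValueError("need at least one size strictly above s_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical cluster sizes, for illustration only.
sizes = [2, 2, 3, 4, 4, 8, 16, 32]
gamma = powerlaw_exponent_mle(sizes, s_min=2)
print(f"estimated exponent: {gamma:.3f}")
```

The choice of $s_{\min}$ strongly affects the estimate, so in practice it would be selected by a separate criterion rather than fixed by hand as it is here.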
Fig 3. Robustness of the clusterings obtained by representative methods.
Panels A and B show clustering robustness plots V(α) for the Scientometrics and Library & Information Science citation networks, respectively. These show the distances between the clusterings obtained after randomly rewiring α links. See Methods for the definitions of clustering distance and robustness.
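The clustering distance used in Figs 1 and 3 is defined in the paper's Methods section, which this page does not reproduce. As a rough sketch of how such a distance between two clusterings can be computed, the Python below implements the variation of information. Treating this as the paper's measure is an assumption suggested only by the symbol V; the function name is hypothetical, and any normalization the authors apply is omitted.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information between two clusterings of the same nodes.

    labels_a[i] and labels_b[i] are the cluster labels of node i.
    Returns H(A|B) + H(B|A) in nats; 0 means the partitions are identical.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same nodes")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # log(p_ab / p_a) + log(p_ab / p_b), written with raw counts.
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi


# Hypothetical clusterings of four publications.
before = [0, 0, 1, 1]
after = [0, 1, 0, 1]
print(variation_of_information(before, after))
```

A robustness curve of the kind shown in Fig 3 would then compare the clustering of the original network with the clustering obtained after rewiring, for increasing amounts of rewiring.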
Fig 4. Degeneracy of the clusterings obtained by representative methods.
Panels A and B show clustering degeneracy diagrams D for the Library & Information Science and Physics citation networks, respectively. These display the non-degenerate ranges of the clusterings, while the percentages show the fraction of nodes in tiny clusters $\sum_{s_i < s_{\mathrm{tiny}}} s_i / n$ and in the largest cluster $s_L / n$ (left- and right-hand side, respectively). See text for the definition of clustering degeneracy.
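The two percentages reported in the Fig 4 caption can be computed directly from a list of cluster labels. The Python sketch below assumes the reading that the first quantity sums the sizes of all clusters smaller than a threshold and divides by the number of nodes n, and that the second is the size of the largest cluster divided by n. The function name is hypothetical; the default threshold of 15 is the value quoted in the Fig 6 caption.

```python
from collections import Counter


def tiny_and_largest_fractions(labels, s_tiny=15):
    """Return (fraction of nodes in tiny clusters, fraction in largest cluster).

    labels[i] is the cluster label of node i. A cluster counts as tiny
    when its size is strictly below s_tiny.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    sizes = list(Counter(labels).values())
    tiny_fraction = sum(s for s in sizes if s < s_tiny) / n
    largest_fraction = max(sizes) / n
    return tiny_fraction, largest_fraction


# Hypothetical clustering: one cluster of 20 nodes and two tiny ones.
labels = [0] * 20 + [1] * 3 + [2] * 2
tiny, largest = tiny_and_largest_fractions(labels)
print(f"tiny: {tiny:.0%}, largest: {largest:.0%}")
```

A clustering where either fraction approaches 1 carries little information, which is presumably why the diagrams report both ends.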
Fig 5. Alluvial diagram of the clusterings obtained by the map equation methods Metimap and Infomap.
The diagram shows the overlap between the largest scientometric clusters returned by Metimap and Infomap on the Library & Information Science citation network (left and right, respectively). ‘Remaining publications’ are included in one of the clusters in the Metimap (Infomap) clustering but not included in any of the clusters in the Infomap (Metimap) clustering. See Table 5 for details of the clusterings.
Fig 6. Size distributions and degeneracy of the clusterings obtained by the selected methods.
The methods with and without post-processing are applied to the Physics citation network, while the panels A and B show cluster size distributions P(s) and clustering degeneracy diagrams D, respectively. Vertical lines in panel A represent the threshold size $s_{\mathrm{tiny}} = 15$. See text for the definition of clustering degeneracy and Methods for the details of the clustering post-processing approach.
Fig 7. Sizes and coverage of the largest clusters obtained by the selected methods.
The methods with and without post-processing are applied to the All Fields citation network, while the panels A and B show the sizes s and coverage K/k of the largest 50 clusters, respectively. Horizontal lines in panel A represent the threshold size $s_{\mathrm{giant}} = 10^4$. See text for the definition of cluster coverage.
