EPJ Data Sci. 2016;5(1):7.
doi: 10.1140/epjds/s13688-016-0068-2. Epub 2016 Mar 3.

Homophily and missing links in citation networks

Valerio Ciotti et al. EPJ Data Sci. 2016.

Abstract

Citation networks have been widely used to study the evolution of science through the lenses of the underlying patterns of knowledge flows among academic papers, authors, research sub-fields, and scientific journals. Here we focus on citation networks to cast light on the salience of homophily, namely the principle that similarity breeds connection, for knowledge transfer between papers. To this end, we assess the degree to which citations tend to occur between papers that are concerned with seemingly related topics or research problems. Drawing on a large data set of articles published in the journals of the American Physical Society between 1893 and 2009, we propose a novel method for measuring the similarity between articles through the statistical validation of the overlap between their bibliographies. Results suggest that the probability of a citation made by one article to another is indeed an increasing function of the similarity between the two articles. Our study also enables us to uncover missing citations between pairs of highly related articles, and may thus help identify barriers to effective knowledge flows. By quantifying the proportion of missing citations, we conduct a comparative assessment of distinct journals and research sub-fields in terms of their ability to facilitate or impede the dissemination of knowledge. Findings indicate that Electromagnetism and Interdisciplinary Physics are the two sub-fields in physics with the smallest percentage of missing citations. Moreover, knowledge transfer seems to be more effectively facilitated by journals of wide visibility, such as Physical Review Letters, than by lower-impact ones. Our study has important implications for authors, editors and reviewers of scientific journals, as well as public preprint repositories, as it provides a procedure for recommending relevant yet missing references and properly integrating bibliographies of papers.

Keywords: bibliometric techniques; citation networks; homophily; link prediction.

Figures

Figure 1
Quantifying the similarity between two articles based on their bibliographies. The similarity between two articles can be defined in terms of the overlap between their reference lists. The two articles P1 and P2 in panel (a) share only one citation; they should therefore be considered less similar than articles P3 and P4 in panel (b), which share four citations. This difference can be captured by the Jaccard index, which is equal to 0.2 in the former case and to 1.0 in the latter. However, the Jaccard index is also equal to 1.0 for the two articles in panel (c), which share only two citations. If citations are interpreted as proxies for knowledge flows, then the similarity between articles P7 and P8 in panel (d), which both cite a highly-cited article, should be smaller than the similarity between articles P9 and P10 in panel (e), which are the only two articles citing P11. Our similarity measure, based on statistical validation, properly takes these heterogeneities into account.
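The limitation of the Jaccard index described in this caption can be illustrated with a short Python sketch. The reference labels (r1, r2, ...) are hypothetical and were chosen only to reproduce the values quoted above. The second function is one common way to attach a p-value to an overlap, a hypergeometric test; the caption does not specify the null model, so that choice, along with the name overlap_p_value and the corpus size n_total, is an assumption made for illustration.

```python
from math import comb


def jaccard(refs_a, refs_b):
    """Jaccard index of two reference lists: |A & B| / |A | B|."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_p_value(n_a, n_b, shared, n_total):
    """Probability of sharing at least `shared` references by chance.

    Assumed null model (not taken from the source): article B draws its
    n_b references uniformly at random from n_total candidate papers,
    n_a of which are cited by article A (hypergeometric distribution).
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / comb(n_total, n_b)


# Hypothetical bibliographies matching the counts in panels (a)-(c).
panel_a = jaccard(["r1", "r2", "r3"], ["r3", "r4", "r5"])              # 0.2
panel_b = jaccard(["r1", "r2", "r3", "r4"], ["r1", "r2", "r3", "r4"])  # 1.0
panel_c = jaccard(["r1", "r2"], ["r1", "r2"])                          # 1.0

# With an invented corpus of 1000 candidate papers, the four-reference
# overlap is far less likely to arise by chance than the two-reference one,
# even though both pairs have the same Jaccard index.
p_four_shared = overlap_p_value(4, 4, 4, 1000)
p_two_shared = overlap_p_value(2, 2, 2, 1000)
```

This simple null model distinguishes panels (b) and (c) by bibliography size, but it treats every cited paper as equally likely to be cited, so it does not by itself capture the difference between panels (d) and (e).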
Figure 2
The probability Pij(p) of observing a citation between two articles whose bibliography overlap is statistically significant at the threshold value p. Notice that Pij(p) increases as the statistical threshold p decreases. That is, citations between pairs of articles characterised by a highly significant overlap tend to occur with a higher likelihood than citations between articles whose reference lists are not significantly similar. The inset shows how the number of pairs of articles characterised by a statistically significant similarity at a given threshold p varies with p.
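As a rough illustration of how such a curve could be estimated, the Python sketch below computes the fraction of validated pairs that are actually linked by a citation. It assumes each pair comes with a p-value for its bibliography overlap and a flag saying whether the more recent article cites the older one; the function name, the use of "less than or equal to" for validation, and the toy data are all hypothetical.

```python
def citation_probability(pairs, threshold):
    """Fraction of pairs validated at `threshold` that are linked by a citation.

    `pairs` is an iterable of (p_value, cited) tuples. A pair counts as
    validated when p_value <= threshold (an assumed convention).
    Returns None when no pair passes the test.
    """
    selected = [cited for p_value, cited in pairs if p_value <= threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Invented toy data: (overlap p-value, was the citation actually made?)
toy_pairs = [
    (1e-9, True),
    (1e-8, True),
    (1e-5, True),
    (1e-4, False),
    (1e-3, False),
    (1e-2, True),
]

strict = citation_probability(toy_pairs, 1e-7)  # 2 of 2 validated pairs cited
loose = citation_probability(toy_pairs, 1e-3)   # 3 of 5 validated pairs cited
```

In this toy data the estimate rises as the threshold becomes stricter, which is the qualitative behaviour the caption describes; the numbers themselves carry no empirical meaning.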
Figure 3
Lack of knowledge flows. An example of several validated pairs of articles in the APS citation network at p = 10⁻⁷. Articles are reported in order of publication time, from older (left) to more recent (right) ones. The occurrence of a link indicates that the pair of articles has passed the statistical test, while the colour of the link indicates that the most recent article in the pair actually did (green) or did not (red) cite the other one. In this case, all the articles represented as yellow nodes are articles co-authored by researchers in the same group, while article A was co-authored by another group. The identification of a large number of missing citations suggests that the two groups might have been unaware of the work of their colleagues in the same field.
Figure 4
Ranking journals and sub-fields by lack of knowledge flows. The analysis of missing links restricted to specific sub-fields of physics or single APS journals confirms that the tendency of a citation to occur between a pair of articles increases with the similarity between the bibliographies of the two articles. Panels (a)-(b) show the plots of U(p) = 1 − Pij(p) for different sub-graphs corresponding to (a) two families of PACS codes, namely 40 (Electromagnetism) and 50 (Gases and Plasmas), and (b) two APS journals, namely Physical Review Letters and Physical Review C. In panel (c) we sketch the procedure adopted to compute the estimate Ũ0: we consider the line tangent to the curve U(p) at the smallest value of the statistical threshold p for which we still have a relatively substantial number of validated pairs (in this case, p = 10⁻⁷), and we define Ũ0 as the value of the intercept at p = 0 of that line. In panels (d) and (e) we show, respectively, the rankings of sub-fields and APS journals based on the values of Ũ0. Notice that Electromagnetism and Interdisciplinary Physics are the two sub-fields with the smallest percentage of missing links, i.e., those in which knowledge among articles flows effectively and as would be expected if citations were driven by overlaps between topics or research problems. Interestingly, the lack of knowledge flows between articles published in Physical Review C (Ũ0 ≈ 0.27) is almost nine times as large as the one identified in Physical Review Letters (Ũ0 ≈ 0.03), which is the APS journal with the widest visibility and largest impact.
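The intercept construction in panel (c) can be sketched in Python as follows. This is only an approximation under stated assumptions: the curve U(p) is available as sampled (p, U) points, the tangent at the smallest threshold is approximated by the secant through the two smallest sampled thresholds, and p is treated on a linear scale. The function name and the sample values are hypothetical.

```python
def estimate_u0(samples):
    """Extrapolate U(p) to p = 0 along the tangent at the smallest threshold.

    `samples` is a list of (p, U) points. The tangent is approximated by the
    straight line through the two points with the smallest p values, and the
    returned value is that line's intercept at p = 0.
    """
    ordered = sorted(samples)
    if len(ordered) < 2:
        raise ValueError("need at least two (p, U) samples")
    (p1, u1), (p2, u2) = ordered[0], ordered[1]
    if p1 == p2:
        raise ValueError("the two smallest thresholds must be distinct")
    slope = (u2 - u1) / (p2 - p1)
    return u1 - slope * p1


# Invented sample curve: fraction of missing links at three thresholds.
toy_curve = [(2e-7, 0.32), (1e-7, 0.30), (1e-6, 0.45)]
u0 = estimate_u0(toy_curve)
```

With these invented points the line through the two smallest thresholds has an intercept close to 0.28, slightly below the value observed at the smallest threshold. A fitted or smoothed derivative would be a reasonable alternative when the sampled curve is noisy.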
