Quant Sci Stud. 2022 Winter;3(1):174-193.
doi: 10.1162/qss_a_00181. Epub 2022 Apr 12.

The structural shift and collaboration capacity in GenBank Networks: A longitudinal study

Jian Qin et al. Quant Sci Stud. 2022 Winter.

Abstract

Metadata in scientific data repositories such as GenBank contain links between data submissions and related publications. As a new data source for studying collaboration networks, metadata in data repositories compensate for the limitations of publication-based research on collaboration networks. This paper reports the findings from a GenBank metadata analytics project. We used network science methods to uncover the structures and dynamics of GenBank collaboration networks from 1992 to 2018. The longitudinal span and large scale of this data collection allowed us to trace the evolution of the collaboration networks and to identify a trend of flattening network structures over time, as well as an optimal range of assortative mixing for enhancing collaboration capacity. By combining metadata from the data production stage with metadata from the publication stage, we uncovered new characteristics of collaboration networks and developed new metrics for assessing the effectiveness of enablers of collaboration: scientific and technical human capital, cyberinfrastructure, and science policy.

Keywords: GenBank metadata analysis; collaboration capacity; collaboration networks; impact assessment; longitudinal study of collaboration networks.

Conflict of interest statement

Competing interests: The authors have no competing interests.

Figures

Figure 1.
The metadata section in a GenBank annotation record.
Figure 2.
Distributions of alpha values and mean degrees for both publication and sequence data submission networks in GenBank, 1992–2018. The alpha values for the two networks appear to be almost identical, while the mean degree values for the publication network have been consistently higher than those for the submission network. (The data used to generate this chart are in Table S1. In this paper, table and figure numbers prefixed with S refer to the Supplementary Materials.)
Figure 3.
The size of the giant component grew steadily from 1992 to 2018. The growth in the percentage of edges outpaced that of nodes. See Table S2 for the data used to draw this plot.
Figure 4.
GenBank network visualization from 1992 to 2018: Each network represents 1 year of the data and includes the merged data submission and publication coauthor networks. Nodes that appear only in the publication network are blue with green links. Nodes that appear only in the data submission network are dark red with red links. Nodes that appear in both networks are purple with dark purple links between them. To show the main structures, we focus on the giant component for each year; isolates and disconnected clusters have therefore been removed. Larger visualizations of the yearly changes in network structure are available in Movie S2 in Supplementary Materials.
Figure 5.
Distribution of clustering coefficient and average assortativity for publication and data submission networks from 1992–2018. (See Table S3 for data used to generate this plot.)
Figure 6.
Assortative mixing for 2002 and 2012: (a) A densely connected cluster and a sparsely connected region in 2002. There appear to be few connections between the nodes with high assortativity mixing (in red) and those with low assortativity mixing (in blue), similar to the outer region with sparsely connected nodes. (b) The densely connected cluster shows stronger mixing between high and low assortativity in 2012, while the sparsely connected outer region appears to have little mixing between high and low assortativity.
Figure 7.
Assortativity vs. collaboration capacity. The relationship between assortativity and collaboration capacity is consistently positive, as reflected in the 2002 and 2012 snapshots of the author-level statistics. The heat map colors in the graphs show the density of the values, that is, their frequency, around the mean (vertical red line at an assortativity score of ~0.3).
Figure 8.
Average ratio of data submissions to publications, 1992–2018. The increase up to 2003 coincided with the completion of the Human Genome Project in 2003. See Table S4 for the data used to generate this plot.
Figure 9.
Change in the number and percentage of authors in data submission and publication networks from 1992 to 2018. Note that the percentages for the groups do not add up to 100% because of the overlap of authors between the data submission and publication networks. The unique publication author count and unique submission author count are calculated as the total. The overlap, then, is the intersection of the two networks (publication and submission), so the “percentage intersected” includes authors from each network’s unique author counts. The data used to draw this plot are available in Table S5.
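
The note in the Figure 9 caption about percentages not summing to 100% can be illustrated with a small set-based sketch. This is not the authors' code: the function name `author_shares`, the toy author identifiers, and in particular the choice of denominator (the number of distinct authors across both networks) are assumptions made only for illustration, since the caption does not spell out the exact formula.

```python
def author_shares(publication_authors, submission_authors):
    """Return the percentage of authors in each group for one year.

    Assumption: the denominator is the number of distinct authors
    appearing in either network; the paper may define it differently.
    """
    pubs = set(publication_authors)   # unique publication authors
    subs = set(submission_authors)    # unique data submission authors
    both = pubs & subs                # authors present in both networks
    total = len(pubs | subs)          # distinct authors overall
    if total == 0:
        return {"publication": 0.0, "submission": 0.0, "intersected": 0.0}
    return {
        "publication": 100.0 * len(pubs) / total,
        "submission": 100.0 * len(subs) / total,
        "intersected": 100.0 * len(both) / total,
    }


# Hypothetical toy data: author "c" appears in both networks.
shares = author_shares(["a", "b", "c"], ["c", "d"])
print(shares)
```

Under this assumed denominator, authors in the intersection are counted in both the publication and the submission share, so those two shares sum to more than 100%, and subtracting the intersected share brings the total back to 100%.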
