Meta-Analysis
Genes (Basel). 2021 Feb 23;12(2):319.
doi: 10.3390/genes12020319.

Meta-Analysis of Gene Popularity: Less Than Half of Gene Citations Stem from Gene Regulatory Networks


Ionut Sebastian Mihai et al. Genes (Basel).

Abstract

The reasons for selecting a gene for further study vary from historical momentum to funding availability, leading to an unequal distribution of attention across genes. However, certain biological features tend to be overlooked when evaluating a gene's popularity. Here we present a meta-analysis of the reasons why different genes have been studied and to what extent, with a focus on gene-specific biological features. From unbiased datasets we define biological properties of genes that may reasonably affect their perceived importance. We use both linear and nonlinear computational approaches to estimate gene popularity and then compare the relative importance of these features. We find that roughly 25% of the studies are the result of a historical positive feedback, which we may think of as social reinforcement. Of the remaining features, gene family membership is the most indicative, followed by disease relevance and finally regulatory pathway association. Disease relevance was an important driver until the 1990s, after which the focus shifted to exploring every single gene. We also present a resource that allows one to study the impact of reinforcement, which may guide research toward genes that have not yet received proportional attention.

Keywords: Matthew effect; biological feature; gene; gene regulatory networks; genomics; linear model; machine learning.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
(a) Features defining the total model of reinforcement and the datasets used. (b) Citations and discovered genes over time, along with landmark events in genomic biology research. #Citations are log10(number of citations + 1). #New genes are log10(number of genes + 1). Gene citation count distributions for all cell types (c) and for T lymphocytes (d). Overall gene publication frequencies follow a log-normal-like distribution, whereas a Pareto-like trend appears when a cell-type context is applied. (e) Random subset of papers, similar in size to the number of papers for T cells in (d). (f) Fit with an exponential distribution, which shows the super-exponential nature of citations. (g) Fit with the Pareto distribution, which shows similarities but is not a perfect fit either. We conclude that citations follow some intermediate distribution between these two cases. (h) Features and their relative weight contributions to the total fitted model. The order of features does not matter. The feature weights do not sum to 1, but the input features are variance-normalized. Age is negative because the year of publication is used as the feature, which after normalization corresponds to negative age. (i) Spearman’s correlation coefficients between the included features for the total fitted model.
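The comparison described for panels (f) and (g) can be illustrated as a likelihood comparison between an exponential and a Pareto fit. The following is a minimal sketch on synthetic data, not the authors' procedure: the function names, the closed-form maximum-likelihood estimates, the fixed lower bound x_min = 1, and the simulated samples are all assumptions made for illustration.

```python
import numpy as np

def log10_plus_one(counts):
    """Axis transform named in panel (b): log10(count + 1)."""
    return np.log10(np.asarray(counts, dtype=float) + 1.0)

def exponential_loglik(x, x_min=1.0):
    """Log-likelihood of an exponential fitted by MLE to the shifted data x - x_min."""
    shifted = np.asarray(x, dtype=float) - x_min
    rate = 1.0 / shifted.mean()          # MLE of the exponential rate
    return shifted.size * np.log(rate) - rate * shifted.sum()

def pareto_loglik(x, x_min=1.0):
    """Log-likelihood of a Pareto fitted by MLE with a fixed, assumed x_min."""
    x = np.asarray(x, dtype=float)
    alpha = x.size / np.log(x / x_min).sum()   # MLE of the tail exponent
    return (x.size * np.log(alpha)
            + x.size * alpha * np.log(x_min)
            - (alpha + 1.0) * np.log(x).sum())

# Synthetic samples (assumed, not real citation counts), both with support >= 1.
rng = np.random.default_rng(0)
heavy_tailed = rng.pareto(1.5, 5000) + 1.0      # classical Pareto with x_min = 1
light_tailed = rng.exponential(2.0, 5000) + 1.0  # shifted exponential

for name, sample in [("heavy", heavy_tailed), ("light", light_tailed)]:
    print(name,
          "exponential:", round(exponential_loglik(sample), 1),
          "Pareto:", round(pareto_loglik(sample), 1))
```

A higher log-likelihood indicates the better of the two fits for a given sample. Real citation counts that sit between the two cases, as the caption concludes, would favor neither model decisively.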
Figure 2
(a) Citation trends with relative #citations vs. gene family member’s index for the first 10 indices. Citations within gene families are normalized to index 1, which constitutes f_founder. (b) Gene–gene graph representation. The average citation count of neighboring genes in a gene–gene graph is used to define several features for each gene: f_coexp, f_PPI, and f_chromatin. (c) f_chromatin defined as #citations vs. ranked chromosome position of gene, for chromosome 7, colored by some gene families named at the bottom*. Chromosomal position is a high-dimensional feature that captures several relevant biological parameters and strongly influences the #citations. Gene families tend to show similar patterns of citation. (d) #Citations vs. RNA expression level (primary tissue normalized) for T lymphocytes. Expression level generally correlates positively with #citations. (e) #Citations vs. Pearson correlation values between RNA expression and #citations across cell types. The positive correlation observed in Figure 2d is consistent across cell types, reinforcing the idea that gene expression level is a critical feature in gene popularity. (f) #Citations vs. cell type-specific expression level for the gene Oct4. Despite being highly expressed in professional antigen-presenting cells, Oct4 has been cited most extensively in the context of stem cell research. This hints at the existence of underlying features not included in this study that might be paramount drivers of gene popularity. (g) f_essentiality defined as #citations vs. cellular essentiality. Gene essentiality correlates positively with gene popularity. This highlights that genes important for basic cell biology tend to produce a phenotype, which in turn facilitates gene reporting and enhances popularity. (h) UMAP projection of single-cell RNA-seq data showing the co-expression network (T cells, each point is a gene) colored by expression level and number of citations. A group of highly cited genes is highlighted in red.
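The neighbor-based features in panel (b) can be sketched as follows. This is an illustrative reading of the caption, assuming an undirected graph and raw (untransformed) citation counts; the function name, the gene labels G1 to G5, and the counts are hypothetical and not taken from the study.

```python
from collections import defaultdict

def neighbor_mean_citations(edges, citations):
    """For each gene, average the citation counts of its graph neighbors.

    edges     -- iterable of (gene_a, gene_b) pairs, treated as undirected
    citations -- dict mapping gene -> citation count
    Genes with no cited neighbor get None.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)

    feature = {}
    for gene in citations:
        values = [citations[n] for n in neighbors.get(gene, ()) if n in citations]
        feature[gene] = sum(values) / len(values) if values else None
    return feature

# Hypothetical toy graph, e.g. a co-expression or protein-protein interaction network.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G3", "G4")]
toy_citations = {"G1": 100, "G2": 10, "G3": 1, "G4": 0, "G5": 7}

toy_feature = neighbor_mean_citations(toy_edges, toy_citations)
print(toy_feature)
```

The same function would yield a different feature for each graph it is applied to, which is one way to read how a single definition gives rise to f_coexp, f_PPI, and f_chromatin.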
Figure 3
(a) Time series plot for several features where each point represents the year of first citation of a gene, from 1900 to 2010. Reinforcement sources for genes tend to vary over time, hinting at the existence of underlying, highly time-dependent social features for gene popularity (not included in this study). (b) Time series plot for all model-relevant feature weights for genes discovered in 1970–1990 and in 1991–2010. Certain features, such as expression and essentiality, seem to be especially relevant determinants of popularity for genes discovered after 1990. (c) A summary of the proposed model of gene popularity reinforcers showing the total percentage of different sources of reinforcement.
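The period comparison in panel (b) can be illustrated by fitting a linear model on variance-normalized features separately for each discovery period. This is only a sketch under stated assumptions: ordinary least squares stands in for whatever model the authors fitted, the two features are hypothetical placeholders, and the synthetic data are constructed so that the weights differ between periods. It does not reproduce any result of the study.

```python
import numpy as np

def standardized_weights(X, y):
    """Least-squares weights after scaling each column of X to zero mean, unit variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    design = np.column_stack([np.ones(len(Z)), Z])   # intercept + features
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1:]                                   # drop the intercept

def weights_by_period(X, y, first_year, periods):
    """Fit the model separately on genes whose first citation year falls in each period."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    first_year = np.asarray(first_year)
    result = {}
    for start, end in periods:
        mask = (first_year >= start) & (first_year <= end)
        result[(start, end)] = standardized_weights(X[mask], y[mask])
    return result

# Synthetic data (assumed): columns are hypothetical "disease" and "expression" features.
rng = np.random.default_rng(1)
n = 2000
first_year = rng.integers(1970, 2011, n)
disease = rng.normal(size=n)
expression = rng.normal(size=n)
early = first_year <= 1990
popularity = np.where(early,
                      1.0 * disease + 0.2 * expression,
                      0.2 * disease + 1.0 * expression) + rng.normal(scale=0.1, size=n)

X = np.column_stack([disease, expression])
period_weights = weights_by_period(X, popularity, first_year,
                                   [(1970, 1990), (1991, 2010)])
for period, w in period_weights.items():
    print(period, "disease:", round(w[0], 2), "expression:", round(w[1], 2))
```

Because the features are standardized within each period, the weights are comparable across features, but they need not sum to 1.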


