PLoS Biol. 2018 Sep 18;16(9):e2006643.
doi: 10.1371/journal.pbio.2006643. eCollection 2018 Sep.

Large-scale investigation of the reasons why potentially important genes are ignored

Thomas Stoeger et al. PLoS Biol. 2018.

Abstract

Biomedical research has been previously reported to primarily focus on a minority of all known genes. Here, we demonstrate that these differences in attention can be explained, to a large extent, exclusively from a small set of identifiable chemical, physical, and biological properties of genes. Together with knowledge about homologous genes from model organisms, these features allow us to accurately predict the number of publications on individual human genes, the year of their first report, the levels of funding awarded by the National Institutes of Health (NIH), and the development of drugs against disease-associated genes. By explicitly identifying the reasons for gene-specific bias and performing a meta-analysis of existing computational and experimental knowledge bases, we describe gene-specific strategies for the identification of important but hitherto ignored genes that can open novel directions for future investigation.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Physical, chemical, and biological features of genes predict the number of publications.
(A) Illustration of modeling approach and prediction of number of research publications for single genes using information on 430 physical, chemical, and biological features of genes (S1 Data). (B) Research publications on individual genes grouped by t-SNE visualization using the 15 features most important to the models used in (A). Heatmaps show z-scored values of the 15 features for the genes in each cluster. Order of features as indicated in S3A Fig (S1 Data). SRP, signal recognition particle; t-SNE, t-distributed stochastic neighbor embedding.
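The z-scoring used for the heatmaps in (B) can be illustrated with a minimal sketch. It assumes that each feature is standardized across genes with the population standard deviation; the helper name `z_scores` and the feature values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize one feature across genes: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population (not sample) standard deviation
    if sigma == 0:
        # A constant feature carries no contrast, so every gene maps to 0.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical values of a single feature for four genes.
feature_values = [2.0, 4.0, 4.0, 6.0]
scores = z_scores(feature_values)
```

After standardization, the scores of a feature have mean 0 and unit spread, which is what makes features with different units comparable in one heatmap.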
Fig 2. Features of genes and homologous genes predict discovery of human genes.
(A) Number of publications per gene for past and recent research. Publications of past research (until 2010) are scaled so that the total number of publications matches present research (2011–2015). Dashed grey lines delimit three standard deviations away from the mean. (B) Prediction of the number of research publications for the model of Fig 1A extended by including the year of the first publication on the specific human gene (S1 Data). (C) Prediction of the year of discovery using the features from Fig 1A (S1 Data). (D) Percentage of publications that cite publications with nonhuman genes more frequently than they cite publications with human genes (S1 Data). (E) Prediction of the year of initial publications on individual genes using the features from Fig 1A and the year of the initial publication on homologous genes of nonhuman model organisms (S1 Data). (F) Prediction of the number of research publications using the features of Fig 1A and the number of publications on homologous genes (S1 Data).
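The scaling described in (A) can be sketched as follows, assuming a single global factor is applied to all past counts so that their total equals the total for the present period. The gene names, counts, and the function name `scale_to_match_total` are hypothetical illustrations, not values or code from the paper.

```python
def scale_to_match_total(past_counts, present_counts):
    """Rescale past per-gene counts so their sum equals the present total."""
    past_total = sum(past_counts.values())
    if past_total == 0:
        raise ValueError("past counts sum to zero; cannot rescale")
    # Assumption: one global factor, applied identically to every gene.
    factor = sum(present_counts.values()) / past_total
    return {gene: n * factor for gene, n in past_counts.items()}


# Hypothetical publication counts per gene.
past = {"GENE_A": 800, "GENE_B": 150, "GENE_C": 50}     # until 2010
present = {"GENE_A": 120, "GENE_B": 60, "GENE_C": 20}   # 2011-2015
scaled_past = scale_to_match_total(past, present)
```

With both periods on the same total, a gene whose scaled past count differs strongly from its present count has gained or lost relative attention.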
Fig 3. Many potentially important genes are not being studied enough.
(A) Relative enrichment of the presence of genes with genetic loss-of-function (LoF) intolerance, presence of genes with GWAS traits, and the attention within publications. (B) Predicted versus actual NIH budget spending on individual genes (dots). The black line shows a lowess fit and the dashed lines show the two distinct regimes of the prediction (S1 Data). (C) Fraction of disease-linked genes with at least one experimental drug conditioned on the predicted order of discovery according to the model shown in Fig 2B. Error bars show 95% confidence intervals for the estimations. GWAS, genome-wide association study; LoF, loss-of-function; NIH, National Institutes of Health; USD, US dollar.
Fig 4. Identifying and exploring ignored genes.
(A) Estimation of the years until all genes are studied if scientific enterprise continues to follow trends reported above. Number of genes with at least n focused (single-gene) publications per year. Dashed lines show extrapolation of the bounds of linear regression for recent years. (B) Percentage of highly cited studies (top 5% in number of citations) in the 8 years following their publication. Error bars show 95% confidence intervals. (C) Percentage of genes with a strong RNAi phenotype, at least one tissue with moderate RNA abundance, presence of a Drosophila melanogaster homolog, or membership in a complex with highly studied genes. Highly studied genes show higher percentages for all these characteristics, but many unstudied genes also share those characteristics. (D) Illustration of bias in identification of hits in distinct large-scale experimental approaches. Interaction studies refer to studies labelled as “High throughput” within BioGRID. Relative hits marks fold enrichment over equal occurrence (S1 Data). (E) Genes grouped by t-SNE visualization using the 15 features most important to the models used in Fig 1A. Large circles highlight genes with frequently discovered GWAS traits. Heatmaps show presence of strong genetic evidence (G), experimental potential (E), and homolog in invertebrate model organism (M). Note the lack of a strong correlation between GEM characteristics and research attention. E, experimental potential; FPKM, fragments per kilobase of transcript per million mapped reads; G, strong genetic support; GEM, strong genetic support and experimental potential and homolog in invertebrate model organism; GWAS, genome-wide association study; M, model organism; RNAi, RNA interference; t-SNE, t-distributed stochastic neighbor embedding.
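One possible reading of "fold enrichment over equal occurrence" in (D) is the share of hits falling into a group of genes divided by the share expected if hits were spread evenly across all genes. The sketch below follows that interpretation, which is an assumption; the function name `fold_enrichment` and all numbers are hypothetical.

```python
def fold_enrichment(hits_in_group, total_hits, genes_in_group, total_genes):
    """Observed share of hits in a group divided by the share expected
    if every gene were equally likely to be a hit."""
    observed = hits_in_group / total_hits
    expected = genes_in_group / total_genes  # assumption: equal occurrence per gene
    return observed / expected


# Hypothetical example: a group holding 10% of all genes receives 30% of the hits.
fold = fold_enrichment(hits_in_group=30, total_hits=100,
                       genes_in_group=2000, total_genes=20000)
```

Under this reading, a value above 1 means the group is over-represented among hits and a value below 1 means it is under-represented.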

Comment in

  • Reply to "Far away from the lamppost".
    Stoeger T, Gerlach M, Morimoto RI, Amaral LAN. PLoS Biol. 2018 Dec 11;16(12):e3000075. doi: 10.1371/journal.pbio.3000075. eCollection 2018 Dec. PMID: 30532190. Free PMC article.
  • Far away from the lamppost.
    Oprea TI, Jan L, Johnson GL, Roth BL, Ma'ayan A, Schürer S, Shoichet BK, Sklar LA, McManus MT. PLoS Biol. 2018 Dec 11;16(12):e3000067. doi: 10.1371/journal.pbio.3000067. eCollection 2018 Dec. PMID: 30532236. Free PMC article.
