Proteomics. 2018 Nov;18(21-22):e1800093.
doi: 10.1002/pmic.201800093. Epub 2018 Oct 30.

Darkness in the Human Gene and Protein Function Space: Widely Modest or Absent Illumination by the Life Science Literature and the Trend for Fewer Protein Function Discoveries Since 2000

Swati Sinha et al. Proteomics. 2018 Nov.

Abstract

Mentions of gene names in the body of the scientific literature from 1901 to 2017, counted fractionally, are used as a proxy to assess the level of biological function discovery. A literature score of one is defined as one full publication equivalent (FPE), the amount of literature equivalent to one publication solely dedicated to a gene. Fewer than 5000 human genes each have at least 100 FPEs in the available literature corpus. This group of elite genes (4817 protein-coding genes, 119 non-coding RNAs) attracts the overwhelming majority of the scientific literature about genes. Yet thousands of proteins have never been mentioned at all, ≈2000 further proteins have less than one FPE of literature and, for ≈4600 additional proteins, the FPE count is below 10. The protein function discovery rate, measured as the number of proteins first mentioned or first crossing a threshold of accumulated FPEs in a given year, grew until 2000 but has declined since. This drop is partially offset by function discoveries for non-coding RNAs. Full sequencing of the human genome has not boosted the function discovery rate. Since 2000, the fastest growing group in the literature is that with at least 500 FPEs per gene.
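To make the idea of fractional counting concrete, here is a minimal sketch. The abstract does not state the weighting scheme, so the rule used here (each paper carries a total score of one, split equally among the distinct genes it mentions) is an assumption, and the function name `fpe_scores` and the toy gene sets are hypothetical.

```python
from collections import defaultdict


def fpe_scores(papers):
    """Accumulate a fractional literature score per gene.

    papers: iterable of collections of gene names, one per paper.
    ASSUMPTION: each paper contributes a total score of 1.0, shared
    equally by the distinct genes it mentions. The abstract does not
    specify the actual weighting used in the study.
    """
    scores = defaultdict(float)
    for genes in papers:
        unique = set(genes)
        if not unique:
            continue  # a paper that mentions no gene adds nothing
        share = 1.0 / len(unique)
        for gene in unique:
            scores[gene] += share
    return dict(scores)


# Toy corpus (hypothetical): three papers and the genes each mentions.
toy_papers = [
    {"TP53"},                                 # dedicated to one gene
    {"TP53", "MDM2"},                         # shared by two genes
    {"MDM2", "GENE_X", "GENE_Y", "GENE_Z"},   # shared by four genes
]
toy_scores = fpe_scores(toy_papers)
```

Under this assumed rule, a gene reaches one FPE either through a single paper dedicated to it or through many papers in which it is one of several genes mentioned.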

Keywords: complete human genome; gene function discovery; protein functions; scientific literature analysis.

Figures

Figure 1
Status of the mapping of life science literature accumulated until 2017 onto the human genome. For various FPE ranges (0 < FPE < 1, 1 ≤ FPE < 10, 10 ≤ FPE < 20, 20 ≤ FPE < 30, 30 ≤ FPE < 40, 40 ≤ FPE < 50, 50 ≤ FPE < 60, 60 ≤ FPE < 70, 70 ≤ FPE < 80, 80 ≤ FPE < 90, 90 ≤ FPE < 100, 100 ≤ FPE < 500, 500 ≤ FPE), the distribution of (A) the number of proteins and (B) the number of non-coding RNAs is shown as a pie chart. The accumulated FPEs (the total literature score) of the named entities within those FPE brackets are presented in (C) for protein-coding genes and (D) for non-protein-coding genes.
Figure 2
The number of new protein-coding genes in a given year whose accumulated FPE score crosses different thresholds for the first time. The notation “Tx” (where “x” is a natural number) denotes the threshold at which a given named entity has accumulated a literature score of at least “x” FPEs. So, “T0” denotes at least a single occurrence of the gene name (score > 0) in the scientific literature. “T1” requires at least one full FPE to be accumulated. The other thresholds, T5, T10, T15, T20, T25, T30, T35, T40, T45, T50, T75, T100, and T500, are defined accordingly. See Supporting Information file 1 for the exact protein-coding gene numbers. The graph for 1990–2017 is enlarged on the left and shown in a box.
Figure 3
The number of new non-protein-coding genes in a given year whose accumulated FPE score crosses different thresholds for the first time. The notation “Tx” (where “x” is a natural number) denotes the threshold at which a given named entity has accumulated a literature score of at least “x” FPEs. So, “T0” denotes at least a single occurrence of the gene name (score > 0) in the scientific literature. “T1” requires at least one full FPE to be accumulated. The other thresholds, T5, T10, T15, T20, T25, T30, T35, T40, T45, T50, T75, T100, and T500, are defined accordingly. See Supporting Information file 2 for the exact gene numbers. The graph for 1990–2017 is enlarged on the left and shown in a box.
Figure 4
The number of new genes of any type in a given year whose accumulated FPE score crosses different thresholds for the first time, for the period 1990–2017. The total gene function discovery rate (combining the data for protein-coding genes and non-coding RNAs) is shown. The notation “Tx” (where “x” is a natural number) denotes the threshold at which a given named entity has accumulated a literature score of at least “x” FPEs. So, “T0” denotes at least a single occurrence of the gene name (score > 0) in the scientific literature. “T1” requires at least one full FPE to be accumulated. The other thresholds, T5, T10, T15, T20, T25, T30, T35, T40, T45, T50, T75, T100, and T500, are defined accordingly. See Supporting Information file 2 for the exact gene numbers.
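The “Tx” thresholds described in the captions of Figures 2 to 4 can be illustrated with a short sketch. It assumes that each gene's FPE gain per year is already available as a dictionary; the function names, the gene names, and the toy numbers are hypothetical, and this is not the authors' actual pipeline. Following the captions, T0 is treated as a score strictly greater than zero and every other Tx as a score of at least x.

```python
from collections import Counter


def first_crossing_years(yearly_fpe, thresholds):
    """Return {x: first year the cumulative score crosses Tx, or None}.

    yearly_fpe: {year: FPE gained in that year} for a single gene
    (hypothetical input format).
    """
    first = {x: None for x in thresholds}
    total = 0.0
    for year in sorted(yearly_fpe):
        total += yearly_fpe[year]
        for x in thresholds:
            if first[x] is None:
                # T0 means any mention (score > 0); Tx means score >= x.
                crossed = total > 0 if x == 0 else total >= x
                if crossed:
                    first[x] = year
    return first


def discoveries_per_year(genes, thresholds):
    """Count, per threshold and year, the genes crossing Tx for the first time."""
    counts = {x: Counter() for x in thresholds}
    for yearly_fpe in genes.values():
        crossings = first_crossing_years(yearly_fpe, thresholds)
        for x, year in crossings.items():
            if year is not None:
                counts[x][year] += 1
    return counts


# Toy data (hypothetical): FPE gained per year for two genes.
toy_genes = {
    "GENE_A": {1995: 0.5, 1998: 0.75, 2003: 4.0},
    "GENE_B": {2001: 0.2},
}
toy_counts = discoveries_per_year(toy_genes, thresholds=(0, 1, 5))
```

Plotting `toy_counts[x]` against the year for each threshold would give one curve per “Tx”, analogous in form to the curves described in the captions.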
