Database (Oxford). 2016 Dec 26;2016:baw153.
doi: 10.1093/database/baw153. Print 2016.

GeneBase 1.1: a tool to summarize data from NCBI gene datasets and its application to an update of human gene statistics

Allison Piovesan et al. Database (Oxford).

Abstract

We release GeneBase 1.1, a local tool with a graphical interface useful for parsing, structuring and indexing data from the National Center for Biotechnology Information (NCBI) Gene data bank. Compared to its predecessor GeneBase (1.0), GeneBase 1.1 now allows dynamic calculation and summarization in terms of median, mean, standard deviation and total for many quantitative parameters associated with genes, gene transcripts and gene features (exons, introns, coding sequences, untranslated regions). GeneBase 1.1 thus offers the opportunity to perform analyses of the main gene structure parameters also following the search for any set of genes with the desired characteristics, allowing unique functionalities not provided by the NCBI Gene itself. In order to show the potential of our tool for local parsing, structuring and dynamic summarizing of publicly available databases for data retrieval, analysis and testing of biological hypotheses, we provide as a sample application a revised set of statistics for human nuclear genes, gene transcripts and gene features. In contrast with previous estimations strongly underestimating the length of human genes, a 'mean' human protein-coding gene is 67 kbp long, has eleven 309 bp long exons and ten 6355 bp long introns. Median, mean and extreme values are provided for many other features offering an updated reference source for human genome studies, data useful to set parameters for bioinformatic tools and interesting clues to the biomedical meaning of the gene features themselves.

Database URL: http://apollo11.isto.unibo.it/software/.
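
As a quick consistency check on the figures quoted in the abstract, the mean exon and intron values can be combined and compared with the stated 67 kbp gene length. This is only an illustrative sketch: the function name `implied_gene_length_bp` is hypothetical, it is not how GeneBase 1.1 computes its statistics, and a product of per-feature means is not in general equal to the mean gene length, so agreement here is approximate by nature.

```python
# Figures quoted in the abstract for a 'mean' human protein-coding gene.
EXON_COUNT = 11        # "eleven ... exons"
MEAN_EXON_BP = 309     # "309 bp long exons"
INTRON_COUNT = 10      # "ten ... introns"
MEAN_INTRON_BP = 6355  # "6355 bp long introns"


def implied_gene_length_bp(exon_count, exon_bp, intron_count, intron_bp):
    """Hypothetical helper: total length of a gene made of the given
    numbers of exons and introns, each with the given length in bp."""
    return exon_count * exon_bp + intron_count * intron_bp


length_bp = implied_gene_length_bp(
    EXON_COUNT, MEAN_EXON_BP, INTRON_COUNT, MEAN_INTRON_BP
)

# Express the result in bp and, rounded, in kbp for comparison with the abstract.
print(f"{length_bp} bp (~{length_bp / 1000:.0f} kbp)")  # 66949 bp (~67 kbp)
```

The result, 66 949 bp, rounds to the 67 kbp reported in the abstract, and introns account for most of that length.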

Figures

Figure 1.
(A) Gene type composition of GeneBase 1.1 Human entries for a total of 59 801 genes and (B) for 22 451 ‘REVIEWED’ or ‘VALIDATED’ genes with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript (genes not in current annotation release are excluded). Gene type labels are derived from ‘Gene_Type’ field of GeneBase 1.1 Human ‘Gene_Summary’ table as annotated in NCBI Gene as follows: protein-coding, pseudo (pseudogenes), ncRNA (non-coding RNA), snoRNA (small nucleolar RNA), snRNA (small nuclear RNA), rRNA (ribosomal RNA), tRNA (transfer RNA), ‘other’ and ‘unknown’.
Figure 2.
Number of ‘REVIEWED’ or ‘VALIDATED’ genes with at least one ‘REVIEWED’ or ‘VALIDATED’ transcript in GeneBase 1.1 Human (genes not in current annotation release are excluded), divided into protein-coding genes, pseudogenes and non-coding genes (which include genes for ribosomal RNAs, small nucleolar RNAs, small nuclear RNAs and non-coding RNAs) for each human chromosome. See Table 1 and Supplementary Table S2 for more details.
Figure 3.
Exon (A) and intron (B) length distributions considering GeneBase 1.1 Human ‘Gene_Table’ records with a ‘VALIDATED’ or ‘REVIEWED’ RefSeq status, with an ‘NM_’ (protein-coding RNAs, continuous lines) or ‘NR_’ (non-coding RNAs, dotted lines) type of corresponding RefSeq RNA accession number, belonging to ‘REVIEWED’ or ‘VALIDATED’ genes excluding those not in current annotation release.
