PLoS Biol. 2004 Jun;2(6):e162.
doi: 10.1371/journal.pbio.0020162. Epub 2004 Apr 20.

Integrative annotation of 21,037 human genes validated by full-length cDNA clones

Tadashi Imanishi, Takeshi Itoh, Yutaka Suzuki, Claire O'Donovan, Satoshi Fukuchi, Kanako O Koyanagi, Roberto A Barrero, Takuro Tamura, Yumi Yamaguchi-Kabata, Motohiko Tanino, Kei Yura, Satoru Miyazaki, Kazuho Ikeo, Keiichi Homma, Arek Kasprzyk, Tetsuo Nishikawa, Mika Hirakawa, Jean Thierry-Mieg, Danielle Thierry-Mieg, Jennifer Ashurst, Libin Jia, Mitsuteru Nakao, Michael A Thomas, Nicola Mulder, Youla Karavidopoulou, Lihua Jin, Sangsoo Kim, Tomohiro Yasuda, Boris Lenhard, Eric Eveno, Yoshiyuki Suzuki, Chisato Yamasaki, Jun-ichi Takeda, Craig Gough, Phillip Hilton, Yasuyuki Fujii, Hiroaki Sakai, Susumu Tanaka, Clara Amid, Matthew Bellgard, Maria de Fatima Bonaldo, Hidemasa Bono, Susan K Bromberg, Anthony J Brookes, Elspeth Bruford, Piero Carninci, Claude Chelala, Christine Couillault, Sandro J de Souza, Marie-Anne Debily, Marie-Dominique Devignes, Inna Dubchak, Toshinori Endo, Anne Estreicher, Eduardo Eyras, Kaoru Fukami-Kobayashi, Gopal R Gopinath, Esther Graudens, Yoonsoo Hahn, Michael Han, Ze-Guang Han, Kousuke Hanada, Hideki Hanaoka, Erimi Harada, Katsuyuki Hashimoto, Ursula Hinz, Momoki Hirai, Teruyoshi Hishiki, Ian Hopkinson, Sandrine Imbeaud, Hidetoshi Inoko, Alexander Kanapin, Yayoi Kaneko, Takeya Kasukawa, Janet Kelso, Paul Kersey, Reiko Kikuno, Kouichi Kimura, Bernhard Korn, Vladimir Kuryshev, Izabela Makalowska, Takashi Makino, Shuhei Mano, Regine Mariage-Samson, Jun Mashima, Hideo Matsuda, Hans-Werner Mewes, Shinsei Minoshima, Keiichi Nagai, Hideki Nagasaki, Naoki Nagata, Rajni Nigam, Osamu Ogasawara, Osamu Ohara, Masafumi Ohtsubo, Norihiro Okada, Toshihisa Okido, Satoshi Oota, Motonori Ota, Toshio Ota, Tetsuji Otsuki, Dominique Piatier-Tonneau, Annemarie Poustka, Shuang-Xi Ren, Naruya Saitou, Katsunaga Sakai, Shigetaka Sakamoto, Ryuichi Sakate, Ingo Schupp, Florence Servant, Stephen Sherry, Rie Shiba, Nobuyoshi Shimizu, Mary Shimoyama, Andrew J Simpson, Bento Soares, Charles Steward, Makiko Suwa, Mami Suzuki, Aiko Takahashi, Gen Tamiya, Hiroshi Tanaka, Todd Taylor, Joseph D Terwilliger, Per Unneberg, Vamsi Veeramachaneni, Shinya Watanabe, Laurens Wilming, Norikazu Yasuda, Hyang-Sook Yoo, Marvin Stodolsky, Wojciech Makalowski, Mitiko Go, Kenta Nakai, Toshihisa Takagi, Minoru Kanehisa, Yoshiyuki Sakaki, John Quackenbush, Yasushi Okazaki, Yoshihide Hayashizaki, Winston Hide, Ranajit Chakraborty, Ken Nishikawa, Hideaki Sugawara, Yoshio Tateno, Zhu Chen, Michio Oishi, Peter Tonellato, Rolf Apweiler, Kousaku Okubo, Lukas Wagner, Stefan Wiemann, Robert L Strausberg, Takao Isogai, Charles Auffray, Nobuo Nomura, Takashi Gojobori, Sumio Sugano

Tadashi Imanishi et al. PLoS Biol. 2004 Jun.

Erratum in

  • PLoS Biol. 2004 Jul;2(7):e256

Abstract

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

Conflict of interest statement

The authors have declared that no conflicts of interest exist.

Figures

Figure 1. Procedure for Mapping and Clustering the H-Inv cDNAs
The cDNAs were mapped to the genome and clustered into loci. The remaining unmapped cDNAs were clustered based upon the grouping of significantly similar cDNAs.
Figure 2. A Comparison of the Mapped H-Inv FLcDNAs and the RefSeq mRNAs
The mapped H-Inv cDNAs, the RefSeq curated mRNAs (accession prefixes NM and NR), and the RefSeq model mRNAs (accession prefixes XM and XR) provided by the genome annotation process were clustered based on the genome position. The numbers of loci that were identified by clustering are shown.
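Clustering transcripts by genome position, as the Figure 2 caption describes, can be illustrated with a minimal sketch. It assumes that two transcripts belong to the same locus when they lie on the same chromosome and strand and their mapped spans overlap by at least one base; the caption does not state the actual criterion, and the function name, record layout, and transcript IDs below are hypothetical.

```python
from collections import defaultdict

def cluster_into_loci(transcripts):
    """Group mapped transcripts into loci by overlapping genomic span.

    transcripts: iterable of (transcript_id, chrom, strand, start, end)
    with inclusive coordinates. The overlap rule (same chromosome and
    strand, spans sharing at least one base) is an assumption made for
    illustration, not the criterion used by the H-Inv annotation.
    """
    by_key = defaultdict(list)
    for tid, chrom, strand, start, end in transcripts:
        by_key[(chrom, strand)].append((start, end, tid))

    loci = []
    for (chrom, strand), items in sorted(by_key.items()):
        items.sort()  # order by start coordinate
        cur_start, cur_end, members = None, None, []
        for start, end, tid in items:
            if members and start <= cur_end:
                # Overlaps the current locus: extend it.
                cur_end = max(cur_end, end)
                members.append(tid)
            else:
                # No overlap: close the current locus and open a new one.
                if members:
                    loci.append((chrom, strand, cur_start, cur_end, members))
                cur_start, cur_end, members = start, end, [tid]
        if members:
            loci.append((chrom, strand, cur_start, cur_end, members))
    return loci
```

Passing cDNA and RefSeq records together in one call yields loci whose member lists show which sources support each locus, which is the kind of count the figure reports.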
Figure 3. An Example of Different Structures Encoded by AS Variants
Exons are presented from the 5′ end, with those shared by AS variants aligned vertically. The AS variants, with accession numbers AK095301 and BC007828, are aligned to the SCOP domain d.136.1.1 and corresponding PDB structure 1byr. Helices and beta sheets are red and yellow, respectively. Green bars indicate regions aligned to the PDB structure, while open rectangles represent gaps in the alignments. AK095301 is aligned to the entire PDB structure shown, while BC007828 is lacking the alignment to the purple segment of the structure.
Figure 4. Schematic Diagram of Human Curation for H-Inv Proteins
The diagram illustrates the human curation pipeline used to classify H-Inv proteins into five similarity categories: Category I, II, III, IV, and V proteins.
Figure 5. The Manual Annotation Flow Chart of ncRNAs
Candidate non-protein-coding genes were compared with the human genome, ESTs, cDNA 3′-end features and the locus genomic environment. The candidates were then classified into four categories: hold (cDNAs improperly mapped onto the human genome); uncharacterized transcripts (transcripts overlapping a sense gene or located within 5 kb of a neighboring gene with EST support); putative ncRNAs (multiexon or single exon transcripts supported by ESTs or 3′-end features); and unclassifiable (possible genomic fragments).
Figure 6. The Functional Classification of H-Inv Proteins That Are Homologous to Proteins in Each Taxonomic Group
The numbers of representative H-Inv cDNAs with sequence homology to other species' proteins (E < 10^−5) were calculated. The cDNAs for which we could not assign any functions were discarded. Mammalian species were excluded from the “animal” group. “Eukaryote” represents eukaryotic species other than those included in the mammal, animal, fungi, and plant groups. See also Table S7.
Figure 7. Window Analysis of Similarity between Human and Mouse UTRs
Results for 5′ UTRs are presented above and those for 3′ UTRs below. The whole mRNA sequences were aligned using a semiglobal algorithm as implemented in the map program (Huang 1994) with the following parameters: match 10, mismatch −3, gap opening penalty −50, gap extension penalty −5, and longest penalized gap 10; terminal gaps are not penalized. A window size of 20 bp was used with a step of 10 bp. The analysis window was moved upstream of the start codon for 5′ UTRs and downstream of the stop codon for 3′ UTRs. The normalized score for a given window is the average score of all UTRs in that window divided by the maximum score observed in all 5′ or 3′ UTRs, respectively.
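The window normalization in the Figure 7 caption can be sketched as follows. This is a simplified illustration, not the map program itself: it assumes each UTR alignment is already available as two equal-length gapped strings oriented so that index 0 is adjacent to the start or stop codon, it measures windows in alignment columns, it charges the gap opening penalty for the first gap column and the extension penalty for each further column, and it ignores the "longest penalized gap" and terminal-gap rules. The function names are hypothetical.

```python
MATCH, MISMATCH = 10, -3
GAP_OPEN, GAP_EXTEND = -50, -5
WINDOW, STEP = 20, 10

def window_scores(a, b):
    """Score successive windows of one pairwise alignment.

    a, b: equal-length aligned sequences with '-' marking gaps.
    Gap cost convention (open for the first column, extend afterwards)
    is an assumption made for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(0, len(a) - WINDOW + 1, STEP):
        score, in_gap = 0, False
        for x, y in zip(a[start:start + WINDOW], b[start:start + WINDOW]):
            if x == "-" or y == "-":
                score += GAP_EXTEND if in_gap else GAP_OPEN
                in_gap = True
            else:
                score += MATCH if x == y else MISMATCH
                in_gap = False
        scores.append(score)
    return scores

def normalized_profile(alignments):
    """Average score per window position over all UTRs, divided by the
    maximum window score observed in any UTR."""
    per_utr = [window_scores(a, b) for a, b in alignments]
    all_scores = [v for s in per_utr for v in s]
    if not all_scores or max(all_scores) <= 0:
        raise ValueError("need at least one window with a positive score")
    best = max(all_scores)
    profile = []
    for i in range(max(len(s) for s in per_utr)):
        # Shorter UTRs simply do not contribute to distant windows.
        vals = [s[i] for s in per_utr if i < len(s)]
        profile.append(sum(vals) / len(vals))
    return [m / best for m in profile]
```

Running `normalized_profile` separately on the 5′ and 3′ UTR alignments gives one profile per panel, with each value expressed as a fraction of the best window score in that set.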
