PLoS Biol. 2004 Jun;2(6):e162.

doi: 10.1371/journal.pbio.0020162. Epub 2004 Apr 20.

Integrative annotation of 21,037 human genes validated by full-length cDNA clones

Tadashi Imanishi¹, Takeshi Itoh, Yutaka Suzuki, Claire O'Donovan, Satoshi Fukuchi, Kanako O Koyanagi, Roberto A Barrero, Takuro Tamura, Yumi Yamaguchi-Kabata, Motohiko Tanino, Kei Yura, Satoru Miyazaki, Kazuho Ikeo, Keiichi Homma, Arek Kasprzyk, Tetsuo Nishikawa, Mika Hirakawa, Jean Thierry-Mieg, Danielle Thierry-Mieg, Jennifer Ashurst, Libin Jia, Mitsuteru Nakao, Michael A Thomas, Nicola Mulder, Youla Karavidopoulou, Lihua Jin, Sangsoo Kim, Tomohiro Yasuda, Boris Lenhard, Eric Eveno, Yoshiyuki Suzuki, Chisato Yamasaki, Jun-ichi Takeda, Craig Gough, Phillip Hilton, Yasuyuki Fujii, Hiroaki Sakai, Susumu Tanaka, Clara Amid, Matthew Bellgard, Maria de Fatima Bonaldo, Hidemasa Bono, Susan K Bromberg, Anthony J Brookes, Elspeth Bruford, Piero Carninci, Claude Chelala, Christine Couillault, Sandro J de Souza, Marie-Anne Debily, Marie-Dominique Devignes, Inna Dubchak, Toshinori Endo, Anne Estreicher, Eduardo Eyras, Kaoru Fukami-Kobayashi, Gopal R Gopinath, Esther Graudens, Yoonsoo Hahn, Michael Han, Ze-Guang Han, Kousuke Hanada, Hideki Hanaoka, Erimi Harada, Katsuyuki Hashimoto, Ursula Hinz, Momoki Hirai, Teruyoshi Hishiki, Ian Hopkinson, Sandrine Imbeaud, Hidetoshi Inoko, Alexander Kanapin, Yayoi Kaneko, Takeya Kasukawa, Janet Kelso, Paul Kersey, Reiko Kikuno, Kouichi Kimura, Bernhard Korn, Vladimir Kuryshev, Izabela Makalowska, Takashi Makino, Shuhei Mano, Regine Mariage-Samson, Jun Mashima, Hideo Matsuda, Hans-Werner Mewes, Shinsei Minoshima, Keiichi Nagai, Hideki Nagasaki, Naoki Nagata, Rajni Nigam, Osamu Ogasawara, Osamu Ohara, Masafumi Ohtsubo, Norihiro Okada, Toshihisa Okido, Satoshi Oota, Motonori Ota, Toshio Ota, Tetsuji Otsuki, Dominique Piatier-Tonneau, Annemarie Poustka, Shuang-Xi Ren, Naruya Saitou, Katsunaga Sakai, Shigetaka Sakamoto, Ryuichi Sakate, Ingo Schupp, Florence Servant, Stephen Sherry, Rie Shiba, Nobuyoshi Shimizu, Mary Shimoyama, Andrew J Simpson, Bento Soares, Charles Steward, Makiko Suwa, Mami Suzuki, Aiko Takahashi, Gen Tamiya, Hiroshi Tanaka, 
Todd Taylor, Joseph D Terwilliger, Per Unneberg, Vamsi Veeramachaneni, Shinya Watanabe, Laurens Wilming, Norikazu Yasuda, Hyang-Sook Yoo, Marvin Stodolsky, Wojciech Makalowski, Mitiko Go, Kenta Nakai, Toshihisa Takagi, Minoru Kanehisa, Yoshiyuki Sakaki, John Quackenbush, Yasushi Okazaki, Yoshihide Hayashizaki, Winston Hide, Ranajit Chakraborty, Ken Nishikawa, Hideaki Sugawara, Yoshio Tateno, Zhu Chen, Michio Oishi, Peter Tonellato, Rolf Apweiler, Kousaku Okubo, Lukas Wagner, Stefan Wiemann, Robert L Strausberg, Takao Isogai, Charles Auffray, Nobuo Nomura, Takashi Gojobori, Sumio Sugano

PMID: 15103394
PMCID: PMC393292
DOI: 10.1371/journal.pbio.0020162

Tadashi Imanishi et al. PLoS Biol. 2004 Jun.

Affiliation

¹ Integrated Database Group, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan.

Erratum in

PLoS Biol. 2004 Jul;2(7):e256

Abstract

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.
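The proportions implied by the counts in the abstract can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names and the `pct` helper are illustrative, and any percentage other than the stated 6.5% is derived here rather than reported in the abstract.

```python
# Counts quoted in the abstract (names are illustrative, not from the paper).
TOTAL_LOCI = 21_037          # validated human gene candidates
NEW_CANDIDATES = 5_155       # newly identified gene candidates
NO_GOOD_ORF = 1_377          # loci without a good protein-coding ORF
NCRNA_CANDIDATES = 296       # strong non-protein-coding RNA candidates
MAPPED_VARIANTS = 72_027     # uniquely mapped SNPs and indels within genes
CODING_VARIANTS = {
    "nonsynonymous SNPs": 13_215,
    "nonsense SNPs": 315,
    "indels": 452,
}


def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    coding_total = sum(CODING_VARIANTS.values())
    print(f"loci without a good ORF: {pct(NO_GOOD_ORF, TOTAL_LOCI):.1f}% of loci")
    print(f"new gene candidates: {pct(NEW_CANDIDATES, TOTAL_LOCI):.1f}% of loci")
    print(f"ncRNA candidates: {pct(NCRNA_CANDIDATES, NO_GOOD_ORF):.1f}% of ORF-less loci")
    print(f"coding-region variants: {coding_total} "
          f"({pct(coding_total, MAPPED_VARIANTS):.1f}% of mapped variants)")
```

The first line reproduces the 6.5% figure stated in the abstract; the remaining ratios are simple consequences of the reported counts.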

Conflict of interest statement

The authors have declared that no conflicts of interest exist.

Figures

**Figure 1. Procedure for Mapping and Clustering the H-Inv cDNAs**
The cDNAs were mapped to the genome and clustered into loci. The remaining unmapped cDNAs were clustered based upon the grouping of significantly similar cDNAs.

**Figure 2. A Comparison of the Mapped H-Inv FLcDNAs and the RefSeq mRNAs**
The mapped H-Inv cDNAs, the RefSeq curated mRNAs (accession prefixes NM and NR), and the RefSeq model mRNAs (accession prefixes XM and XR) provided by the genome annotation process were clustered based on the genome position. The numbers of loci that were identified by clustering are shown.

**Figure 3. An Example of Different Structures Encoded by AS Variants**
Exons are presented from the 5′ end, with those shared by AS variants aligned vertically. The AS variants, with accession numbers AK095301 and BC007828, are aligned to the SCOP domain d.136.1.1 and corresponding PDB structure 1byr. Helices and beta sheets are red and yellow, respectively. Green bars indicate regions aligned to the PDB structure, while open rectangles represent gaps in the alignments. AK095301 is aligned to the entire PDB structure shown, while BC007828 is lacking the alignment to the purple segment of the structure.

**Figure 4. Schematic Diagram of Human Curation for H-Inv Proteins**
The diagram illustrates the human curation pipeline used to classify H-Inv proteins into five similarity categories: Category I, II, III, IV, and V proteins.

**Figure 5. The Manual Annotation Flow Chart of ncRNAs**
Candidate non-protein-coding genes were compared with the human genome, ESTs, cDNA 3′-end features and the locus genomic environment. The candidates were then classified into four categories: hold (cDNAs improperly mapped onto the human genome); uncharacterized transcripts (transcripts overlapping a sense gene or located within 5 kb of a neighboring gene with EST support); putative ncRNAs (multiexon or single exon transcripts supported by ESTs or 3′-end features); and unclassifiable (possible genomic fragments).

**Figure 6. The Functional Classification of H-Inv Proteins That Are Homologous to Proteins in Each Taxonomic Group**
The numbers of representative H-Inv cDNAs with sequence homology to other species' proteins (E < 10⁻⁵) were calculated. The cDNAs for which we could not assign any functions were discarded. Mammalian species were excluded from the “animal” group. “Eukaryote” represents eukaryotic species other than those included in the mammal, animal, fungi, and plant groups. See also Table S7.

**Figure 7. Window Analysis of Similarity between Human and Mouse UTRs**
Results for 5′ UTRs are presented above and those for 3′ UTRs below. The whole mRNA sequences were aligned using a semiglobal algorithm as implemented in the map program (Huang 1994) with the following parameters: match 10, mismatch −3, gap opening penalty −50, gap extension penalty −5, and longest penalized gap 10; terminal gaps are not penalized. A window size of 20 bp was used with a step of 10 bp. The analysis window was moved upstream of start codons and downstream of stop codons. The normalized score for a given window is calculated as a fraction of an average score for all UTRs in a given window over the maximum score observed in all 5′ or 3′ UTRs, respectively.
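The window analysis described in the Figure 7 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code or the map program: it assumes the human and mouse UTRs have already been aligned (gaps written as `-`) and oriented so that column 0 lies next to the start or stop codon. It also assumes that a gap of length k costs the opening penalty plus k times the extension penalty, it ignores the "longest penalized gap" and terminal-gap rules, and it reads "maximum score observed" as the highest single-window score across all UTR pairs. All function names are hypothetical.

```python
from statistics import mean

# Scoring parameters quoted in the Figure 7 caption.
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 10, -3, -50, -5
WINDOW, STEP = 20, 10  # window size and step, in alignment columns


def column_scores(a: str, b: str) -> list[int]:
    """Score each column of a gapped pairwise alignment.

    Assumption: a gap run costs GAP_OPEN once plus GAP_EXTEND per column.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    prev_gap = None  # which sequence carried a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gap_in = "a" if x == "-" else "b"
            opening = GAP_OPEN if gap_in != prev_gap else 0
            scores.append(GAP_EXTEND + opening)
            prev_gap = gap_in
        else:
            scores.append(MATCH if x == y else MISMATCH)
            prev_gap = None
    return scores


def window_scores(a: str, b: str, window: int = WINDOW, step: int = STEP) -> list[int]:
    """Sum column scores in windows moving away from the codon (column 0)."""
    cols = column_scores(a, b)
    return [sum(cols[i:i + window]) for i in range(0, len(cols) - window + 1, step)]


def normalized_profile(aligned_pairs: list[tuple[str, str]]) -> list[float]:
    """Average the k-th window score over all UTR pairs, then divide by the
    largest single-window score observed in any pair."""
    per_utr = [window_scores(a, b) for a, b in aligned_pairs]
    top = max((s for w in per_utr for s in w), default=0)
    if top <= 0:
        return []
    n_windows = max(len(w) for w in per_utr)
    return [mean(w[k] for w in per_utr if len(w) > k) / top for k in range(n_windows)]
```

For a 5′ UTR, the aligned strings would be reversed before calling `normalized_profile` so that windows move upstream from the start codon; for a 3′ UTR they would be passed as-is. Shorter UTRs simply stop contributing to windows beyond their length, which is one possible way to handle unequal UTR lengths.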

Grants and funding

R01 HL064541/HL/NHLBI NIH HHS/United States
