Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33.

doi: 10.1073/pnas.0709013104. Epub 2007 Nov 26.

Distinguishing protein-coding and noncoding genes in the human genome

Michele Clamp¹, Ben Fry, Mike Kamal, Xiaohui Xie, James Cuff, Michael F Lin, Manolis Kellis, Kerstin Lindblad-Toh, Eric S Lander

PMID: 18040051
PMCID: PMC2148306
DOI: 10.1073/pnas.0709013104

Michele Clamp et al. Proc Natl Acad Sci U S A. 2007.

Affiliation

¹ Broad Institute of Massachusetts Institute of Technology and Harvard, 7 Cambridge Center, Cambridge, MA 02142, USA. mclamp@broad.mit.edu

Abstract

Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Fig. 1.**
Flowchart of the analysis. The central pipeline illustrates the computational analysis of 21,895 putative genes in the Ensembl catalog (v35). We then performed manual inspection of 1,178 cases to obtain the tables of likely valid and invalid genes. See text for details.

**Fig. 2.**
Cumulative distributions of RFC score. (*Left*) Human genes with cross-species orthologs (blue) versus matched random controls (black). (*Right*) Human orphans (red) versus matched random controls (black). RFC scores are calculated relative to mouse and dog together (*Top*), macaque (*Middle*) and chimpanzee (*Bottom*). In all cases, the orthologs are strikingly different from their matched random controls, whereas the orphans are essentially indistinguishable from their matched random controls.

**Fig. 3.**
An example gene report card for a small gene, HAMP, on chromosome 19. Report cards for all 22,218 putative genes in Ensembl v35 are available at www.broad.mit.edu/mammals/alpheus. The report cards provide a visual framework for studying cross-species conservation and for spotting possible problems in the human gene annotation. Information at the top shows chromosomal location; alternative identifiers; and summary information, such as length, number of exons, and repeat content. Various panels below provide graphical views of the alignment of the human gene to the mouse and dog genomes. “Synteny” shows the large-scale alignment of genomic sequence, indicating both aligned and unaligned segments. The human sequence is annotated with the exons in white and repetitive sequence in dark gray. “Alignment detail” shows the complete DNA sequence alignment and protein alignment. In the DNA alignment, the human sequence is given at the top, bases in the other species are marked as matching (light gray) or nonmatching (dark gray), exon boundaries are marked by vertical lines, indels are marked by small triangles above the sequence (vertex down for insertions, vertex up for deletions, number indicating length in bases), the annotated start codon is in green, and the annotated stop codon is in purple. In the protein alignment, the human amino acid sequence is given at the top, and the sequences in the other species are marked as matching (light gray), similar (pink), or nonmatching (red). “Frame alignment” shows the distribution of nucleotide mismatches found in each codon position, with excess mutations expected in the third position. Matches are shown in light gray, and mismatches are shown in dark gray. “Indels, starts and stops” provides an overview of key events. Indels are indicated by triangles (vertex down for insertions, vertex up for deletions) and marked as frameshifting (red) or frame-preserving (gray). Start codons are marked in green and stop codons in purple.
“Splice sites” shows sequence conservation around splice sites, with two-base donor and acceptor sites highlighted in gray and mismatching bases indicated in red. “Summary data” lists various conservation statistics relative to mouse and dog, including RFC score, nucleotide identity, number of conserved splice sites, frameshifting and nonframeshifting indel density/kb, and gene neighborhood. The gene neighborhood shows a dot for the three upstream and downstream genes, which is colored gray if synteny is preserved and red otherwise.
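
The comparison in Fig. 2 is between cumulative distributions of RFC scores for a gene set and for its matched random controls. As a minimal sketch of how two such curves might be compared, the Python below builds empirical cumulative distribution functions and reports the largest vertical gap between them (the two-sample Kolmogorov-Smirnov statistic). The function names `ecdf` and `max_cdf_gap` and all score values are hypothetical and chosen for illustration only; this record does not state which statistical procedure the authors used, so this should not be read as their method.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F, where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


def max_cdf_gap(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    F_a, F_b = ecdf(sample_a), ecdf(sample_b)
    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(F_a(x) - F_b(x)) for x in points)


# Toy RFC-like scores in percent. These are invented for illustration
# and are NOT data from the paper.
conserved_like = [92, 95, 97, 98, 99, 100]
control_like = [30, 35, 42, 48, 55, 60]
orphan_like = [31, 36, 41, 50, 54, 61]

gap_conserved = max_cdf_gap(conserved_like, control_like)
gap_orphan = max_cdf_gap(orphan_like, control_like)

print("conserved-like vs control:", gap_conserved)
print("orphan-like vs control:", gap_orphan)
```

With toy inputs like these, a set that is well separated from its controls gives a gap near 1, while a set that tracks its controls gives a gap near 0. That is the qualitative contrast the caption describes between orthologs and orphans.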

Grants and funding

R01 HG004037/HG/NHGRI NIH HHS/United States
