Comparative Study
Genome Res. 2007 Dec;17(12):1763-73.
doi: 10.1101/gr.7128207. Epub 2007 Nov 7.

Targeted discovery of novel human exons by comparative genomics


Adam Siepel et al. Genome Res. 2007 Dec.

Abstract

A complete and accurate set of human protein-coding gene annotations is perhaps the single most important resource for genomic research after the human genome sequence itself, yet the major gene catalogs remain incomplete and imperfect. Here we describe a genome-wide effort, carried out as part of the Mammalian Gene Collection (MGC) project, to identify human genes not yet in the gene catalogs. Our approach was to produce gene predictions by algorithms that rely on comparative sequence data but do not require direct cDNA evidence, then to test predicted novel genes by RT-PCR. We have identified 734 novel gene fragments (NGFs) containing 2188 exons with, at most, weak prior cDNA support. These NGFs correspond to an estimated 563 distinct genes, of which >160 are completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes. The NGFs appear to be predominantly protein-coding genes rather than noncoding RNAs, unlike novel transcribed sequences identified by technologies such as tiling arrays and CAGE. They tend to be expressed at low levels and in a tissue-specific manner, and they are enriched for roles in motor activity, cell adhesion, connective tissue, and central nervous system development. Our results demonstrate that many important genes and gene fragments have been missed by traditional approaches to gene discovery but can be identified by their evolutionary signatures using comparative sequence data. However, they suggest that hundreds, not thousands, of protein-coding genes are completely missing from the current gene catalogs.


Figures

Figure 1.
(A) Flowchart for computational exon discovery (CED). Beginning with three sets of gene predictions, candidate novel genes are tested for evidence of expression and splicing in several rounds of candidate selection, RT–PCR amplification, and sequencing. The result is a large set of EST-like sequences, called RSTs, that provide supporting evidence for novel protein-coding exons but do not define full-length transcripts. (B) Illustration of CED. Gene 1 is known and well supported by public cDNA sequences, so overlapping gene predictions are ignored. Predicted gene 2 appears to be novel and is selected for RT–PCR validation, but the validation experiment fails. Predicted gene 3 also appears to be novel and is tested by two RT–PCR experiments, both of which produce valid RSTs (“hits”). The first experiment validates the TRANSMAP prediction, and the second validates the N-SCAN prediction and one of two Exoniphy predictions. A cDNA cluster is constructed to summarize each set of overlapping cDNAs (including RSTs), and a novel gene fragment (NGF) is constructed by merging the two RSTs that support novel exons (NEs; in red).
Figure 2.
Distributions of distances between nearest mismatches in human–mouse alignments for NGFs vs. CDSs, UTRs, and ncRNAs from RefSeq.
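The statistic plotted in Figure 2 can be illustrated with a short sketch. The function below is a hypothetical helper, not the authors' code: it assumes two aligned sequences of equal length with gaps written as "-", skips gapped columns, and reports the distance between each pair of consecutive mismatches. The example sequences are invented for illustration.

```python
from collections import Counter


def mismatch_distances(human: str, mouse: str) -> list[int]:
    """Distances between consecutive mismatch columns of a pairwise alignment.

    Assumption: both strings are rows of the same alignment (equal length),
    and columns with a gap ("-") in either row are not counted as mismatches.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    positions = [
        i
        for i, (h, m) in enumerate(zip(human.upper(), mouse.upper()))
        if h != "-" and m != "-" and h != m
    ]
    # Difference between each mismatch position and the next one.
    return [b - a for a, b in zip(positions, positions[1:])]


# Invented toy alignment in which substitutions fall on every third base.
human = "ATGGCCAAGTTTGGA"
mouse = "ATGGCTAAATTCGGG"
histogram = Counter(mismatch_distances(human, mouse))
print(histogram)
```

In this toy case every distance is a multiple of three, the kind of periodicity one would expect when substitutions concentrate at third codon positions; real alignments would give a broader distribution.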
Figure 3.
Number of benchmark exons completely supported by at least one cDNA sequence in GenBank as a function of time, and the rate of growth of this number (computed in a 12-mo sliding window). Separate curves are shown for all exons and for exons that overlap annotated CDSs of known genes. Four spikes in growth can be traced to major EST submissions by (1) Adams et al. (1993a, b), (2) Hillier et al. (1996), (3) Adams et al. (1995) and L.D. Hillier and colleagues (“The WashU-Merck EST Project,” unpubl.), and (4) Kimura et al. (2006). The largest spike, between (3) and (4), comes from various sources.
Figure 4.
Hierarchical clustering of over-represented GO categories, based on the NGFs assigned to each category. This dendrogram is derived from a dissimilarity matrix defined such that any two GO categories, X and Y, have dissimilarity 0 when all NGFs assigned to X are also assigned to Y (or vice versa), and dissimilarity 1 when the sets of NGFs assigned to X and Y do not overlap. (Specifically, X and Y have dissimilarity $d_{XY} = 1 - \frac{|N(X) \cap N(Y)|}{\min\{|N(X)|, |N(Y)|\}}$, where $N(C)$ denotes the (nonempty) set of NGFs assigned to GO category C.) As a result, GO categories associated with similar sets of NGFs group together in the dendrogram, even if these categories are not closely related in the GO hierarchy (such as “liver development” and “cell adhesion”). Here, two major groups of related categories are evident, broadly related to motor activity (Group A) and the extracellular region (Group B). (Dendrogram produced by the hclust function in R with method = “average.”)
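The dissimilarity described in this caption (one minus the size of the overlap divided by the size of the smaller set) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names, the category labels, and the NGF identifiers are all hypothetical, and the clustering step itself (done in the paper with R's hclust) is not reproduced here.

```python
from itertools import combinations


def overlap_dissimilarity(x: set[str], y: set[str]) -> float:
    """1 - |X ∩ Y| / min(|X|, |Y|) for two nonempty sets of NGF identifiers."""
    if not x or not y:
        raise ValueError("each GO category must have at least one NGF")
    return 1.0 - len(x & y) / min(len(x), len(y))


def dissimilarity_table(categories: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Pairwise dissimilarities for every unordered pair of categories."""
    return {
        (a, b): overlap_dissimilarity(categories[a], categories[b])
        for a, b in combinations(sorted(categories), 2)
    }


# Hypothetical assignments of NGFs to GO categories (invented identifiers).
categories = {
    "category A": {"ngf_a", "ngf_b", "ngf_c"},
    "category B": {"ngf_a", "ngf_b"},  # subset of A, so dissimilarity 0
    "category C": {"ngf_x", "ngf_y"},  # disjoint from A and B, so dissimilarity 1
}

for pair, d in dissimilarity_table(categories).items():
    print(pair, d)
```

Dividing by the smaller set is what makes a category nested inside another score 0, so a narrow GO term groups with its broader counterpart whenever the same NGFs drive both. The resulting table could then be handed to any average-linkage clustering routine.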
Figure 5.
Gene predictions, cDNA evidence, and novel gene fragments in the region on chromosome 1 that includes ngf51–ngf55. Gene predictions are shown in green, prior cDNA evidence is in black, RSTs (which are represented in GenBank as ESTs) are in gold, and NGFs are in blue, with novel exons colored red. cDNA sequences recently deposited in GenBank (post 1/1/05) and ignored in evaluating novelty are shown in purple. This cluster of NGFs contributes 24 novel exons to a gene that spans >450 kb and consists of an estimated 66 exons. This gene appears to code for a novel axonemal dynein heavy-chain polypeptide.
Figure 6.
Whole-mount in situ hybridization for a zebrafish sequence orthologous to ngf60, showing its expression pattern in the brain 48 h past fertilization (hpf). For comparison, the expression pattern is also shown for OTP, a homeobox transcription factor that was used as a positive control because of its highly specific and well-described expression profile (Eaton and Glasgow 2007). The expression patterns of the two genes remain generally similar at 72 hpf (Supplemental material).


References

    1. Adams M.D., Kerlavage A.R., Fields C., Venter J.C. 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nat. Genet. 1993a;4:256–267.
    2. Adams M.D., Soares M.B., Kerlavage A.R., Fields C., Venter J.C. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat. Genet. 1993b;4:373–380.
    3. Adams M.D., Kerlavage A.R., Fleischmann R.D., Fuldner R.A., Bult C.J., Lee N.H., Kirkness E.F., Weinstock K.G., Gocayne J.D., White O., et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature. 1995;377 (Suppl):3–174.
    4. Arumugam M., Wei C., Brown R.H., Brent M.R. Pairagon+N-scan EST: A model-based gene annotation pipeline. Genome Biol. 2006;7 (Suppl 1):1–10.
    5. Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene Ontology: Tool for the unification of biology. Nat. Genet. 2000;25:25–29.
