Genome Res. 1998 Mar;8(3):276-90.
doi: 10.1101/gr.8.3.276.

Alternative gene form discovery and candidate gene selection from gene indexing projects

J Burke et al. Genome Res. 1998 Mar.

Abstract

Several efforts are under way to partition single-read expressed sequence tag (EST), as well as full-length transcript data, into large-scale gene indices, where transcripts are in common index classes if and only if they share a common progenitor gene. Accurate gene indexing facilitates gene expression studies, as well as inexpensive and early gene sequence discovery through assembly of ESTs that are derived from genes that have not been sequenced by classical methods. We extend, correct, and enhance the information obtained from index groups by splitting index classes into subclasses based on sequence dissimilarity (diversity). Two applications of this are highlighted in this report. First it is shown that our method can ameliorate the damage that artifacts, such as chimerism, inflict on index integrity. Additionally, we demonstrate how the organization imposed by an effective subpartition can greatly increase the sensitivity of gene expression studies by accounting for the existence and tissue- or pathology-specific regulation of novel gene isoforms and polymorphisms. We apply our subpartitioning treatment to the UniGene gene indexing project to measure a marked increase in information quality and abundance (in terms of assembly length and insertion/deletion error) after treatment and demonstrate cases where new levels of information concerning differential expression of alternate gene forms, such as regulated alternative splicing, are discovered. [Tables 2 and 3 can be viewed in their entirety as Online Supplements at http://www.genome.org.]
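The idea of splitting an index class into subclasses by sequence dissimilarity can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not spell out. It assumes the sequences in a cluster are already aligned to equal length, and it uses a simple greedy rule with a hypothetical mismatch threshold. The names `dissimilarity`, `subpartition`, and `threshold` are illustrative only.

```python
def dissimilarity(a, b):
    """Fraction of mismatched columns between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped. Returns 0.0 when no column can be compared.
    """
    compared = 0
    mismatched = 0
    for x, y in zip(a, b):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            mismatched += 1
    return mismatched / compared if compared else 0.0


def subpartition(aligned, threshold=0.1):
    """Greedily split one cluster into subgroups (illustrative assumption).

    Each sequence joins the first subgroup whose first member is within
    `threshold` dissimilarity; otherwise it starts a new subgroup.
    """
    subgroups = []  # each subgroup is a list of (name, sequence) pairs
    for name, seq in aligned.items():
        for group in subgroups:
            representative = group[0][1]
            if dissimilarity(seq, representative) <= threshold:
                group.append((name, seq))
                break
        else:
            subgroups.append([(name, seq)])
    return [[name for name, _ in group] for group in subgroups]


# Hypothetical toy cluster: two ESTs per gene form.
cluster = {
    "est1": "ACGTACGTAC",
    "est2": "ACGTACGTAC",
    "est3": "ACGTTTTTTT",
    "est4": "ACGTTTTTTA",
}
print(subpartition(cluster))
```

In this toy input the first two sequences are identical and the last two differ from them at half of the columns, so they fall into separate subgroups. A real pipeline would need a proper multiple alignment and a more careful treatment of sequencing error than a fixed threshold.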

Figures

Figure 1
Schematic of UniGene processing.
Figure 2
Example of a CRAW report (text format) for a cluster afflicted with EST data derived from a possibly chimeric clone. The 34 sequences shown can be represented as two consensus sequences and an outlier sequence without information loss. The cluster was automatically partitioned into two consistent subgroups, with one sequence (GenBank accession no. AA015595) shown as inconsistent with both subgroups. Similarity searching with BLAST against the NCBI nonredundant database indicates that the first subgroup, consisting of sequences representing GenBank accession nos. N20971 to AA076342, is highly similar to the 3′ end of mouse mRNA for talin. The second subgroup, sequences T90923 to AA136000, is identical to the coding region of human tubulin α-6 chain. Sequence AA015595 is a putative chimeric sequence within which 110 bases (contained in the first 5 positions of the CRAW report) are highly similar to the 3′ UTR of talin mRNA. The rest of the sequence is highly similar to tubulin.
Figure 3
Examples of clusters from UniGene101 possibly containing clone reversal errors. (A) A cluster of retina-specific ESTs contains a clone reversal error. (B) A cluster containing ESTs that overlap the human nuclear factor I (NFI) gene. More information is needed to decide whether this cluster contains a clone inversion event or is an example of genes overlapping on opposite strands.
Figure 4
(A) Length of consensus sequences resulting from CRAW assembly/analysis on 5′ and 3′ ESTs from UniGene98 after CRAW processing. The x-axis denotes the number of sequences in the UniGene cluster; the y-axis represents consensus length. By forming an assembly with between 10 and 15 ESTs the length of the resulting contig can be doubled on average. Assemblies made from clusters containing >45 ESTs result in contigs that are 400% longer than unassembled sequences. The effective assembly length approaches the actual gene length in UniGene101: the sequences classified as multipass/full-length have an average length (♦) of 2102 and a median length (▴) of 1695 bases. (B) Length of the maximal ORF was measured after performing CRAW assembly/analysis on 5′ ESTs from UniGene98 clusters. The longest ORF of the resulting consensus sequence (in residues) is plotted against the number of 5′ sequences in the cluster. The axes are as in A. The effective ORF size generated from EST fragments easily surpasses 50% of the full-length gene maximal ORF length: the sequences classified as multipass or full-length in UniGene101 have an average maximal ORF length (♦) of 478 residues and a median length (▴) of 367 residues. The improvement shown is the result of both assembly of ESTs into longer contigs and the correction of insertion and deletion errors using sequence redundancy.
Figure 5
A set of ESTs from UniGene101 that sample an alternatively spliced gene. CRAW report for a UniGene101 cluster of 30 transcripts that can be represented as four consensus sequences and five outliers without information loss. Four sequences, representing GenBank accession nos. R36192, R36098, R63398, and R63347, were deleted for brevity. The cluster contains five full-length mRNAs, each corresponding to a different splice form of RBP-MS, a gene sharing homology to the RNA-binding domain of the Drosophila couch potato gene. One mRNA, corresponding to type 4 RBP-MS, is not sampled by any ESTs, and splice forms 2 and 3 appear to be constitutively expressed. Black lines indicate gaps in the multiple alignment, black bars indicate indeterminate sequence, red bars indicate divergence from the subgroup consensus, and all other colors indicate discrete domains of sequence similarity.
Figure 6
CRAW output (Java version) for a UniGene101 cluster with a cancer-specific alternative gene form. Subgroup 1 (in green) ESTs are identical to CAMK. The green regions of subgroup 2 are identical to subgroup 1 sequences, and the blue region of subgroup 2 diverges. This is an example of how effective subpartitioning can add sensitivity to gene expression specificity studies. If one were to seek cancer-specific genes by looking at the original UniGene index cluster, the level of specificity to cancer libraries would be only ∼53% = (100 × 9/17)%; however, at the subcluster level, the specificity level is 100%.
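The specificity arithmetic in the Figure 6 caption can be reproduced with a short sketch. The caption gives only the totals (9 of 17 ESTs at the cluster level, 100% at the subcluster level). The split used below, 8 non-cancer ESTs in subgroup 1 and 9 cancer ESTs in subgroup 2, is an assumption chosen to be consistent with those totals, and the `specificity` helper and the library labels are hypothetical.

```python
def specificity(labels, target):
    """Percentage of ESTs in a group whose source-library label equals target."""
    if not labels:
        raise ValueError("group is empty")
    hits = sum(1 for label in labels if label == target)
    return 100.0 * hits / len(labels)


# Assumed split, consistent with the caption's 9/17 cluster-level figure.
subgroup1 = ["normal"] * 8   # assumption: ESTs matching the known gene form
subgroup2 = ["cancer"] * 9   # assumption: ESTs carrying the divergent region
cluster = subgroup1 + subgroup2

print(f"cluster level:    {specificity(cluster, 'cancer'):.0f}%")
print(f"subcluster level: {specificity(subgroup2, 'cancer'):.0f}%")
```

Measured over the whole cluster, the cancer signal is diluted by ESTs from the other gene form. Measured within the subgroup alone, it is not.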
