Genome Res. 2002 Jul;12(7):1127-34.
doi: 10.1101/gr.75202.

A computer-based method of selecting clones for a full-length cDNA project: simultaneous collection of negligibly redundant and variant cDNAs


Naoki Osato et al. Genome Res. 2002 Jul.

Abstract

We describe a computer-based method that selects representative clones for full-length sequencing in a full-length cDNA project. Our method classifies end sequences using two kinds of criteria: grouping and clustering. Grouping places together variant cDNAs, family genes, and cDNAs with sequencing errors. Clustering separates those cDNA clones into distinct clusters. The full-length sequences of the clones selected by grouping are determined preferentially, and then the sequences of the clones selected by clustering are determined. Grouping reduced the number of rice cDNA clones for full-length sequencing to 21% and mouse cDNA clones to 25%. Rice full-length sequences selected by grouping showed a 1.07-fold redundancy. Mouse full-length sequences showed a 1.04-fold redundancy, a reduction of approximately 30% relative to selection using our previous method. To estimate the coverage of unique genes, we used FANTOM (Functional Annotation of RIKEN Mouse cDNA Clones) clusters. Grouping covered almost all unique genes (93% of FANTOM clusters), and clustering covered all genes. Therefore, our method is useful for the selection of appropriate representative clones for full-length sequencing, thereby greatly reducing the cost, labor, and time necessary for this process.
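The grouping step can be pictured as single-linkage merging of clones whose end sequences are sufficiently similar. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes pairwise hits have already been produced by a homology search, and the `Hit` record, the `group_clones` function, and the default thresholds (90% identity and 30 bp overlap, borrowed from the Figure 4 caption) are hypothetical choices made for this illustration. Picking the smallest clone ID as the representative is also an arbitrary assumption.

```python
from collections import namedtuple

# Hypothetical record for one pairwise homology-search hit between two clones.
Hit = namedtuple("Hit", ["clone_a", "clone_b", "identity", "overlap"])


def group_clones(clone_ids, hits, min_identity=90.0, min_overlap=30):
    """Single-linkage grouping of clones using a union-find structure."""
    parent = {c: c for c in clone_ids}

    def find(c):
        # Walk to the root, compressing the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for hit in hits:
        # Merge two clones only when the hit passes both thresholds.
        if hit.identity >= min_identity and hit.overlap >= min_overlap:
            root_a, root_b = find(hit.clone_a), find(hit.clone_b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for c in clone_ids:
        groups.setdefault(find(c), []).append(c)
    # One representative per group; the smallest ID is an arbitrary choice.
    return {min(members): sorted(members) for members in groups.values()}


# Toy input: four clones and three hits with invented values.
clones = ["c1", "c2", "c3", "c4"]
hits = [
    Hit("c1", "c2", 97.5, 120),  # passes both thresholds
    Hit("c2", "c3", 85.0, 200),  # identity too low
    Hit("c3", "c4", 99.0, 10),   # overlap too short
]
grouped = group_clones(clones, hits)
```

In this toy input only `c1` and `c2` are merged, so three representatives remain out of four clones. The keys of `grouped` are the clones that would be sent for full-length sequencing first.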


Figures

Figure 1
Flow chart of two-step classification of end sequences. After the determination of the 5′ and/or 3′ end sequences, we classify them on the basis of two distinct criteria: (A) grouping and (B) clustering. Each classification is performed separately and consists of three steps: preprocessing, grouping or clustering, and selection of the representative clones. The criteria of each step are determined as stated in the Methods section.
Figure 2
Multiple alignment, produced with the ClustalX program (Thompson et al. 1997), of full-length cDNAs that belong to the same group and have the same functional annotation. (A) This group contains four cDNAs of the histone 4 protein (AK016310, AK007642, AK011560, and AK010085). These sequences have a homologous region and a variable region, so they were placed together in grouping but separated into distinct clusters. When we compared them with the mouse draft genome sequence, each sequence matched a distinct locus, indicating that they were derived from distinct genes. In grouping, one clone is selected as the representative, whereas in clustering all four clones are selected as representatives. (B) This group contains four cDNAs of the acidic ribosomal phosphoprotein P0 (AK002315, AK009767, AK010267, and AK012606). The 3′ ends of the upper two cDNA sequences differ in length by ≤20 bp; therefore, these sequences are regarded as the same cDNA in clustering. These two sequences matched the same locus on the mouse genome. However, the 3′ ends of the other two cDNA sequences differ in length by >20 bp, so these sequences separate into distinct clusters. These sequences may match another locus on the mouse genome. In grouping, one clone is selected as the representative, whereas in clustering three clones are selected as representatives.
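The 3′ end rule in panel B can be illustrated with a short sketch. This covers only the ≤20 bp length criterion mentioned in the caption, not the full clustering procedure, and it assumes that each clone's 3′ end has already been reduced to a single coordinate in a shared alignment. The function name `split_by_three_prime_end`, the chaining rule (a new cluster starts when the gap to the previous end exceeds the limit), and all clone IDs and coordinates are hypothetical.

```python
def split_by_three_prime_end(end_positions, max_diff=20):
    """Split one group into clusters by 3' end position.

    end_positions maps a clone ID to its 3' end coordinate (bp) in a shared
    alignment. Clones are sorted by coordinate, and a new cluster starts
    whenever the gap to the previous clone exceeds max_diff.
    """
    ordered = sorted(end_positions.items(), key=lambda item: (item[1], item[0]))
    clusters = []
    previous_end = None
    for clone_id, end in ordered:
        if previous_end is None or end - previous_end > max_diff:
            clusters.append([])
        clusters[-1].append(clone_id)
        previous_end = end
    return clusters


# Invented coordinates for four hypothetical clones in one group.
ends = {"clone_a": 1000, "clone_b": 1012, "clone_c": 1060, "clone_d": 1150}
clusters = split_by_three_prime_end(ends)
```

With these invented coordinates, `clone_a` and `clone_b` differ by 12 bp and share a cluster, while the other two clones each form their own cluster. That gives three clusters, which mirrors the pattern of one representative under grouping and three under clustering.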
Figure 3
Increase in the number of groups and clusters of 3′ end sequences in the rice full-length cDNA project. As more end sequences were determined, the number of novel groups and clusters continued to increase, but the rate of increase gradually declined.
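A curve of this kind can be computed by counting how many distinct groups have been seen after each newly determined end sequence. The sketch below is only an illustration: the function name `novel_group_curve` and the group labels are hypothetical, and it assumes each end sequence has already been assigned to a group.

```python
def novel_group_curve(group_labels):
    """Return the cumulative number of distinct groups after each sequence."""
    seen = set()
    curve = []
    for label in group_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


# Invented group assignments, in the order the end sequences were determined.
labels = ["g1", "g2", "g1", "g3", "g1", "g2", "g3", "g4"]
curve = novel_group_curve(labels)
```

The curve flattens wherever a new sequence falls into a group that has already been seen, which is the behavior the caption describes as a declining rate of increase.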
Figure 4
Determination of the identity and overlap-length criteria for grouping. To determine these criteria, we classified 213,404 mouse 3′ end sequences on the basis of homology searches using FASTA software (Pearson and Lipman 1988). (A) Clones whose end sequences were more similar than the identity threshold were placed together. Here, the identity threshold varied from 80% to 98%. The number of resulting groups increased as the identity threshold rose above 90%; therefore, an identity threshold of 90% is appropriate for placing similar sequences together. (B) Clones whose end sequences overlapped by more than the overlap threshold were placed in the same group. Here, the overlap threshold varied from 20 to 200 bp. The number of resulting groups was almost constant between 30 and 150 bp; therefore, an overlap threshold of 30 to 150 bp is appropriate for grouping.
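The threshold scan in this figure amounts to repeating the grouping at several cutoffs and counting the resulting groups. The sketch below shows the idea on toy data. It is not the authors' procedure: the function name `count_groups`, the tuple layout of the hits, and every value in the toy input are assumptions made for illustration, and groups are taken to be connected components of the graph whose edges are hits passing both thresholds.

```python
def count_groups(clone_ids, hits, min_identity, min_overlap):
    """Count connected components when hits passing both thresholds are edges."""
    parent = {c: c for c in clone_ids}

    def find(c):
        # Walk to the root, compressing the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b, identity, overlap in hits:
        if identity >= min_identity and overlap >= min_overlap:
            parent[find(b)] = find(a)
    return len({find(c) for c in clone_ids})


# Toy hits: (clone_a, clone_b, percent identity, overlap length in bp).
clones = ["c1", "c2", "c3", "c4", "c5"]
hits = [
    ("c1", "c2", 99.0, 150),
    ("c2", "c3", 93.0, 90),
    ("c3", "c4", 88.0, 60),
    ("c4", "c5", 96.0, 25),
]

# Vary one threshold at a time while holding the other fixed.
identity_sweep = {t: count_groups(clones, hits, t, 30) for t in (80, 90, 95, 98)}
overlap_sweep = {o: count_groups(clones, hits, 90, o) for o in (20, 30, 100, 200)}
```

Plotting the group count against each threshold and looking for a plateau or an inflection is one way to pick cutoffs, in the spirit of panels A and B.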

