d2_cluster: a validated method for clustering EST and full-length cDNA sequences

J Burke¹, D Davison, W Hide

PMID: 10568753
PMCID: PMC310833
DOI: 10.1101/gr.9.11.1135

J Burke et al. Genome Res. 1999 Nov;9(11):1135-42.

Affiliation

¹ Pangea Systems, Oakland, California 94612, USA. jburke@pangeasystems.com

Abstract

Several efforts are under way to condense single-read expressed sequence tags (ESTs) and full-length transcript data on a large scale by means of clustering or assembly. One goal of these projects is the construction of gene indices where transcripts are partitioned into index classes (or clusters) such that they are put into the same index class if and only if they represent the same gene. Accurate gene indexing facilitates gene expression studies and inexpensive and early partial gene sequence discovery through the assembly of ESTs that are derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing. We describe d2_cluster, an agglomerative algorithm for rapidly and accurately partitioning transcript databases into index classes by clustering sequences according to minimal linkage or "transitive closure" rules. We then evaluate the relative efficiency of d2_cluster with respect to other clustering tools. UniGene is chosen for comparison because of its high quality and wide acceptance. It is shown that although d2_cluster and UniGene produce results that are between 83% and 90% identical, the joining rate of d2_cluster is between 8% and 20% greater than that of UniGene. Finally, we present the first published rigorous evaluation of under- and over-clustering (in other words, of type I and type II errors) of a sequence clustering algorithm, although the existence of highly identical gene paralogs means that care must be taken in the interpretation of the type II error. Upper bounds for these d2_cluster error rates are estimated at 0.4% and 0.8%, respectively. In other words, the sensitivity and selectivity of d2_cluster are estimated to be >99.6% and >99.2%, respectively.
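The "transitive closure" rule mentioned in the abstract can be illustrated with a small sketch. This is not the d2_cluster implementation: it assumes that pairwise links between sequences have already been decided by some similarity criterion (not modeled here), and it only shows how such links are closed transitively into index classes using a union-find structure. The function name `transitive_closure_clusters` and the sequence identifiers are hypothetical.

```python
from collections import defaultdict


def transitive_closure_clusters(ids, linked_pairs):
    """Partition ids so that any two ids joined by a chain of links share a cluster.

    `linked_pairs` stands in for the output of a pairwise similarity test;
    how that test is performed is an assumption left outside this sketch.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the cluster root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # one link is enough to merge two clusters

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers: est1 and mrna1 are never compared directly,
# but both link to est2, so they end up in the same index class.
ids = ["est1", "est2", "est3", "est4", "mrna1"]
links = [("est1", "est2"), ("est2", "mrna1"), ("est3", "est4")]
clusters = transitive_closure_clusters(ids, links)
print(clusters)
```

Because a single link suffices to merge two clusters, this rule joins aggressively, which is consistent with the abstract's caution that highly identical paralogs can end up in the same class.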

Figures

**Figure 1**
Subsetting comparison of UniGene and d2_cluster. Cluster equivalence means that all elements in one cluster are also present in the other cluster, and vice versa. Out of 14,989 (= 15,226 − 237) original UniGene clusters and 13,755 d2 clusters, 12,389 (or 83% of UniGene clusters and 90% of d2 clusters) are equivalent. Two hundred thirty-seven UniGene clusters were not considered in the analysis because they were composed of sequences that were screened out in our vector and repetitive elements screening stage.
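The equivalence criterion and the percentages in the Figure 1 caption can be checked with a short sketch. The toy clusterings and the function name `equivalent_clusters` are hypothetical; only the counts (15,226, 237, 13,755, and 12,389) come from the caption.

```python
def equivalent_clusters(clustering_a, clustering_b):
    """Return the clusters whose membership is identical in both clusterings."""
    set_a = {frozenset(cluster) for cluster in clustering_a}
    set_b = {frozenset(cluster) for cluster in clustering_b}
    return set_a & set_b


# Toy data (hypothetical): only the {s1, s2} cluster matches exactly.
unigene_like = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
d2_like = [{"s1", "s2"}, {"s3", "s4", "s5"}]
shared = equivalent_clusters(unigene_like, d2_like)

# Counts taken from the Figure 1 caption.
unigene_total = 15226 - 237   # clusters remaining after screening
d2_total = 13755
equivalent = 12389

pct_unigene = round(100 * equivalent / unigene_total)
pct_d2 = round(100 * equivalent / d2_total)
print(unigene_total, pct_unigene, pct_d2)
```

The same count of equivalent clusters yields two different percentages because the denominators differ: d2_cluster produces fewer clusters overall, so a larger share of them is equivalent.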

**Figure 2**
(A) CRAW report (Burke et al. 1998) for a cluster formed by d2_cluster that contains two UniGene clusters: Rn.8 and Rn.3110. (B) (Available as an online supplement to this paper at www.genome.org and at the authors' web site at www.pangeasystems.com) Interleaved sequence alignment shows a >300-bp region of near-perfect match.

**Figure 3**
(A) CRAW report for a d2 cluster containing isozymes of mouse cytochrome P-450. Seven UniGene clusters (Rn.10843, Rn.3586, Rn.18603, Rn.10842, Rn.9104, Rn.11043, and Rn.15544) are merged. (B) (Online supplement available at www.genome.org and www.pangeasystems.com) Interleaved multiple alignment showing a region of 240 bases with high identity alignment between all four cluster assemblies. d2_cluster has put all of these sequences together because of regions of high identity (as seen in Fig. 3B). UniGene has separated isozymes into distinct clusters, although UniGene clusters Rn.18603, Rn.10842, and Rn.9104 should probably form a single cluster according to reasonable clustering rules due to their perfect assembly into subgroup 1 and high overlap.

**Figure 4**
Alternative splice forms of the RAD1/REC1 gene are placed in the same cluster by d2_cluster, and the splice variants are separated into distinct subclusters by CRAW.
