BMC Biotechnol. 2012 May 3;12:16.
doi: 10.1186/1472-6750-12-16.

Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data

Sun Zhou et al. BMC Biotechnol. 2012.

Abstract

Background: Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be "unclean". Identifying cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors in the wet-lab procedures for cDNA library construction.

Results: After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors, such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; and the presence of 3'-end terminal structures in designated 5'-end sequences or vice versa. Close examination of these artifacts shows that many problematic ESTs deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST, http://code.google.com/p/afst/) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found that 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are "unclean" or abnormal, all of which could be cleaned or filtered by AFST.

Conclusions: cDNA terminal pattern analysis, as implemented in the AFST software tool, can be used to reveal wet-lab errors such as restriction enzyme cutting abnormalities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, and improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, thereby greatly benefiting downstream EST-based applications.

Figures

Figure 1
The expanded definitions of cDNA terminal structures. The original four canonical cDNA termini – 5TSS, 3TSS, 5TNS and 3TNS [12] have been expanded by adding some sub-categories.
Figure 2
The expected construction of cDNA insertion and all types of Restriction Enzyme Cutting Abnormality (RECA). The label “Expected” denotes the expected construction of the cDNA library. Sequencing direction is indicated as 3′ or 5′ with an arrow. VF1 (Vector fragment 1) and VF2 (Vector fragment 2) refer to the left and right vector borders of the cloning sites. A, B, C, D, E and F are special types of RECA, defined as follows: RECA-Type A: The EcoRI site is cut but the XhoI site remains intact. A1: cDNA is inserted with inversion; A2: cDNA is inserted without inversion; A3: Adapter/linker fragments are inserted. RECA-Type B: The XhoI site is cut but the EcoRI site remains intact. B1: cDNA is inserted with inversion; B2: cDNA is inserted without inversion. RECA-Type C: Neither of the two enzyme sites is cut. RECA-Type D: Both enzyme sites are cut, but the excised vector fragment remains. RECA-Type E: XhoI cuts the vector at the wrong site. RECA-Type F: EcoRI cuts the vector at the wrong site. Yellow indicates an EcoRI recognition site or EcoRI sticky end. Brown indicates an XhoI recognition site or XhoI sticky end. Blue represents the plasmid vector. Dark green denotes an adapter/linker fragment. cDNA insert direction is represented by a red gradient: the cDNA sense strand runs from deep red to light red, whereas the cDNA non-sense strand runs from light red to deep red.
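The RECA definitions in the Figure 2 caption can be read as a small decision table. The following is a minimal sketch of that reading, not the AFST implementation: the `SiteObservation` fields and the `classify_reca` function are hypothetical names, the order in which the checks are applied is an assumption, and the sub-types (A1, A2, A3, B1, B2) are not distinguished because they depend on insert orientation, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class SiteObservation:
    """Hypothetical summary of what was detected in one raw EST."""
    ecori_cut: bool                          # EcoRI site was cut
    xhoi_cut: bool                           # XhoI site was cut
    excised_fragment_present: bool = False   # vector piece between the sites remains
    ecori_wrong_position: bool = False       # EcoRI cut the vector at the wrong site
    xhoi_wrong_position: bool = False        # XhoI cut the vector at the wrong site


def classify_reca(obs: SiteObservation) -> str:
    """Map an observation to a RECA type following the Figure 2 caption.

    Checking wrong-position cuts first is an assumption of this sketch.
    """
    if obs.xhoi_wrong_position:
        return "RECA-Type E"
    if obs.ecori_wrong_position:
        return "RECA-Type F"
    if not obs.ecori_cut and not obs.xhoi_cut:
        return "RECA-Type C"
    if obs.ecori_cut and not obs.xhoi_cut:
        return "RECA-Type A"
    if obs.xhoi_cut and not obs.ecori_cut:
        return "RECA-Type B"
    if obs.excised_fragment_present:
        return "RECA-Type D"
    return "Expected"


if __name__ == "__main__":
    print(classify_reca(SiteObservation(ecori_cut=True, xhoi_cut=False)))
```

In practice the boolean fields would have to be derived from the read itself, for example by searching for vector border fragments and recognition sites; that detection step is outside the scope of this sketch.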
Figure 3
Detailed illustration of two sub-categories of Type A Restriction Enzyme Cutting Abnormality (RECA-Type A). RECA-Type A indicates that the EcoRI site of the vector is cut whereas the XhoI site is kept. In A1 the cDNA is inserted with inversion; in A2 it is inserted without inversion. Because XhoI and EcoRI sticky ends cannot be ligated cleanly, a random sequence fragment has been detected between the vector and the cDNA end. Blue stands for the plasmid vector, yellow for EcoRI, brown for XhoI, red for cDNA, gray for a random sequence fragment, pink for Adapter1, and green either for poly(A) in the sense strand of cDNA or for poly(T) in the non-sense strand of cDNA.
Figure 4
Schematic view of double-termini adapters showing two types of concatenation.
Figure 5
Snapshots of AFST user interfaces. a: The main interface allows users to upload their sequences, specify relevant information about vector and adapter/linker sequences, initiate data processing, and obtain tabular results showing abnormalities. b: Details of a normal sequence. The high-quality region between 5TNS-4 (from 2 to 62, marked with blue and green) and 3TNS (from 900 to 926, marked with pink, yellow and blue) is the final clean sequence (i.e., the region with a light red background). The color legend can be found by clicking ‘color table’. c: Details of an abnormal sequence. This sequence has a RECA abnormality (RECA-Type A1), where the double-stranded cDNA insert is inverted in its orientation and inserted into the double-stranded plasmid vector after enzyme digestion. The vector sequence region between 5TNS-2 (highlighted with blue and brown) and 5TSS-1 (highlighted with yellow and pink) is the part that should, in theory, have been cut off by enzyme digestion.
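The normal sequence in panel b suggests how a clean insert might be extracted once the two terminal structures have been located. The sketch below is illustrative only and is not taken from AFST: `clean_region` is a hypothetical helper, and it assumes the reported coordinates are 1-based and inclusive and that both terminal structures are excluded from the clean sequence. It does not model quality-based trimming.

```python
def clean_region(seq: str, five_prime_end: int, three_prime_start: int) -> str:
    """Return the bases strictly between two terminal structures.

    five_prime_end    -- 1-based position of the last base of the 5' terminus
    three_prime_start -- 1-based position of the first base of the 3' terminus
    (Coordinate convention is an assumption of this sketch.)
    """
    if not (0 <= five_prime_end < three_prime_start <= len(seq) + 1):
        raise ValueError("terminus coordinates are out of order or out of range")
    # 1-based positions five_prime_end+1 .. three_prime_start-1
    # correspond to the 0-based slice [five_prime_end : three_prime_start-1].
    return seq[five_prime_end:three_prime_start - 1]


if __name__ == "__main__":
    # Placeholder read of 926 bases; the real sequence is not given in the source.
    read = "N" * 926
    insert = clean_region(read, five_prime_end=62, three_prime_start=900)
    print(len(insert))
```

Under these assumptions, the panel b coordinates (5TNS-4 ending at 62, 3TNS starting at 900) would leave positions 63 to 899 as the clean region.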
