Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2007 Nov 13:8:416.
doi: 10.1186/1471-2164-8-416.

An optimized procedure greatly improves EST vector contamination removal

Affiliations

An optimized procedure greatly improves EST vector contamination removal

Yi-An Chen et al. BMC Genomics. .

Abstract

Background: The enormous amount of sequence data available in the public domain database has been a gold mine for researchers exploring various themes in life sciences, and hence the quality of such data is of serious concern to researchers. Removal of vector contamination is one of the most significant operations to obtain accurate sequence data containing only a cDNA insert from the basecalls output by an automatic DNA sequencer. Popular bioinformatics programs to accomplish vector trimming include LUCY, cross_match and SeqClean.

Results: In a recent study, where the program SeqClean was used to remove vector contamination from our test set of EST data compiled through various library construction systems, however, a significant number of errors remained after preliminary trimming. These errors were later almost completely corrected by simply using a re-linearized form of the cloning vector to compare against the target ESTs. The modified trimming procedure for SeqClean was also compared with the trimming efficiency of the other two popular programs, LUCY2, and cross_match. Using SeqClean with a re-linearized form of the cloning vector significantly surpassed the other two programs in all tested conditions, while the performance of the other two programs was not influenced by the modified procedure. Vector contamination in dbEST was also investigated in this study: 2203 out of the 48212 ESTs sampled from dbEST (2007-04-18 freeze) were found to match sequences in UNIVEC.

Conclusion: Vector contamination remains a serious concern to the data quality in the public sequence database nowadays. Based on the results presented here, we feel that our modified procedure with SeqClean should be recommended to all researchers for the task of vector removal from EST or genomic sequences.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Illustration of some trimming details. The shaded area highlights the range covering the 30% from either end of the EST. According to the original SeqClean design, the vector contaminant is recognized only if some or all of the similar vector sequence is identified within this range. The boxes in blue indicate the vector-derived sequence. The yellow open boxes represent cDNA inserts and the green bars show the low quality regions. The small stars indicate where the number 1 base is located by CVS coordinates. The boxes in red specify the product of SeqClean trimming. Comments for each of the three listed trimming situations are denoted to their right. Condition A indicates those ESTs which were mistakenly trashed. Condition B shows incomplete trimming and condition C is an example of correct trimming. Example ESTs corresponding to each of the three conditions are shown in the table below, where the position numbering followed the coordinates of the untrimmed EST sequences.
Figure 2
Figure 2
Re-linearization of vector pGEM-T at its cloning site. A. Simplified map of vector pGEM-T. The insert DNA of interest was cloned into position between bases 60 and 61. The primers were introduced with the DNA insert during cDNA preparation carried out in a wet lab. B. Vector sequence of pGEM-T before and after re-linearization Bases 1–198 and 2899–3015 were expressed. The omitted nucleotides are expressed as dotted lines. Additional nucleotides TA (colored in blue) were appended to the vector at position 60 during a wet-lab experimental procedure. The letters in pink boxes (bases 1 ~ 60 plus the appended T) were moved electronically to the end of the sequence for vector re-linearization.

Similar articles

Cited by

References

    1. Bork P, Bairoch A. Go hunting in sequence databases but watch out for the traps. Trends Genet. 1996;12:425–427. doi: 10.1016/0168-9525(96)60040-7. - DOI - PubMed
    1. Colleagues CTGoBMa Quality control in databanks for molecular biology. Bioessays. 2000;22:1024–1034. doi: 10.1002/1521-1878(200011)22:11<1024::AID-BIES9>3.0.CO;2-W. - DOI - PubMed
    1. Seluja GA, Farmer A, McLeod M, Harger C, Schad PA. Establishing a method of vector contamination identification in database sequences. Bioinformatics. 1999;15:106–110. doi: 10.1093/bioinformatics/15.2.106. - DOI - PubMed
    1. Lamperti ED, Kittelberger JM, Smith TF, Villa-Komaroff L. Corruption of genomic databases with anomalous sequence. Nucleic Acids Res. 1992;20:2741–2747. doi: 10.1093/nar/20.11.2741. - DOI - PMC - PubMed
    1. Korning PG, Hebsgaard SM, Rouze P, Brunak S. Cleaning theGenBank Arabidopsis thaliana data set. Nucleic Acids Res. 1996;24:316–320. doi: 10.1093/nar/24.2.316. - DOI - PMC - PubMed