BMC Genomics. 2007 Nov 13;8:416.

doi: 10.1186/1471-2164-8-416.

An optimized procedure greatly improves EST vector contamination removal

Yi-An Chen¹, Chang-Chun Lin, Chin-Di Wang, Huan-Bin Wu, Pei-Ing Hwang

PMID: 17997864
PMCID: PMC2194723
DOI: 10.1186/1471-2164-8-416

Yi-An Chen et al. BMC Genomics. 2007.

Affiliation

¹ Bioinformatics Core Laboratory, Agricultural Biotechnology Research Center, Academia Sinica, Taipei, Taiwan. chenyian@gate.sinica.edu.tw

Abstract

Background: The enormous amount of sequence data available in public-domain databases has been a gold mine for researchers exploring various themes in the life sciences, and the quality of such data is therefore of serious concern. Removal of vector contamination is one of the most important steps in obtaining accurate sequence data containing only the cDNA insert from the basecalls output by an automatic DNA sequencer. Popular bioinformatics programs for vector trimming include LUCY, cross_match and SeqClean.

Results: In a recent study in which SeqClean was used to remove vector contamination from our test set of EST data, compiled through various library construction systems, a significant number of errors remained after preliminary trimming. These errors were later almost completely corrected simply by comparing the target ESTs against a re-linearized form of the cloning vector. The modified trimming procedure for SeqClean was also compared with the other two popular programs, LUCY2 and cross_match. SeqClean with a re-linearized form of the cloning vector significantly outperformed the other two programs in all tested conditions, while the performance of the other two programs was not influenced by the modified procedure. Vector contamination in dbEST was also investigated in this study: 2203 of the 48212 ESTs sampled from dbEST (2007-04-18 freeze) were found to match sequences in UNIVEC.

Conclusion: Vector contamination remains a serious concern for data quality in public sequence databases. Based on the results presented here, we recommend our modified procedure with SeqClean to all researchers for the task of vector removal from EST or genomic sequences.

Figures

**Figure 1**
**Illustration of some trimming details**. The shaded area highlights the range covering 30% of the EST length from either end. According to the original SeqClean design, the vector contaminant is recognized only if some or all of the similar vector sequence is identified within this range. The boxes in blue indicate the vector-derived sequence. The yellow open boxes represent cDNA inserts and the green bars show the low quality regions. The small stars indicate where the number 1 base is located by CVS coordinates. The boxes in red specify the product of SeqClean trimming. Comments on each of the three listed trimming situations are given to their right. Condition A indicates ESTs that were mistakenly trashed. Condition B shows incomplete trimming and condition C is an example of correct trimming. Example ESTs corresponding to each of the three conditions are shown in the table below, where the position numbering follows the coordinates of the untrimmed EST sequences.
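The 30% rule described in this caption can be sketched in a few lines of Python. This is only an illustration of the rule as worded above, not SeqClean's actual code: the function name `hit_in_end_window`, the 1-based inclusive coordinates, and the integer rounding of the window size are all assumptions.

```python
def hit_in_end_window(est_len: int, hit_start: int, hit_end: int,
                      percent: int = 30) -> bool:
    """Return True if a vector hit touches the terminal window at either end.

    Coordinates are assumed to be 1-based and inclusive on the untrimmed EST.
    The window size is rounded down; the real tool may round differently.
    """
    if not 1 <= hit_start <= hit_end <= est_len:
        raise ValueError("hit coordinates must lie within the EST")
    window = est_len * percent // 100          # bases in each terminal window
    in_left = hit_start <= window              # overlaps bases 1..window
    in_right = hit_end >= est_len - window + 1 # overlaps the last `window` bases
    return in_left or in_right


# Hypothetical 1000-base EST: the terminal windows are 1..300 and 701..1000.
print(hit_in_end_window(1000, 250, 350))  # overlaps the left window
print(hit_in_end_window(1000, 400, 600))  # lies entirely in the middle
```

Under this reading, a vector match that lies wholly in the middle of a read is not recognized as contamination.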

**Figure 2**
**Re-linearization of vector pGEM-T at its cloning site**. A. Simplified map of vector pGEM-T. The insert DNA of interest was cloned between bases 60 and 61. The primers were introduced with the DNA insert during cDNA preparation carried out in a wet lab. B. Vector sequence of pGEM-T before and after re-linearization. Bases 1–198 and 2899–3015 are shown; the omitted nucleotides are represented by dotted lines. Additional nucleotides TA (colored in blue) were appended to the vector at position 60 during a wet-lab experimental procedure. The letters in pink boxes (bases 1–60 plus the appended T) were moved electronically to the end of the sequence for vector re-linearization.
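The electronic re-linearization described in this caption amounts to rotating the vector sequence at the cloning site, so that the bases flanking the insert end up at the two ends of the vector entry. A minimal Python sketch follows. The function name `relinearize` and its parameters are hypothetical, and the handling of the appended nucleotides (a tail added to the left arm, an optional head added to the right arm) is an assumption based only on the caption.

```python
def relinearize(vector: str, cloning_site: int,
                left_tail: str = "", right_head: str = "") -> str:
    """Rotate a vector sequence so that it starts just after the cloning site.

    vector        sequence as stored (base 1 is vector[0])
    cloning_site  number of bases before the insert (60 in the pGEM-T caption)
    left_tail     nucleotides appended after the left arm (assumed, e.g. "T")
    right_head    nucleotides prepended to the right arm (assumed, may be empty)
    """
    if not 0 < cloning_site < len(vector):
        raise ValueError("cloning site must fall inside the vector")
    left_arm = vector[:cloning_site] + left_tail    # bases 1..site plus the tail
    right_arm = right_head + vector[cloning_site:]  # bases site+1..end
    return right_arm + left_arm                     # left arm moved to the end


# Toy sequence, not pGEM-T: the first three bases plus a T move to the end.
print(relinearize("AAACCCGGG", 3, left_tail="T"))
```

For the case in the caption the call would look like `relinearize(pgem_t, 60, left_tail="T")`, where `pgem_t` stands for the vector sequence loaded by the user. The result would then replace the original entry in the vector file screened against the ESTs.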

Cited by

Candidate olfaction genes identified within the Helicoverpa armigera Antennal Transcriptome.
Liu Y, Gu S, Zhang Y, Guo Y, Wang G. PLoS One. 2012;7(10):e48260. doi: 10.1371/journal.pone.0048260. Epub 2012 Oct 26. PMID: 23110222
A second generation framework for the analysis of microsatellites in expressed sequence tags and the development of EST-SSR markers for a conifer, Cryptomeria japonica.
Ueno S, Moriguchi Y, Uchiyama K, Ujino-Ihara T, Futamura N, Sakurai T, Shinohara K, Tsumura Y. BMC Genomics. 2012 Apr 16;13:136. doi: 10.1186/1471-2164-13-136. PMID: 22507374
Combined methylation mapping of 5mC and 5hmC during early embryonic stages in bovine.
de Montera B, Fournier E, Shojaei Saadi HA, Gagné D, Laflamme I, Blondin P, Sirard MA, Robert C. BMC Genomics. 2013 Jun 18;14:406. doi: 10.1186/1471-2164-14-406. PMID: 23773395
Double-stranded RNA in the biological control of grain aphid (Sitobion avenae F.).
Wang D, Liu Q, Li X, Sun Y, Wang H, Xia L. Funct Integr Genomics. 2015 Mar;15(2):211-23. doi: 10.1007/s10142-014-0424-x. Epub 2014 Dec 3. PMID: 25467938
Caecilians maintain a functional long-wavelength-sensitive cone opsin gene despite signatures of relaxed selection and more than 200 million years of fossoriality.
Méndez MJN, Amini SS, Santos JC, Saal J, Wake MH, Ron SR, Tarvin RD. bioRxiv [Preprint]. 2025 Feb 8:2025.02.07.636964. doi: 10.1101/2025.02.07.636964. PMID: 39975400
