An optimized procedure greatly improves EST vector contamination removal
- PMID: 17997864
- PMCID: PMC2194723
- DOI: 10.1186/1471-2164-8-416
Abstract
Background: The enormous amount of sequence data available in public-domain databases has been a gold mine for researchers exploring various themes in the life sciences, so the quality of such data is of serious concern. Removing vector contamination is one of the most significant steps in obtaining accurate sequence data that contain only the cDNA insert from the basecalls output by an automatic DNA sequencer. Popular bioinformatics programs for vector trimming include LUCY, cross_match and SeqClean.
Results: In a recent study, we used the program SeqClean to remove vector contamination from our test set of EST data, compiled through various library construction systems; a significant number of errors remained after preliminary trimming. These errors were later almost completely corrected by simply using a re-linearized form of the cloning vector to compare against the target ESTs. The modified trimming procedure for SeqClean was also compared, in terms of trimming efficiency, with the other two popular programs, LUCY2 and cross_match. SeqClean with a re-linearized form of the cloning vector significantly surpassed the other two programs in all tested conditions, while the performance of the other two programs was not influenced by the modified procedure. Vector contamination in dbEST was also investigated in this study: 2203 of the 48212 ESTs sampled from dbEST (2007-04-18 freeze), about 4.6%, were found to match sequences in UNIVEC.
Conclusion: Vector contamination remains a serious concern for data quality in public sequence databases. Based on the results presented here, we feel that our modified procedure with SeqClean should be recommended to all researchers for the task of vector removal from EST or genomic sequences.
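The abstract does not spell out how the cloning vector is re-linearized. As a minimal sketch, assuming it means rotating the circular vector sequence so that the linear record begins and ends at the cloning site, the operation could look like the following. The function names, the toy sequence and the choice of an EcoRI site (GAATTC, cut after the first G) are hypothetical and used only for illustration; they are not taken from the paper.

```python
def relinearize(vector_seq: str, cut_index: int) -> str:
    """Rotate a circular sequence so that the linear form starts at cut_index."""
    if not vector_seq:
        raise ValueError("empty vector sequence")
    cut = cut_index % len(vector_seq)
    return vector_seq[cut:] + vector_seq[:cut]


def relinearize_at_site(vector_seq: str, site: str, offset: int = 0) -> str:
    """Re-linearize a circular vector at the first occurrence of a cloning site.

    `offset` is the cut position inside the site (assumption: 1 for EcoRI,
    which cuts G^AATTC). The search also covers a site that spans the
    origin of the stored linear record.
    """
    seq = vector_seq.upper()
    site = site.upper()
    if not site:
        raise ValueError("empty site")
    # Append the first len(site) - 1 bases so a site wrapping the origin is found.
    wrapped = seq + seq[: len(site) - 1]
    pos = wrapped.find(site)
    if pos < 0:
        raise ValueError(f"site {site!r} not found in vector")
    return relinearize(seq, pos + offset)


if __name__ == "__main__":
    toy_vector = "AAACCCGAATTCGGGTTT"  # hypothetical 18-base circular vector
    print(relinearize_at_site(toy_vector, "GAATTC", offset=1))
```

The rotated sequence has the same length and base composition as the input; only the position of the linear break changes, so the vector bases flanking the insert sit at the two ends of the record. The result would then be supplied as the vector reference to the trimming program in place of the original record.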
Similar articles
- Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data. BMC Biotechnol. 2012 May 3;12:16. doi: 10.1186/1472-6750-12-16. PMID: 22554190. Free PMC article.
- Peanut gene expression profiling in developing seeds at different reproduction stages during Aspergillus parasiticus infection. BMC Dev Biol. 2008 Feb 4;8:12. doi: 10.1186/1471-213X-8-12. PMID: 18248674. Free PMC article.
- [Analysis, identification and correction of some errors of model refseqs appeared in NCBI Human Gene Database by in silico cloning and experimental verification of novel human genes]. Yi Chuan Xue Bao. 2004 May;31(5):431-43. PMID: 15478601. Chinese.
- ConiferEST: an integrated bioinformatics system for data reprocessing and mining of conifer expressed sequence tags (ESTs). BMC Genomics. 2007 May 29;8:134. doi: 10.1186/1471-2164-8-134. PMID: 17535431. Free PMC article.
- Rapid in silico cloning of genes using expressed sequence tags (ESTs). Biotechnol Annu Rev. 2000;5:25-44. doi: 10.1016/s1387-2656(00)05031-6. PMID: 10874996. Review.
Cited by
- Candidate olfaction genes identified within the Helicoverpa armigera Antennal Transcriptome. PLoS One. 2012;7(10):e48260. doi: 10.1371/journal.pone.0048260. Epub 2012 Oct 26. PMID: 23110222. Free PMC article.
- A second generation framework for the analysis of microsatellites in expressed sequence tags and the development of EST-SSR markers for a conifer, Cryptomeria japonica. BMC Genomics. 2012 Apr 16;13:136. doi: 10.1186/1471-2164-13-136. PMID: 22507374. Free PMC article.
- Combined methylation mapping of 5mC and 5hmC during early embryonic stages in bovine. BMC Genomics. 2013 Jun 18;14:406. doi: 10.1186/1471-2164-14-406. PMID: 23773395. Free PMC article.
- Double-stranded RNA in the biological control of grain aphid (Sitobion avenae F.). Funct Integr Genomics. 2015 Mar;15(2):211-23. doi: 10.1007/s10142-014-0424-x. Epub 2014 Dec 3. PMID: 25467938.
- Caecilians maintain a functional long-wavelength-sensitive cone opsin gene despite signatures of relaxed selection and more than 200 million years of fossoriality. bioRxiv [Preprint]. 2025 Feb 8:2025.02.07.636964. doi: 10.1101/2025.02.07.636964. PMID: 39975400. Free PMC article. Preprint.