BMC Bioinformatics. 2012 Sep 26:13:247.
doi: 10.1186/1471-2105-13-247.

ESTclean: a cleaning tool for next-gen transcriptome shotgun sequencing

Hongseok Tae et al. BMC Bioinformatics. 2012.

Abstract

Background: With the advent of next-generation sequencing (NGS) technologies, full cDNA shotgun sequencing has become a major approach in the study of transcriptomes, and several different protocols for 454 sequencing have been developed. Because each protocol attaches its own short DNA tags or adapters to the ends of cDNA fragments for labeling or sequencing, each leaves different contaminants in the reads, which may lead to mis-assembly and inaccurate sequence products.

Results: We have designed and implemented a new program for raw sequence cleaning, available both as a graphical user interface and as a batch script. The cleaning process consists of several modules: barcode trimming, sequencing adapter trimming, amplification primer trimming, poly-A tail trimming, vector screening and low quality region trimming. These modules can be combined to suit different sequencing applications.
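
The idea of combining cleaning modules can be illustrated with a minimal Python sketch. It is not ESTclean's code or interface: the `Read` and `Module` types, the function names, the exact-match barcode rule and the quality threshold are all assumptions made for illustration. Each module takes a read and returns a cleaned read, so any subset can be chained in order.

```python
from typing import Callable, List, Tuple

Read = Tuple[str, List[int]]     # (bases, per-base quality scores); hypothetical type
Module = Callable[[Read], Read]  # hypothetical signature shared by all modules


def trim_barcode(barcode: str) -> Module:
    """Return a module that removes an exact 5' barcode if present (assumed rule)."""
    def module(read: Read) -> Read:
        seq, quals = read
        if seq.startswith(barcode):
            return seq[len(barcode):], quals[len(barcode):]
        return read
    return module


def trim_low_quality_3prime(min_quality: int) -> Module:
    """Return a module that drops trailing bases scoring below min_quality."""
    def module(read: Read) -> Read:
        seq, quals = read
        end = len(seq)
        while end > 0 and quals[end - 1] < min_quality:
            end -= 1
        return seq[:end], quals[:end]
    return module


def run_pipeline(read: Read, modules: List[Module]) -> Read:
    """Apply the selected modules in the given order."""
    for module in modules:
        read = module(read)
    return read


# Two modules chosen for a hypothetical barcoded protocol.
pipeline = [trim_barcode("ACGT"), trim_low_quality_3prime(20)]
cleaned = run_pipeline(("ACGTTTGACCA", [30] * 9 + [10, 5]), pipeline)
print(cleaned)
```

A different protocol would simply pass a different list of modules to `run_pipeline`.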

Conclusions: ESTclean is a software package not only for cleaning cDNA sequences, but also for helping to develop sequencing protocols by providing summary tables and figures for sequencing quality control in a graphical user interface. It is particularly effective at cleaning read sequences from complicated sequencing protocols that use barcodes and multiple amplification primers.

Figures

Figure 1
Amplification primer trimming. When a primer matches partially to the end of a read, ESTclean checks the minimum percent identity and length of the match and the minimum number of unaligned bases in the primer (a) and read (b).
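
The checks named in this caption can be sketched in Python as follows. This is a hedged illustration, not ESTclean's implementation: the match is assumed to be ungapped and at the 5' end, (a) is taken to be the primer bases left after the matched segment, (b) the read bases before the match, and both are treated as upper limits. The function names, default thresholds and example primer are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MatchStats:
    length: int            # number of aligned columns (ungapped in this sketch)
    identity: float        # percent of aligned columns with identical bases
    primer_unaligned: int  # (a) primer bases left outside the match (assumed definition)
    read_unaligned: int    # (b) read bases between the read end and the match


def describe_match(primer: str, read: str, primer_start: int,
                   read_start: int, length: int) -> MatchStats:
    """Summarise an ungapped match of a primer segment to a read segment."""
    if length <= 0:
        raise ValueError("length must be positive")
    p = primer[primer_start:primer_start + length]
    r = read[read_start:read_start + length]
    if len(p) != length or len(r) != length:
        raise ValueError("match extends past the primer or the read")
    same = sum(1 for x, y in zip(p, r) if x == y)
    return MatchStats(
        length=length,
        identity=100.0 * same / length,
        primer_unaligned=len(primer) - (primer_start + length),
        read_unaligned=read_start,
    )


def accept_match(stats: MatchStats, min_identity: float = 90.0,
                 min_length: int = 10, max_primer_unaligned: int = 2,
                 max_read_unaligned: int = 2) -> bool:
    """Accept a partial primer match only if every threshold is satisfied."""
    return (stats.identity >= min_identity
            and stats.length >= min_length
            and stats.primer_unaligned <= max_primer_unaligned
            and stats.read_unaligned <= max_read_unaligned)


primer = "AAGCAGTGGTATCAACGCAGAGT"           # example primer, chosen for illustration
read = "GTATCAACGGAGAGTCCGGATTTACG"          # starts with the last 15 primer bases, one mismatch
stats = describe_match(primer, read, primer_start=8, read_start=0, length=15)
trimmed = read[stats.read_unaligned + stats.length:] if accept_match(stats) else read
print(stats, trimmed)
```

In practice the match coordinates would come from an alignment step, which this sketch leaves out.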
Figure 2
Poly A tail trimming. A poly A tail is recognized by the length (Lp) of the tail, the ratio of the number (NA) of As to Lp, and the length (Le) of the 3’ end.
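
A minimal Python sketch of a detector built on these three quantities is given below. It is an illustration under stated assumptions rather than ESTclean's algorithm: Le is taken to be the number of bases between the tail and the 3' end of the read, the tail is required to start and end with an A, and the thresholds `min_lp`, `min_ratio` and `max_le` are hypothetical defaults.

```python
from typing import Optional, Tuple


def find_poly_a_tail(read: str, min_lp: int = 10, min_ratio: float = 0.9,
                     max_le: int = 5) -> Optional[Tuple[int, int]]:
    """Return (start, end) of a poly-A tail as read[start:end], or None.

    A candidate tail must have length Lp >= min_lp, an A fraction
    NA / Lp >= min_ratio, and at most max_le bases (Le) after it.
    """
    n = len(read)
    for le in range(min(max_le, n) + 1):
        end = n - le                      # tail candidate ends here (exclusive)
        a_count = 0
        best_start = None
        for start in range(end - 1, -1, -1):
            if read[start] == "A":
                a_count += 1
                lp = end - start
                if lp >= min_lp and a_count / lp >= min_ratio:
                    best_start = start    # keep extending towards the 5' end
        if best_start is not None and read[end - 1] == "A":
            return best_start, end
    return None


read = "GATTACAGGCTTCG" + "AAAAAAAAGAAAAAAA" + "CGT"
tail = find_poly_a_tail(read)
trimmed = read[:tail[0]] if tail else read
print(tail, trimmed)
```

Here the tail tolerates one sequencing error (the G inside the run of As) and three trailing bases, and trimming removes everything from the start of the tail onward.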
Figure 3
ESTclean screenshot. The left panel displays the steps and progress of a cleaning process. On the right panel, the options tab is used for specifying the sequence and quality score files, an output directory, and parameters for all cleaning modules. The statistics tab shows various cleaning statistics for quality control. The bottom panel displays messages and errors produced during cleaning.
Figure 4
Summary tables and figures. For validation of final products, several charts and tables are provided in order to display statistical information of trimming results. A: The numbers of reads and bases, and minimum and maximum read lengths for each cleaning step. B: The distribution of read lengths for each cleaning step. C: The distribution of quality scores for each cleaning step. D: The percentage of top 30 k-mers in cleaned sequences. E: The histogram of primer matches. F: The number of good and bad reads in terms of primer combinations. G: The number of primers identified at each base position. H: The histogram of lengths of trimmed poly A tails and T heads.
Figure 5
Erroneous read types. RF: reverse and forward matched primers in the 5’ and 3’ ends respectively; fr: forward and reverse matches of the same primer; SF: forward match in the 5’ end but with unaligned bases before it; RE: reverse match in the 3’ end but with unaligned bases after it; NF: multiple forward matches; NR: multiple reverse matches.
Figure 6
Evaluation method. Mapping results, A(E) and A(S), by GMAP for reads, E and S, cleaned by SeqClean and ESTclean respectively are evaluated to decide whether the reads are over- or under-trimmed. At the 5’ end, while SeqClean performs correct trimming, the read from ESTclean is under-trimmed as its 5’ end is not aligned to the genome. At the 3’ end, ESTclean over-trims while SeqClean under-trims because the latter has unaligned bases and the trimmed region of the former is real (aligned).
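
The decision rule in this caption can be sketched in Python. The sketch is a simplification under assumptions, not the authors' evaluation code: each cleaned read is described by hypothetical coordinates in the raw read (the kept interval and the interval GMAP aligned), a read end is called under-trimmed when kept bases there are unaligned, and over-trimmed when the other tool's alignment extends into the removed region. Positive values mean over-trimming and negative values under-trimming, following the sign convention of Figure 8.

```python
from dataclasses import dataclass


@dataclass
class CleanedRead:
    start: int      # first kept base, in raw-read coordinates (hypothetical layout)
    end: int        # one past the last kept base
    aln_start: int  # first kept base that aligned to the genome
    aln_end: int    # one past the last kept base that aligned


def five_prime_error(own: CleanedRead, other: CleanedRead) -> int:
    """Signed 5' trimming error: negative = under-trimmed, positive = over-trimmed."""
    unaligned = own.aln_start - own.start
    if unaligned > 0:
        return -unaligned                     # kept bases that do not align
    return max(0, own.start - other.aln_start)  # removed bases the other tool aligned


def three_prime_error(own: CleanedRead, other: CleanedRead) -> int:
    """Signed 3' trimming error, same convention as five_prime_error."""
    unaligned = own.end - own.aln_end
    if unaligned > 0:
        return -unaligned
    return max(0, other.aln_end - own.end)


# Made-up coordinates mirroring the situation described in the caption.
e = CleanedRead(start=12, end=95, aln_start=12, aln_end=90)  # read E (SeqClean)
s = CleanedRead(start=8, end=85, aln_start=12, aln_end=85)   # read S (ESTclean)

print(five_prime_error(e, s), five_prime_error(s, e))    # E correct, S under-trimmed
print(three_prime_error(e, s), three_prime_error(s, e))  # E under-trimmed, S over-trimmed
```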
Figure 7
Experiment workflow. Because SeqClean cannot trim barcodes, reads already barcode-trimmed by ESTclean were used as input data. The reads cleaned by SeqClean and by ESTclean were mapped to the reference genome using GMAP. We filtered out multiply mapped reads and reads whose mappings did not overlap by at least 40 bp in the genome. In the end, 1,290,547 reads were used for evaluation.
Figure 8
Histogram of over- and under-trimmed lengths in the 5’ (left) and 3’ (right) ends. The positive and negative X axes represent over- and under-trimming respectively. The dotted red line represents the cumulative difference in the number of over- and under-trimmed reads between ESTclean and SeqClean.
