Biomolecules. 2020 Jun 8;10(6):878.
doi: 10.3390/biom10060878.

High-Throughput Identification of Adapters in Single-Read Sequencing Data

Asan M S H Mohideen et al. Biomolecules.

Abstract

Sequencing datasets available in public repositories are already numerous, and their growth is exponential. Raw sequencing data files constitute a substantial portion of these data, and they need to be pre-processed before any downstream analysis. The removal of adapter sequences is the first essential step. Tools available for the automated detection of adapters in single-read sequencing datasets have certain limitations. To explore these datasets, one needs to retrieve the adapter sequences from the methods sections of the corresponding research articles. This can be time-consuming in metadata analyses. Moreover, not all research articles report the adapter sequences. We have developed adapt_find, a tool that automates the identification of adapter sequences in raw single-read sequencing datasets. We have verified adapt_find by testing it on a number of publicly available datasets. adapt_find provides a robust, reliable and high-throughput process across different sequencing technologies and various adapter designs. It does not need prior knowledge of the adapter sequences. We also produced two associated tools: random_mer, for the detection of random N bases on one or both termini of the reads, and fastqc_parser, for consolidating the results from FASTQC outputs. Together, these form a valuable tool set for metadata analyses on multiple sequencing datasets.

Keywords: 454 pyrosequencing; Illumina; Ion-Torrent; SOLiD; adapter oligonucleotides; adapter trimming; randomized adapters; single-read sequencing; small RNA sequencing.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Essential information on adapter sequences for an effective trimming process. The 5′ end adapter is usually of constant length, while the length of the 3′ end adapter may vary within a dataset. Identifying the adapters requires an exact match to the 3′ end of the 5′ end adapter (its tail part, adjacent to the biological sequence) and to the 5′ end of the 3′ end adapter (its head part, adjacent to the biological sequence). In the latter case, the shortest variant of the 3′ end adapter suffices.
Figure 2
Schematic representation of the output read format in different sequencing technologies. In all four sequencing technologies, 3′ end adapters are ligated to biological sequences in the sequencing outputs; in addition, 5′ end adapters are present in Ion Torrent and 454 pyrosequencing outputs. Illumina output reads may have a four-letter barcode at the 5′ end, and/or four random “N” nucleotides at both ends. Similarly, depending on the library preparation kit used, output reads from Ion Torrent might have a random 5-mer and a three-letter barcode in addition to the 5′ end and 3′ end adapters.
Figure 3
The adapt_find workflow. Black boxes: general procedure, green boxes: exit step, blue boxes: alternative strategy, yellow diamonds: decision.
Figure 4
random_mer workflow. Black boxes: general procedure, green boxes: exit step, orange boxes: further recommended process, yellow diamonds: decision.
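
The trimming requirement described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the adapt_find implementation: the function name trim_read, its parameters, and the adapter and read sequences are all hypothetical, and matching is done with plain exact string search. It assumes the caller already knows the tail of the 5′ end adapter and the head of the shortest 3′ end adapter variant.

```python
def trim_read(read, five_prime_tail=None, three_prime_head=None):
    """Return the biological insert of `read`.

    five_prime_tail: tail part of the 5' end adapter (adjacent to the insert).
    three_prime_head: head part of the 3' end adapter (adjacent to the insert).
    Returns None when a 5' tail is given but not found in the read.
    """
    start = 0
    if five_prime_tail:
        pos = read.find(five_prime_tail)  # leftmost exact match
        if pos == -1:
            return None
        start = pos + len(five_prime_tail)

    end = len(read)
    if three_prime_head:
        pos = read.find(three_prime_head, start)  # search only after the 5' adapter
        if pos != -1:
            end = pos  # if absent, keep the read up to its end

    return read[start:end]


# Hypothetical sequences, made up for illustration only.
FIVE_TAIL = "ACTCAG"
THREE_HEAD = "TGGAATTC"
INSERT = "TTTGGCAATGGTAGAACTCACACT"
READ = "GGCC" + FIVE_TAIL + INSERT + THREE_HEAD + "TCGTATGCC"

if __name__ == "__main__":
    print(trim_read(READ, FIVE_TAIL, THREE_HEAD))
```

Because only the head of the 3′ end adapter is searched for, the same call handles reads in which the 3′ end adapter is truncated to different lengths, which is why the shortest variant is sufficient.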
