Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Jul 13:6:e5175.
doi: 10.7717/peerj.5175. eCollection 2018.

SECAPR-a bioinformatics pipeline for the rapid and user-friendly processing of targeted enriched Illumina sequences, from raw reads to alignments

Affiliations

SECAPR-a bioinformatics pipeline for the rapid and user-friendly processing of targeted enriched Illumina sequences, from raw reads to alignments

Tobias Andermann et al. PeerJ. .

Abstract

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing technologies such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.

Keywords: Allele phasing; Assembly; BAM; Contig; Exon capture; FASTQ; Next generation sequencing (NGS); Phylogenetics; Phylogeography; Target capture.

PubMed Disclaimer

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1
Figure 1. SECAPR analytical workflow.
The flowchart shows the basic SECAPR functions, which are separated into two steps (colored boxes). Blue box (1. reference library from raw data): in this step the raw reads are cleaned and assembled into contigs (de novo assembly); orange box (2. reference based assembly with custom reference library): the contigs from the previous step are used for reference-based assembly, enabling allele phasing and additional quality control options, e.g., concerning read-coverage. Black boxes show SECAPR commands and white boxes represent the input and output data of the respective function. Boxes marked in grey represent multiple sequence alignments (MSAs) generated with SECAPR, which can be used for phylogenetic inference.
Figure 2
Figure 2. Overview of FASTQc quality test result.
(A) Before and (B) after cleaning and adapter trimming of sequencing reads with the SECAPR function clean_reads. This plot, as produced by SECAPR, provides an overview of the complete dataset and helps to gauge if the chosen cleaning parameters are appropriate for the dataset. The summary plots show the FASTQc test results, divided into three categories: passed (green), warning (blue) and failed (red). The x-axis of all plots contains the eleven different quality tests (see legend). The bar-plots (‘count’) represent the counts of each test result (pass, warning or fail) across all samples. The matrix plots (‘samples’) show the test result of each test for each sample individually (y-axis). This information can be used to evaluate both, which specific parameters need to be adjusted and which samples are the most problematic.
Figure 3
Figure 3. Reference-based assembly including heterozygous sites.
BAM-assembly file as generated with the SECAPR reference_assembly function, shown exemplarily for one exon locus (1/837) of one of the Geonoma samples (1/17). The displayed assembly contains all FASTQ sequencing reads that could be mapped to the reference sequence. The reference sequence in this case is the de-novo contig that was matched to the reference exon ‘Elaeis 1064 3’. DNA bases are color-coded (A, green; C, blue; G, black; T, red). The enlarged section contains a heterozygous site, which likely represents allelic variation, as both variants A and G are found at approximately equal ratio.
Figure 4
Figure 4. Overview of sequence yield for Geonoma sample data, produced with SECAPR.
The matrix plots show an overview of the contig yield and read-coverage for all targeted loci (A + B) and for the selection of the 50 loci with the best read coverage (C + D), selected with the SECAPR function locus_selection (see Table S5 for loci-names corresponding to indices on x-axes). (A) and (C) show if de novo contigs could be assembled (blue) or not (white) for the respective locus (column) and sample (row). Contig MSAs were generated for all loci that could be recovered for at least three samples (green). (B) and (D) show the read coverage (see legend) for each exon locus after reference-based assembly. The reference library for the assembly consisted of the consensus sequences of each contig MSA, and hence is genus specific for Geonoma.

References

    1. Andermann T, Fernandes AM, Olsson U, Töpel M, Pfeil B, Oxelman B, Aleixo A, Faircloth BC, Antonelli A, Renner S. Allele phasing greatly improves the phylogenetic utility of ultraconserved elements. Systematic Biology. 2018 doi: 10.1093/sysbio/syy039. Epub ahead of print May 15 2018. - DOI - PMC - PubMed
    1. BabrahamBioinformatics FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2 June]. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
    1. Bi K, Vanderpool D, Singhal S, Linderoth T, Moritz C, Good JM. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics. 2012;13:403. doi: 10.1186/1471-2164-13-403. - DOI - PMC - PubMed
    1. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–2120. doi: 10.1093/bioinformatics/btu170. - DOI - PMC - PubMed
    1. Botero-Castro F, Tilak MK, Justy F, Catzeflis F, Delsuc F, Douzery EJP. Next-generation sequencing and phylogenetic signal of complete mitochondrial genomes for resolving the evolutionary history of leaf-nosed bats (Phyllostomidae) Molecular Phylogenetics and Evolution. 2013;69:728–739. doi: 10.1016/j.ympev.2013.07.003. - DOI - PubMed