SECAPR-a bioinformatics pipeline for the rapid and user-friendly processing of targeted enriched Illumina sequences, from raw reads to alignments

Tobias Andermann^{1,2}, Ángela Cano^{2,3}, Alexander Zizka^{1,2}, Christine Bacon^{1,2}, Alexandre Antonelli^{1,2,4,5}

Affiliations

¹ Department of Biological and Environmental Sciences, University of Gothenburg, Gothenburg, Sweden.
² Gothenburg Global Biodiversity Centre, Gothenburg, Sweden.
³ Department of Botany and Plant Biology, University of Geneva, Geneva, Switzerland.
⁴ Gothenburg Botanical Garden, Gothenburg, Sweden.
⁵ Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, United States of America.

PMID: 30023140
PMCID: PMC6047508
DOI: 10.7717/peerj.5175

Tobias Andermann et al. PeerJ. 2018 Jul 13;6:e5175. doi: 10.7717/peerj.5175. eCollection 2018.

Abstract

Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing technologies such as Massively Parallel Sequencing (MPS) can generate billions of nucleotides in less than a day. The current bottleneck is how to handle, process, and analyze such large amounts of data efficiently, in an automated and reproducible way. To tackle these challenges, we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly, and we provide an exhaustive tutorial based on empirical data, intended for users with no prior experience in analyzing MPS output. SECAPR is particularly useful for processing sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus Geonoma (Arecaceae). Various quality control and plotting functions help the user decide on the most suitable settings, even for challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, designed to enable efficient and reproducible processing of MPS data for many samples in parallel.

Keywords: Allele phasing; Assembly; BAM; Contig; Exon capture; FASTQ; Next generation sequencing (NGS); Phylogenetics; Phylogeography; Target capture.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

**Figure 1. SECAPR analytical workflow.**
The flowchart shows the basic SECAPR functions, which are separated into two steps (colored boxes). Blue box (1. reference library from raw data): in this step the raw reads are cleaned and assembled into contigs (*de novo* assembly); orange box (2. reference based assembly with custom reference library): the contigs from the previous step are used for reference-based assembly, enabling allele phasing and additional quality control options, e.g., concerning read-coverage. Black boxes show SECAPR commands and white boxes represent the input and output data of the respective function. Boxes marked in grey represent multiple sequence alignments (MSAs) generated with SECAPR, which can be used for phylogenetic inference.

**Figure 2. Overview of FastQC quality test results.**
(A) Before and (B) after cleaning and adapter trimming of sequencing reads with the SECAPR function *clean_reads*. This plot, as produced by SECAPR, provides an overview of the complete dataset and helps to gauge whether the chosen cleaning parameters are appropriate for the dataset. The summary plots show the FastQC test results, divided into three categories: passed (green), warning (blue) and failed (red). The x-axis of all plots shows the eleven quality tests (see legend). The bar plots (‘count’) show the number of samples with each test result (pass, warning or fail). The matrix plots (‘samples’) show the result of each test for each sample individually (y-axis). This information can be used to evaluate both which parameters need to be adjusted and which samples are the most problematic.
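The per-test counts described in this caption can be illustrated with a minimal sketch. It assumes each sample's FastQC `summary.txt` is available as text, with tab-separated lines of the form status, test name, file name. The function name `tally_fastqc` and the sample data are hypothetical, and this is not SECAPR's own plotting code.

```python
from collections import Counter


def tally_fastqc(summary_texts):
    """Count PASS/WARN/FAIL results per quality test across samples.

    summary_texts maps a sample name to the text of its FastQC
    summary.txt (assumed format: status<TAB>test name<TAB>file name).
    Returns (counts, matrix): counts[test][status] is the number of
    samples with that result; matrix[sample][test] is the status.
    """
    counts = {}
    matrix = {}
    for sample, text in summary_texts.items():
        matrix[sample] = {}
        for line in text.strip().splitlines():
            status, test, _filename = line.split("\t", 2)
            counts.setdefault(test, Counter())[status] += 1
            matrix[sample][test] = status
    return counts, matrix


# Hypothetical two-sample dataset with two of the quality tests.
summaries = {
    "sample1": "PASS\tBasic Statistics\ts1.fastq\nFAIL\tAdapter Content\ts1.fastq",
    "sample2": "PASS\tBasic Statistics\ts2.fastq\nWARN\tAdapter Content\ts2.fastq",
}
counts, matrix = tally_fastqc(summaries)
```

Here `counts` corresponds to the bar plots and `matrix` to the per-sample matrix plots; a test that fails in most samples suggests adjusting the cleaning parameters, whereas a sample that fails most tests suggests a problematic library.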

**Figure 3. Reference-based assembly including heterozygous sites.**
BAM assembly file generated with the SECAPR *reference_assembly* function, shown as an example for one exon locus (1 of 837) of one *Geonoma* sample (1 of 17). The displayed assembly contains all FASTQ sequencing reads that could be mapped to the reference sequence. The reference sequence in this case is the *de novo* contig that was matched to the reference exon ‘Elaeis 1064 3’. DNA bases are color-coded (A, green; C, blue; G, black; T, red). The enlarged section contains a heterozygous site, which likely represents allelic variation, as both variants A and G occur at approximately equal frequencies.
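The reasoning in this caption, that two bases at roughly equal frequency point to allelic variation rather than sequencing error, can be sketched as a simple allele-balance check on one alignment column. The function `call_site` and its thresholds (`min_depth`, `min_minor_fraction`) are illustrative assumptions, not the method SECAPR uses for allele phasing.

```python
from collections import Counter


def call_site(bases, min_depth=10, min_minor_fraction=0.3):
    """Classify one alignment column from the bases of the mapped reads.

    Returns () if coverage is below min_depth, a one-element tuple for a
    homozygous site, or a sorted two-element tuple when the second most
    common base reaches min_minor_fraction of the read depth.
    Both thresholds are hypothetical defaults.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return ()
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[1][1] / depth >= min_minor_fraction:
        return tuple(sorted((ranked[0][0], ranked[1][0])))
    return (ranked[0][0],)


balanced = call_site("A" * 12 + "G" * 11)   # two variants at similar frequency
likely_error = call_site("A" * 20 + "G")    # a single deviating read
too_shallow = call_site("AG")               # not enough reads to decide
```

A site with one deviating read among twenty is treated as homozygous under these assumed thresholds, which is why sufficient read coverage matters for distinguishing heterozygous sites from sequencing errors.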

**Figure 4. Overview of sequence yield for *Geonoma* sample data, produced with SECAPR.**
The matrix plots show an overview of the contig yield and read coverage for all targeted loci (A + B) and for the 50 loci with the best read coverage (C + D), selected with the SECAPR function *locus_selection* (see Table S5 for the locus names corresponding to the indices on the x-axes). (A) and (C) show whether *de novo* contigs could be assembled (blue) or not (white) for the respective locus (column) and sample (row). Contig MSAs were generated for all loci that could be recovered for at least three samples (green). (B) and (D) show the read coverage (see legend) for each exon locus after reference-based assembly. The reference library for the assembly consisted of the consensus sequences of each contig MSA, and hence is genus-specific for *Geonoma*.
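The selection of the best-covered loci can be sketched as follows. This is a minimal illustration that assumes loci are ranked by mean read coverage across all samples and that a locus must be recovered in at least three samples. The function `select_loci`, its parameters, and the toy coverage table are hypothetical; the actual ranking criterion of *locus_selection* may differ.

```python
def select_loci(coverage, n=50, min_samples=3):
    """Return up to n locus names ranked by mean read coverage.

    coverage maps locus name -> {sample name: mean read depth}.
    A missing sample or a depth of 0 counts as "not recovered".
    Loci recovered in fewer than min_samples samples are dropped
    (an assumed filter, mirroring the three-sample rule for MSAs).
    """
    samples = sorted({s for per_sample in coverage.values() for s in per_sample})

    def n_recovered(locus):
        return sum(1 for s in samples if coverage[locus].get(s, 0) > 0)

    def mean_depth(locus):
        return sum(coverage[locus].get(s, 0) for s in samples) / len(samples)

    kept = [locus for locus in coverage if n_recovered(locus) >= min_samples]
    return sorted(kept, key=mean_depth, reverse=True)[:n]


# Toy coverage table: three loci, three samples.
cov = {
    "locus_a": {"s1": 30, "s2": 20, "s3": 10},
    "locus_b": {"s1": 100, "s2": 0, "s3": 0},
    "locus_c": {"s1": 5, "s2": 5, "s3": 5},
}
best_two = select_loci(cov, n=2)
```

In this toy table `locus_b` has the highest coverage in one sample but is dropped because it is recovered in only one of three samples, which illustrates why yield across samples and read depth are inspected together.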
