Flexible parsing, interpretation, and editing of technical sequences with splitcode

Delaney K Sullivan^{1,2}, Lior Pachter^{2,3}

Affiliations

¹ UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, United States.
² Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, United States.
³ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, United States.

PMID: 38876979
PMCID: PMC11193061
DOI: 10.1093/bioinformatics/btae331

Bioinformatics. 2024 Jun 3;40(6):btae331.

Abstract

Motivation: Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed.

Results: We present a tool called splitcode that enables flexible and efficient parsing, interpreting, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.

Availability and implementation: The splitcode program is available at http://github.com/pachterlab/splitcode.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Overview of the *splitcode* workflow. The *splitcode* program takes a set of FASTQ files and a user-specified config file, which serves as a recipe describing how the reads should be parsed. The user runs *splitcode* on the command line, with options specifying how the output should be formatted. The output consists of one or more of the following: the original FASTQ files (possibly edited), the extracted sequences (e.g. UMI sequences, which are unknown in advance and must be extracted using location information or anchor points), and the final barcodes, which are unique for each combination of identified tags. The output may take the form of FASTQ files, gzip-compressed FASTQ files, BAM files, or interleaved sequences directed to standard output, depending on what the user specifies.
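
The idea of a final barcode that is unique for each combination of identified tags can be illustrated with a minimal Python sketch. This is not the splitcode implementation: the function names, the base-4 nucleotide encoding, and the 16-base barcode length are all assumptions made for illustration. Each new tag combination receives the next integer ID, the ID is written out as a nucleotide string, and a mapping table records how many reads carried each combination.

```python
from collections import Counter

def encode_id(n, length=16):
    """Encode a non-negative integer as a fixed-length base-4 string over ACGT.

    The encoding and the default length are illustrative assumptions.
    """
    if n < 0 or n >= 4 ** length:
        raise ValueError("ID does not fit in the requested barcode length")
    bases = "ACGT"
    digits = []
    for _ in range(length):
        digits.append(bases[n % 4])
        n //= 4
    return "".join(reversed(digits))

def assign_final_barcodes(tag_combos, length=16):
    """Give every distinct tag combination its own barcode.

    Returns (per_read, mapping): per_read holds one barcode per input read,
    and mapping holds (barcode, combination, read_count) rows in order of
    first appearance.
    """
    ids = {}            # combination -> integer ID, in order of first appearance
    counts = Counter()  # combination -> number of reads
    per_read = []
    for combo in tag_combos:
        combo = tuple(combo)
        if combo not in ids:
            ids[combo] = len(ids)
        counts[combo] += 1
        per_read.append(encode_id(ids[combo], length))
    mapping = [(encode_id(i, length), combo, counts[combo])
               for combo, i in ids.items()]
    return per_read, mapping
```

For example, calling `assign_final_barcodes` on three reads whose tag combinations are `("Barcode_A1", "B1")`, `("Barcode_A2", "B1")`, and `("Barcode_A1", "B1")` gives the first and third reads the same barcode, and the mapping table reports a count of 2 for that combination.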

**Figure 2.**
Example of *splitcode* usage. The reads from this hypothetical sequencing technology contain multiple regions that need to be parsed, including some of variable length. In the config file, the regions to be parsed are organized into groups, and each “group” contains multiple tags. The tags in the grp_A group have the value 1 in the “distance” column, meaning an error tolerance of Hamming distance 1. The values in the “next” column indicate that after a grp_A tag (i.e. Barcode_A1, Barcode_A2, or Barcode_A3) is found, we should next search only for tags in the grp_B group. The “maxFindsG” values of 1 mean that a given group can be found at most once (e.g. after finding a tag in grp_A, stop searching for tags in grp_A). The “location” for grp_A tags has the value 0:0:5, meaning that the tag is found in file #0 (i.e. the R1 file) within positions 0–5 of the read; for grp_B tags, splitcode searches file #0 within positions 5–100. In the header of the config file, the @extract option contains an expression indicating that we should extract an 8-bp sequence, which we name umi, 3 bases following identification of a grp_B tag. The supplied @trim-3 option specifies that only 3′-end trimming should be performed: 0 bases from the R1 file and 4 bases from the R2 file. Thus, here, the output R1 file will contain the original R1 sequences (i.e. the entirety of Barcode A, Region 1, Barcode B, NNN, UMI, and Region 2) while the output R2 file will contain just the cDNA. The output “Final Barcodes” FASTQ file will contain a sequence uniquely identifying a combination of tags, and the mapping file allows us to map the final barcode sequence back to the tag combination (the numbers in the right-most column of the mapping file represent how many reads that tag combination was found in).
Finally, it is important to note that this is simply one of many ways to parse this read structure with splitcode, and users can configure the options as they see fit. Users can also customize the output options. For example, users can choose to output reads that contain both grp_A and grp_B tags into one set of files and direct all other reads into a separate set of files, and they can choose whether to output the 8-bp UMI sequence into an independent file or to put it in the FASTQ header of the output reads. Users also have the option to output reads as a BAM file with the 8-bp UMI sequence encoded in a SAM tag.

Update of

Flexible parsing, interpretation, and editing of technical sequences with splitcode.
Sullivan DK, Pachter L. bioRxiv [Preprint]. 2023 Dec 9:2023.03.20.533521. doi: 10.1101/2023.03.20.533521. Update in: Bioinformatics. 2024 Jun 3;40(6):btae331. doi: 10.1093/bioinformatics/btae331. PMID: 36993532.
