This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2023 Dec 9:2023.03.20.533521.

doi: 10.1101/2023.03.20.533521.

Flexible parsing, interpretation, and editing of technical sequences with splitcode

Delaney K Sullivan^{1,2}, Lior Pachter^{2,3}

Affiliations

¹ UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
² Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA.
³ Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125, USA.

PMID: 36993532
PMCID: PMC10055216
DOI: 10.1101/2023.03.20.533521

Delaney K Sullivan et al. bioRxiv. 2023.

Update in

Flexible parsing, interpretation, and editing of technical sequences with splitcode.
Sullivan DK, Pachter L. Bioinformatics. 2024 Jun 3;40(6):btae331. doi: 10.1093/bioinformatics/btae331. PMID: 38876979. Free PMC article.

Abstract

Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting the results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed. We present a tool called splitcode that enables flexible and efficient parsing, interpreting, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.

Figures

**Figure 1:**
Overview of the splitcode workflow. The splitcode program takes in a set of FastQ files and a user-specified config file, which serves as a recipe describing how the reads should be parsed. The user runs splitcode on the command line, with options that specify how the output should be formatted. The output consists of one or more of the following: the original FastQ files (possibly edited), the extracted sequences (e.g. UMI sequences, which are unknown and must be extracted by using location information or anchor points), and the final barcodes, which are unique for each combination of identified tags. The output may take the form of FastQ files, gzip-compressed FastQ files, or interleaved sequences directed to standard output, depending on what the user specifies.

**Figure 2:**
Example of splitcode usage. The structure of the reads from this hypothetical sequencing technology contains multiple regions that need to be parsed, including some of variable length. In the config file, the regions that need to be parsed are organized into “groups”, and each group contains multiple tags. The tags in the grp_A group have the value 1 in the “distances” column, meaning an error tolerance of Hamming distance 1. The values in the “next” column indicate that after a grp_A tag (i.e. Barcode_A1, Barcode_A2, or Barcode_A3) is found, we should next search only for tags in the grp_B group. The “maxFindsG” values of 1 mean that a given group can be found at most once (e.g. after finding a tag in grp_A, stop searching for tags in grp_A). The “locations” for grp_A tags have the value 0:0:5, meaning that the tag is searched for in file #0 (i.e. the R1 file) within positions 0–5 of the read; for grp_B tags, splitcode searches file #0 within positions 5–100. In the header of the config file, the @extract option contains an expression indicating that we should extract an 8-bp sequence, which we name umi, starting 3 bases after an identified grp_B tag. The supplied @trim-3 option means that only 3′-end trimming should be performed: 0 bases from the R1 file and 4 bases from the R2 file. As output, the “Final Barcodes” FastQ file contains a sequence uniquely identifying a combination of tags, and the mapping file allows us to map the final barcode sequence back to the tag combination (the numbers in the right-most column of the mapping file represent how many reads that tag combination was found in). Finally, it is important to note that this is simply one of many ways to parse this read structure with splitcode, and users can configure the options however they see fit.
Further, users can also customize the output options (for example, users can choose to output reads that contain both grp_A and grp_B tags into one set of files and direct all other reads into a separate set of files, and users can choose whether to output the 8-bp UMI sequence into an independent file or to put it in the FastQ header of the outputted reads as a SAM tag).
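
The parsing logic described in this caption can be illustrated with a minimal Python sketch. This is not splitcode's implementation or its config syntax; it only mimics the described behavior (a grp_A tag within positions 0–5 with Hamming distance 1 tolerance, then a grp_B tag within positions 5–100, then an 8-bp UMI starting 3 bases after the grp_B tag, plus 3′-end trimming). The tag sequences, the example read, the exact-match setting for grp_B, and the function names `find_tag`, `parse_r1`, and `trim3` are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical tag sequences; the caption names the tags but not their sequences.
GRP_A: Dict[str, str] = {
    "Barcode_A1": "ACGTA",
    "Barcode_A2": "TTGCC",
    "Barcode_A3": "GGATC",
}
GRP_B: Dict[str, str] = {
    "Barcode_B1": "GATTACA",
    "Barcode_B2": "CCTGAGG",
}


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_tag(read: str, tags: Dict[str, str], start: int, end: int,
             max_dist: int) -> Optional[Tuple[str, int]]:
    """Return (tag id, position just past the tag) for the first tag that
    lies entirely within read[start:end] and is within max_dist mismatches."""
    limit = min(end, len(read))
    for pos in range(start, limit):
        for tag_id, seq in tags.items():
            stop = pos + len(seq)
            if stop <= limit and hamming(read[pos:stop], seq) <= max_dist:
                return tag_id, stop
    return None


def parse_r1(read: str) -> Optional[dict]:
    """Find one grp_A tag, then one grp_B tag, then extract the UMI."""
    hit_a = find_tag(read, GRP_A, 0, 5, max_dist=1)    # location 0:0:5, distance 1
    if hit_a is None:
        return None
    # Assumption: grp_B tags are matched exactly (distance 0).
    hit_b = find_tag(read, GRP_B, 5, 100, max_dist=0)  # location 0:5:100
    if hit_b is None:
        return None
    umi_start = hit_b[1] + 3                           # skip 3 bases after the tag
    if umi_start + 8 > len(read):
        return None
    return {"tags": [hit_a[0], hit_b[0]], "umi": read[umi_start:umi_start + 8]}


def trim3(seq: str, n: int) -> str:
    """Remove n bases from the 3' end of a sequence."""
    return seq[:len(seq) - n] if n > 0 else seq


# Hypothetical R1 read: A1 with one mismatch, a 2-bp spacer, B1,
# 3 skipped bases, the 8-bp UMI, and trailing sequence.
READ_R1 = "ACGTT" + "GG" + "GATTACA" + "TTT" + "ACGTACGT" + "CCCC"

if __name__ == "__main__":
    print(parse_r1(READ_R1))
    print(trim3("TTTTGGGGCCCCAAAA", 4))  # R2 trimmed by 4 bases at the 3' end
```

With the hypothetical read above, `parse_r1` reports the tag combination Barcode_A1 and Barcode_B1 together with the UMI `ACGTACGT`. Searching grp_B only after a grp_A hit mirrors the “next” column, and returning after the first hit per group mirrors a “maxFindsG” value of 1.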

**Figure 3:**
The splitcode graphical user interface (GUI). The GUI can be viewed in a web browser and is designed to facilitate creation of the splitcode config file and navigation of output options. The GUI also features live testing of the splitcode program on user-supplied sample sequences in FastQ format.
