[Preprint]. 2023 Dec 9:2023.03.20.533521.
doi: 10.1101/2023.03.20.533521.

Flexible parsing, interpretation, and editing of technical sequences with splitcode

Delaney K Sullivan et al. bioRxiv.

Abstract

Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting the results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed. We present splitcode, a tool that enables flexible and efficient parsing, interpretation, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.

Figures

Figure 1: Overview of the splitcode workflow. The splitcode program takes in a set of FastQ files and a user-specified config file, which serves as a recipe describing how the reads should be parsed. The user executes splitcode on the command line, specifying options that control how the output is formatted. The output consists of one or more of the following: the original FastQ files (possibly edited), the extracted sequences (e.g. UMI sequences, which are unknown in advance and must be extracted using location information or anchor points), and the final barcodes, which are unique for each combination of identified tags. The output may take the form of FastQ files, gzip-compressed FastQ files, or interleaved sequences directed to standard output, depending on what the user specifies.
Figure 2: Example of splitcode usage. The structure of the reads from this hypothetical sequencing technology contains multiple regions that need to be parsed, including some of variable length. In the config file, the regions that need to be parsed are organized into “groups”, and each group contains multiple tags. The tags in the grp_A group have the value 1 in the “distances” column, meaning an error tolerance of Hamming distance 1. The values in the “next” column indicate that after a grp_A tag (i.e. Barcode_A1, Barcode_A2, or Barcode_A3) is found, we should next search only for tags in the grp_B group. The “maxFindsG” values of 1 mean that a given group can be found at most once (e.g. after finding a tag in grp_A, stop searching for tags in grp_A). The “locations” for grp_A tags have the value 0:0:5, meaning that the tag is found in file #0 (i.e. the R1 file) within positions 0–5 of the read; for grp_B tags, splitcode searches file #0 within positions 5–100. In the header of the config file, the @extract option contains an expression indicating that we should extract an 8-bp sequence, which we name umi, starting 3 bases after an identified grp_B tag. The supplied @trim-3 option means that the only trimming performed is at the 3′ end: 0 bases from the R1 file and 4 bases from the R2 file. As output, the “Final Barcodes” FastQ file contains a sequence uniquely identifying a combination of tags, and the mapping file allows us to map the final barcode sequence back to the tag combination (the numbers in the right-most column of the mapping file represent how many reads that tag combination was found in). Finally, it is important to note that this is simply one of many ways to parse this read structure with splitcode, and users can configure the options however they see fit. Further, users can also customize the output options (for example, users can choose to output reads that contain both grp_A and grp_B tags into one set of files and direct all other reads into a separate set of files, and users can choose whether to output the 8-bp UMI sequence into an independent file or to put it in the FastQ header of the outputted reads as a SAM tag).
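The parsing logic described in the Figure 2 caption can be illustrated with a small Python sketch. This is not splitcode's implementation or its API; it only mimics the described behavior for a single R1 read. Several details are assumptions made for illustration: the tag sequences, the grp_B tag names (Barcode_B1, Barcode_B2), a Hamming tolerance of 0 for grp_B, and the reading of a location window as a half-open interval that the tag must fit entirely inside.

```python
from typing import Dict, Optional, Tuple

# Hypothetical tag sequences; only the grp_A tag names come from the caption.
GRP_A: Dict[str, str] = {
    "Barcode_A1": "AACGT",
    "Barcode_A2": "CCTGA",
    "Barcode_A3": "GGATC",
}
GRP_B: Dict[str, str] = {  # hypothetical names and sequences
    "Barcode_B1": "TTGCAAGC",
    "Barcode_B2": "ACGGTTCA",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_tag(read: str, tags: Dict[str, str], start: int, end: int,
             max_dist: int) -> Optional[Tuple[str, int, int]]:
    """Return (name, match_start, match_end) for the first tag lying fully
    inside read[start:end] within max_dist mismatches, or None.
    Stopping at the first hit mirrors a maxFindsG value of 1."""
    end = min(end, len(read))
    for pos in range(start, end):
        for name, seq in tags.items():
            stop = pos + len(seq)
            if stop <= end and hamming(read[pos:stop], seq) <= max_dist:
                return name, pos, stop
    return None


def parse_read(read: str) -> Optional[Tuple[str, str, str]]:
    """Find a grp_A tag in positions 0-5, then a grp_B tag in positions
    5-100, then take an 8-bp UMI starting 3 bases after the grp_B tag."""
    hit_a = find_tag(read, GRP_A, 0, 5, max_dist=1)
    if hit_a is None:
        return None
    # "next" column behavior: only search grp_B after a grp_A tag is found.
    hit_b = find_tag(read, GRP_B, 5, 100, max_dist=0)  # tolerance assumed
    if hit_b is None:
        return None
    umi_start = hit_b[2] + 3
    umi = read[umi_start:umi_start + 8]
    if len(umi) < 8:
        return None
    return hit_a[0], hit_b[0], umi


# A made-up read: Barcode_A1 with one mismatch, a 2-bp linker, Barcode_B1,
# a 3-bp spacer, an 8-bp UMI, and trailing sequence.
example_read = "AACGA" + "GG" + "TTGCAAGC" + "CAT" + "ACGTACGT" + "TTTT"
print(parse_read(example_read))
```

In this sketch, the pair of tag names returned for a read plays the role of the tag combination that splitcode would encode as a final barcode, and the third element corresponds to the extracted umi sequence.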
Figure 3: The splitcode graphical user interface (GUI). The GUI can be viewed in a web browser and is designed to facilitate creation of the splitcode config file and navigation of output options. The GUI also features live testing of the splitcode program on user-supplied sample sequences in FastQ format.
