Bioinformatics. 2024 Mar 4;40(3):btae102.
doi: 10.1093/bioinformatics/btae102.

Flexiplex: a versatile demultiplexer and search tool for omics data

Oliver Cheng et al. Bioinformatics. 2024.

Abstract

Motivation: Analyzing high-throughput sequencing data often requires identifying and extracting specific target sequences. Such tasks include identifying cellular barcodes and unique molecular identifiers (UMIs) in single-cell data and detecting specific genetic variants for genotyping. However, existing tools that perform these functions are often task-specific, for example demultiplexing barcodes only for a dedicated type of experiment, or are not tolerant to noise in the sequencing data.

Results: To overcome these limitations, we developed Flexiplex, a versatile and fast sequence searching and demultiplexing tool for omics data. It is based on the Levenshtein distance and thus allows imperfect matches. We demonstrate Flexiplex's application in three use cases: identifying cell-line-specific sequences in Illumina short-read single-cell data, and discovering and demultiplexing cellular barcodes in noisy long-read single-cell RNA-seq data. We show that Flexiplex achieves an excellent balance of accuracy and computational efficiency compared to leading task-specific tools.
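
To illustrate what searching with a Levenshtein distance means in practice, here is a minimal Python sketch of approximate substring search using a standard semi-global dynamic programme. It is not Flexiplex's implementation; the function name `best_approx_match` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Optional, Tuple


def best_approx_match(pattern: str, text: str, max_dist: int) -> Optional[Tuple[int, int]]:
    """Find the substring of `text` closest to `pattern` in edit distance.

    Returns (distance, end_index), where text[:end_index] ends with the
    best-matching substring, or None if no match is within `max_dist`.
    Illustrative helper only, not part of any published tool.
    """
    m = len(pattern)
    # prev[i] = edit distance between pattern[:i] and the best suffix of the
    # text seen so far. Before any text is read, that suffix is empty.
    prev = list(range(m + 1))
    best = (prev[m], 0)
    for j, ch in enumerate(text, start=1):
        # curr[0] = 0 lets a match start at any position in the text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # pattern character missing from the text
            )
        if curr[m] < best[0]:
            best = (curr[m], j)
        prev = curr
    return best if best[0] <= max_dist else None


# An exact hit and a hit with one substitution (G read as C).
print(best_approx_match("ACGT", "TTTACGTTT", max_dist=0))
print(best_approx_match("ACGT", "TTTACCTTT", max_dist=1))
```

With `max_dist=0` this behaves like an exact search such as grep; raising the threshold is what lets a search tolerate sequencing errors, at the cost of more potential false positives.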

Availability and implementation: Flexiplex is available at https://davidsongroup.github.io/flexiplex/.

Conflict of interest statement

None declared.

Figures

Figure 1.
(A) The demultiplexing approach used by Flexiplex. The left and right flanks are first searched for within a read. The barcode and UMI regions are then extracted from the intermediate sequence, with barcode error correction if known barcodes are provided. (B) UMAP of the short-read single-cell dataset of seven pooled cell lines. Cells positive for BCAS4-BCAS3, Adenovirus 5 EA1, and rs878887783 are indicated. (C) The number of cells identified with grep, seqkit grep, ugrep, and Flexiplex that express sequence from BCAS4-BCAS3 (SNP: using an MCF-7-specific variant; Reference: using the reference allele), Adenovirus 5 EA1, and rs878887783 in a short-read single-cell dataset of seven pooled cell lines. Cells that cluster away from the presumed cluster (hatched) are likely to be false positives, whereas those falling within the presumed cluster are true positives (values on bars). (D) The accuracy of barcode demultiplexing on a simulated set of 5 million single-cell RNA-seq long reads for Flexiplex, scTagger, and FLAMES, varying the maximum allowed edit distance to known barcodes between zero and three. (E) Assessment of cellular barcode demultiplexing on a real dataset of 248 cells sequenced with ONT for Flexiplex (with and without chimeric read splitting), scTagger, and FLAMES, varying the maximum allowed edit distance to known barcodes between zero and three. Correct barcodes will result in a higher level of consistent cell-line annotation. (F) Performance of Flexiplex and scTagger on a large dataset of 61 million reads, where decoy barcodes were used to assess demultiplexing accuracy. As scTagger reports multiple equidistant barcodes for each read, we assessed its performance either by removing reads with ambiguous barcodes or by counting any true barcode as a true positive. (G) The number of barcodes recovered across four datasets when no known barcode list was provided. As scTagger does not adjust the produced barcodes to remove empty droplets like the other methods, we used a script provided with Flexiplex, flexiplex-filter, to automatically refine the barcodes based on the end of the inflection point of the read-barcode frequency distribution. (H) The run-time (log scale, four threads) of stand-alone tools for barcode discovery, Flexiplex, BLAZE, and scTagger, as a function of the number of reads processed from the four datasets used for barcode discovery evaluation. See text and Supplementary Material for further details.
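
The approach in panel (A) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Flexiplex's code: the flanks are matched exactly here for brevity (the tool itself allows imperfect matches), the barcode and UMI are assumed to have fixed lengths and to sit in that order between the flanks, and reads whose barcode is equally close to two known barcodes are discarded. All function names, parameters, and sequences are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def correct_barcode(observed: str, known: Sequence[str], max_dist: int) -> Optional[str]:
    """Return the unique closest known barcode within max_dist, else None."""
    scored = sorted((levenshtein(observed, k), k) for k in known)
    if not scored or scored[0][0] > max_dist:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # ambiguous: two known barcodes are equally close (sketch assumption)
    return scored[0][1]


def demultiplex(read: str, left_flank: str, right_flank: str,
                barcode_len: int, umi_len: int,
                known: Optional[Sequence[str]] = None,
                max_dist: int = 2) -> Optional[Tuple[str, str]]:
    """Extract (barcode, UMI) from the sequence between the two flanks."""
    start = read.find(left_flank)  # exact flank search: a simplification
    if start == -1:
        return None
    inner_start = start + len(left_flank)
    end = read.find(right_flank, inner_start)
    if end == -1:
        return None
    inner = read[inner_start:end]
    if len(inner) < barcode_len + umi_len:
        return None
    barcode = inner[:barcode_len]
    umi = inner[barcode_len:barcode_len + umi_len]
    if known is not None:
        # Error-correct the barcode only when a known list is supplied.
        barcode = correct_barcode(barcode, known, max_dist)
        if barcode is None:
            return None
    return barcode, umi


# Made-up read: the observed barcode carries one substitution (G read as C).
read = "GAGA" + "CTTCCGATCT" + "ACGTACCTAC" + "GGCCAAGG" + "TTTTTTTT" + "CCCC"
print(demultiplex(read, "CTTCCGATCT", "TTTTTTTT", 10, 8,
                  known=["ACGTACGTAC", "TTGGCCAATT"], max_dist=1))
```

Without a known barcode list the sketch returns the barcode as observed, which corresponds to the discovery setting in panel (G); with a list, the observed barcode is replaced by its nearest known neighbour.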
