Genome Biol. 2023 Apr 6;24(1):66.
doi: 10.1186/s13059-023-02907-y.

Identification of cell barcodes from long-read single-cell RNA-seq with BLAZE

Yupei You et al. Genome Biol. 2023.

Abstract

Long-read single-cell RNA sequencing (scRNA-seq) enables the quantification of RNA isoforms in individual cells. However, long-read scRNA-seq using the Oxford Nanopore platform has largely relied upon matched short-read data to identify cell barcodes. We introduce BLAZE, which accurately and efficiently identifies 10x cell barcodes using only nanopore long-read scRNA-seq data. BLAZE outperforms the existing tools and provides an accurate representation of the cells present in long-read scRNA-seq when compared to matched short reads. BLAZE simplifies long-read scRNA-seq while improving the results, is compatible with downstream tools accepting a cell barcode file, and is available at https://github.com/shimlab/BLAZE .

Conflict of interest statement

Y.Y., Y.D.P., R.D.P., and M.B.C. have received support from Oxford Nanopore Technologies (ONT) to present their findings at scientific conferences. However, ONT played no role in the study design, execution, analysis, or publication.

Figures

Fig. 1
Experimental overview and comparison of identified cell barcodes. A BLAZE workflow. Step 1: locate putative barcodes by first locating the adaptor in each read. Putative barcodes include those originating from different cells and empty droplets. In the schematic, putative barcodes with the same color come from the same original cell/droplet. Black blocks on putative barcodes represent basecalling errors. Step 2: select high-quality putative barcodes. Bases representing sequencing errors tend to have low quality scores. Putative barcodes with minQ < 15 are filtered out (faded in the figure) and the majority of the remaining putative barcodes are expected to have no errors. Step 3: identify cell-associated barcodes. BLAZE counts and ranks unique high-quality putative barcodes and outputs a list of cell-associated barcodes whose counts pass a quantile-based threshold. B Schematic of experimental design. Human induced pluripotent stem cells (hiPSC) undergoing cortical neuronal differentiation were dissociated into a single-cell suspension and processed to generate single-cell full-length cDNA. Full-length cDNA was sequenced using both short and long-read methods and barcode whitelists generated using Cell Ranger, BLAZE, and Sockeye followed by gene and isoform quantification and clustering. Three nanopore sequencing runs were performed on the same cDNA sample: a higher-depth PromethION run, a lower-depth GridION run, and a higher accuracy run using the Q20 protocol on the GridION. C Barcode upset plot comparing the different whitelists. The bar chart on the left shows the total number of barcodes found by each tool. The bar chart on the top shows the number of barcodes in the intersection of whitelists from specific combinations of methods. The dots and lines underneath show the combinations. The colors of the combinations are used to distinguish barcodes in Fig. 1D. D Barcode rank plot. Unique barcodes are ranked based on the counts output by each method and colored by which method(s) included each barcode in their barcode whitelist(s). The colors for different combinations of methods follow those in C, and barcodes not included in any of the whitelists are in gray. Cell Ranger short-read counts, Sockeye long-read counts, and BLAZE long-read counts are shown on the left, middle, and right knee plots, respectively. Sockeye and BLAZE analyze the same dataset. Cell Ranger analyzes counts from a short-read library, deriving from the same original cDNA. Unique barcodes are ranked on the x-axis based on the number of reads/unique molecules observed for each (y-axis). Shifts on the x-axis are intentionally added to make the dots with different colors non-overlapping. Note that these three methods generate counts in different ways so the three plots have different y-axis labels
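Steps 2 and 3 of the workflow in Fig. 1A can be illustrated with a minimal Python sketch. It is not the BLAZE implementation: the function names, the input layout (a list of `(sequence, per-base quality scores)` pairs assumed to come from Step 1), and the nearest-rank quantile rule are all hypothetical choices made for illustration. Only the minQ < 15 filter is taken from the caption.

```python
from collections import Counter

MIN_Q = 15  # from the caption: putative barcodes with minQ < 15 are filtered out


def high_quality(putative, min_q=MIN_Q):
    """Step 2: keep barcodes whose lowest per-base quality is at least min_q.

    `putative` is a list of (sequence, quality_scores) pairs (assumed layout).
    """
    return [seq for seq, quals in putative if min(quals) >= min_q]


def cell_associated(barcodes, quantile=0.5):
    """Step 3: count unique barcodes and keep those passing a count threshold.

    The threshold is the nearest-rank `quantile` of the unique-barcode counts.
    This rule is a hypothetical stand-in; the caption only says the threshold
    is quantile-based.
    """
    counts = Counter(barcodes)
    if not counts:
        return []
    ranked = sorted(counts.values())
    idx = min(len(ranked) - 1, int(quantile * len(ranked)))
    threshold = ranked[idx]
    kept = [bc for bc, n in counts.items() if n >= threshold]
    # Report the kept barcodes from highest to lowest count.
    return sorted(kept, key=lambda bc: -counts[bc])


# Toy input: AAAT carries one low-quality base and is dropped in Step 2.
putative = (
    [("AAAA", [30, 30, 30, 30])] * 5
    + [("CCCC", [30, 30, 30, 30])] * 4
    + [("AAAT", [30, 30, 30, 10])]
    + [("GGGG", [20, 20, 20, 20])]
)
whitelist = cell_associated(high_quality(putative))
```

In this toy input the singleton GGGG passes the quality filter but falls below the count threshold, which mirrors how a low-count barcode from an empty droplet would be excluded.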
Fig. 2
Comparison of cell clusters identified with BLAZE, Cell Ranger, and Sockeye barcodes. Isoform expression UMAP plots from PromethION data. Isoform counts were generated with FLAMES using barcode whitelists from either Cell Ranger, BLAZE, or Sockeye. A Cells in all three plots are colored based on clustering with the Cell Ranger whitelist. Cells not found in the Cell Ranger whitelist are colored in gray. B Cells colored based on UMI counts (sum of all unique UMIs across all transcripts) per cell. C Cells that are empty droplets colored in blue. D Sockeye UMAP colored based on edit distance ≤ 2 or empty droplet
Fig. 3
Gene expression UMAP colored by cluster and expression of marker genes. A UMAP showing clustering based on gene counts generated from FLAMES using the BLAZE whitelist. B UMAP colored by the expression of 4 marker genes known to be associated with differentiation and neuron development. The expression scale is colored based on Seurat normalized counts. Color scales are not comparable between the plots
Fig. 4
Isoform expression UMAP plot from Q20 and GridION data. A Q20. B GridION LSK110. Isoform counts were generated with FLAMES using barcode whitelists from either Cell Ranger, BLAZE, or Sockeye. Cells are colored as per Fig. 2A
Fig. 5
Barcode identification and clustering of Scmixology2 data. A Barcode upset plot comparing different whitelists. The bar chart on the left shows the total number of barcodes found by each tool. The bar chart on the top shows the number of barcodes in the intersection of whitelists from specific combinations of methods. The dots and lines underneath show the combinations. B–E Isoform expression UMAP plots: Isoform counts were generated with FLAMES using a barcode whitelist from either Cell Ranger (left), BLAZE (middle), or Sockeye (right). Cells are colored based on known cell types from Tian et al. [15] (B), total UMIs per cell (C), number of isoforms detected in each cell (D), and cells that are empty droplets (E). F Sockeye UMAP colored based on edit distance ≤ 2 or empty droplet
Fig. 6
UMAP plots from PromethION data with BLAZE high sensitivity (HS) mode. Counts were generated with FLAMES using barcode whitelists from either Cell Ranger, Sockeye, or BLAZE HS. Isoform expression UMAP colored by Cell Ranger clusters. Empty droplets were removed prior to clustering
Fig. 7
Precision-recall curves across A real and B simulated datasets for BLAZE and Sockeye. Precision and recall were calculated across different count thresholds by defining the barcodes identified from short reads as the ground truth: specifically, the whitelist from Cell Ranger after the removal of empty droplets (A), and data simulated from the Cell Ranger whitelist so that it serves as a perfect ground truth (B). The numbers in the legend show the area under the curve (AUC) values
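The calculation described in the Fig. 7 caption can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: the function name, the `barcode -> count` dictionary, and the convention that precision is 1.0 when no barcode is called are all hypothetical. Each count threshold defines a set of called barcodes, which is then compared against the ground-truth whitelist.

```python
def precision_recall_points(counts, truth, thresholds=None):
    """Return (threshold, precision, recall) for each count threshold.

    counts: dict mapping barcode -> count reported by a tool (assumed layout)
    truth:  set of ground-truth barcodes, e.g. a short-read whitelist
    """
    if thresholds is None:
        # Sweep every distinct count, from strictest to most permissive.
        thresholds = sorted(set(counts.values()), reverse=True)
    points = []
    for t in thresholds:
        called = {bc for bc, n in counts.items() if n >= t}
        true_pos = len(called & truth)
        # Convention (an assumption): precision is 1.0 when nothing is called.
        precision = true_pos / len(called) if called else 1.0
        recall = true_pos / len(truth) if truth else 0.0
        points.append((t, precision, recall))
    return points


# Toy example: C is a false positive, D is a true barcode with a low count.
toy_counts = {"A": 10, "B": 8, "C": 3, "D": 1}
toy_truth = {"A", "B", "D"}
toy_points = precision_recall_points(toy_counts, toy_truth)
```

Lowering the threshold raises recall (D is eventually recovered) but admits the false positive C, which is the trade-off the curves summarize.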

References

    1. Han X, Zhou Z, Fei L, Sun H, Wang R, Chen Y, Chen H, Wang J, Tang H, Ge W, et al. Construction of a human cell landscape at single-cell level. Nature. 2020;581:303–309.
    2. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8:14049.
    3. Arzalluz-Luque Á, Conesa A. Single-cell RNAseq for the study of isoforms—how is that possible? Genome Biol. 2018;19:110.
    4. Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsköld D, Hendriks G-J, Larsson AJM, Faridani OR, Sandberg R. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat Biotechnol. 2020;38:708–714.
    5. De Paoli-Iseppi R, Gleeson J, Clark MB. Isoform age - splice isoform profiling using long-read technologies. Front Mol Biosci. 2021;8:711733.
