PLoS Comput Biol. 2025 Dec 1;21(12):e1013692.
doi: 10.1371/journal.pcbi.1013692. eCollection 2025 Dec.

Long-read sequencing transcriptome quantification with lr-kallisto

Rebekah K Loving et al. PLoS Comput Biol.

Abstract

RNA abundance quantification has become routine and affordable thanks to high-throughput "short-read" technologies that provide accurate molecule counts at the gene level. Similarly accurate and affordable quantification of definitive, full-length transcript isoforms has remained a stubborn challenge, despite its obvious biological significance across a wide range of problems. "Long-read" sequencing platforms now produce data types that can, in principle, drive routine definitive isoform quantification. However, some particulars of contemporary long-read data types, together with isoform complexity and genetic variation, present bioinformatic challenges. We show here, using ONT data, that fast and accurate quantification of long-read data is possible and that it is improved by exome capture. To perform quantifications, we developed lr-kallisto, which adapts the kallisto bulk and single-cell RNA-seq quantification methods for long-read technologies.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. lr-kallisto demonstrates high concordance between Illumina and ONT.
(a) Experimental overview for comparison of exome capture vs. non-exome capture LR-Split-seq libraries. (b) Kernel density estimations for read length distributions by capture strategy. (c) Percentage of demultiplexed reads by number of exons in each read for exome and non-exome capture. (d-g) Each point is a hexbin representing the number of transcripts in the bin, with expression in log2(TPM); the x-coordinate is quantified from long reads and the y-coordinate from short reads. The total number of points is the total number of annotated transcripts in the reference transcriptome. The Concordance Correlation Coefficient (CCC) measures how close the data are to the line x = y, while Pearson R and Spearman ρ measure the correlation between x and y. (d) lr-kallisto pseudobulk quantifications of exome capture for the C57BL/6J sample. (e) lr-kallisto pseudobulk quantifications of exome capture for the CAST/EiJ sample. (f) lr-kallisto pseudobulk quantifications of non-exome capture for the C57BL/6J sample. (g) lr-kallisto pseudobulk quantifications of non-exome capture for the CAST/EiJ sample. CCC, Pearson, and Spearman correlations are shown for each comparison. Created with https://BioRender.com
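The distinction the caption draws between CCC and Pearson R can be illustrated with a minimal sketch. It assumes Lin's standard definition of the concordance correlation coefficient with population (1/n) moments; the function names and the toy log2(TPM) values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def _moments(x, y):
    """Return means, population variances, and population covariance of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship, blind to offsets."""
    _, _, var_x, var_y, cov_xy = _moments(x, y)
    return cov_xy / sqrt(var_x * var_y)


def concordance_cc(x, y):
    """Lin's CCC: penalizes any departure from the identity line x = y."""
    mean_x, mean_y, var_x, var_y, cov_xy = _moments(x, y)
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical log2(TPM) values: short reads are offset by exactly +1 from long reads.
long_read = [1.0, 2.0, 3.0, 4.0]
short_read = [2.0, 3.0, 4.0, 5.0]

print(pearson_r(long_read, short_read))       # perfectly linear relationship
print(concordance_cc(long_read, short_read))  # lower, because of the constant offset
```

In this toy case the two vectors are perfectly correlated but systematically shifted, so Pearson R stays at its maximum while CCC drops; this is why CCC is the stricter measure of agreement between platforms.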
Fig 2. Comparison of Bambu, IsoQuant, lr-kallisto, and Oarfish in (a) abundance estimates as measured by CCC of expression and (b) variability between isoforms as measured by CCC of isoform CV², with 90% CI to measure consistency and reproducibility among replicates between the tools.
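As a rough illustration of the variability measure named in this caption, the squared coefficient of variation of an isoform across replicates can be sketched as follows. This assumes the usual definition (variance divided by squared mean) with the sample (n - 1) variance; the caption does not state which estimator the authors used, and the isoform names and values below are hypothetical.

```python
def cv_squared(values):
    """Squared coefficient of variation: sample variance divided by squared mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = sum(values) / n
    if mean == 0:
        raise ValueError("CV^2 is undefined when the mean is zero")
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return variance / mean ** 2


# Hypothetical abundances for two isoforms across three replicates.
replicate_abundance = {
    "isoform_A": [10.0, 10.0, 10.0],  # identical across replicates
    "isoform_B": [5.0, 10.0, 15.0],   # variable across replicates
}

isoform_cv2 = {name: cv_squared(vals) for name, vals in replicate_abundance.items()}
print(isoform_cv2)
```

Per-isoform values of this kind, computed for each tool, are what a concordance measure such as CCC would then compare.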
Fig 3. lr-kallisto is highly accurate in simulations with error up to ∼3%.
A comparison of performance of Bambu, IsoQuant, lr-kallisto, and Oarfish on PacBio (top) and ONT (bottom) simulations with Concordance Correlation Coefficient (CCC), Normalized Root Mean Squared Error, and Pearson’s and Spearman’s correlation coefficients reported.
Fig 4. Overview of biosample to lr-kallisto pipeline for long read RNA sequencing.
To study the complexity of life, we can study the genome, transcriptome, and proteome. Through long read sequencing, we can achieve greater insight into both the workings of the genome and the proteome at the individual level and even the functionality of RNA as a molecule. Therefore, improving our ability to analyze long read RNA sequences increases our understanding of biology itself. 1. RNA is extracted from cells and tissues in a single-cell, single-nucleus, or bulk preparation, creating an RNA sequencing library. 2. The RNA sequencing library is then sequenced with either PacBio or Oxford Nanopore sequencing (Nanopore illustration shown). 3. The raw electrical signal from the nanopore or the raw fluorescent signal from PacBio is then basecalled to create the raw RNA sequencing reads. 4. The raw reads are input to lr-kallisto, which outputs both the transcriptome quantification of the tissue, single cells, or nuclei and the pseudobam alignments for the reads. 5. The outputs of lr-kallisto are analyzed and visualized: single-cell or bulk transcript and gene count matrices and the pseudobam (pseudoalignments output in BAM format). Created with https://BioRender.com
Fig 5. Overview of lr-kallisto pseudoalignment algorithm.
The input consists of a reference transcriptome and reads from a long read RNA sequencing experiment. (A) An example of two reads (blue and green, with unmapped regions in black and erroneously mapped regions in purple) and three overlapping transcripts (pink, blue, and green). (B) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG), in which nodes are k-mers and each transcript corresponds to a colored path as shown; the path cover of the transcriptome induces a transcript compatibility class (TCC) for each k-mer. (C) Conceptually, the k-mers of a read are hashed (black nodes) to find the TCC of the read. (D) The TCC of the read is determined by taking the intersection of the TCCs of its constituent k-mers if that intersection is non-empty; otherwise, the mode of the TCCs of the k-mers of the read is taken. Created with https://BioRender.com
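The rule in panel (D) can be sketched with a toy example. This is a simplified illustration under stated assumptions, not the lr-kallisto implementation: it replaces the T-DBG with a plain dictionary from k-mer to transcript set, ignores reverse complements and any k-mer skipping, uses a tiny k, and breaks ties in the mode by first occurrence. All names and sequences are hypothetical.

```python
from collections import Counter


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it (its TCC)."""
    kmer_to_transcripts = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            kmer_to_transcripts.setdefault(seq[i:i + k], set()).add(name)
    return {kmer: frozenset(names) for kmer, names in kmer_to_transcripts.items()}


def read_tcc(read, index, k):
    """Intersect the TCCs of the read's indexed k-mers; fall back to their mode."""
    tccs = [index[read[i:i + k]]
            for i in range(len(read) - k + 1)
            if read[i:i + k] in index]
    if not tccs:
        return frozenset()  # no k-mer of the read occurs in the index
    intersection = frozenset.intersection(*tccs)
    if intersection:
        return intersection
    # Empty intersection (e.g. caused by a sequencing error): take the most frequent TCC.
    return Counter(tccs).most_common(1)[0][0]


# Toy transcriptome with k = 3 (real indexes use much larger k).
transcripts = {"t1": "ACGTAC", "t2": "ACGTTT", "t3": "GGGCCC"}
index = build_index(transcripts, k=3)

print(sorted(read_tcc("ACGTA", index, 3)))      # k-mers agree on a single transcript
print(sorted(read_tcc("ACGTACGGG", index, 3)))  # conflicting k-mers, mode is used
```

In the second read, the trailing k-mer belongs only to t3, so the intersection is empty and the most frequent class among the read's k-mers is reported instead.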
