Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data

Dongze He¹, Mohsen Zakeri², Hirak Sarkar³, Charlotte Soneson⁴,⁵, Avi Srivastava⁶, Rob Patro⁷

Affiliations

¹ Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
² Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
³ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
⁴ Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland.
⁵ SIB Swiss Institute of Bioinformatics, Basel, Switzerland.
⁶ New York Genome Center, New York City, NY, USA.
⁷ Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA. rob@cs.umd.edu.

PMID: 35277707
PMCID: PMC8933848
DOI: 10.1038/s41592-022-01408-3

Dongze He et al. Nat Methods. 2022 Mar;19(3):316-322. Epub 2022 Mar 11.

Abstract

The rapid growth of high-throughput single-cell and single-nucleus RNA-sequencing (scRNA-seq and snRNA-seq) technologies has produced a wealth of data over the past few years. The size, volume and distinctive characteristics of these data necessitate the development of new computational methods to accurately and efficiently quantify sc/snRNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying sc/snRNA-seq data. In addition to being faster and more memory frugal than other accurate quantification approaches, alevin-fry ameliorates the memory scalability and false-positive expression issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify sc/snRNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate normal gene expression count matrices.

Conflict of interest statement

RP is a co-founder of Ocean Genomics, Inc. The remaining authors declare no competing interests.

Figures

**Fig. 1. Overview of the alevin-fry pipeline** (operating in unspliced, spliced, ambiguous (USA) quantification mode).
The arrows highlight the flow of data through the pipeline, whose output is a matrix specifying the expected counts of each of the considered splicing states of each gene within each quantified cell.

**Fig. 2. Comprehensive analysis of the performance of alevin-fry on real and simulated datasets.**
**(a)** The frequency distribution of the presence of genes across all shared cells for STARsolo, kallisto|bustools, and alevin-fry (including multiple index types for alevin-fry) on the simulated data. Differently colored lines represent the quantification methods. Within the variants of alevin-fry, txome stands for transcriptome reference (i.e., indexing only the annotated, spliced transcriptome), while sketch (pseudoalignment with structural constraints) and sla (selective alignment) label the mapping method used to obtain the result. Because the distributions are so similar, the line for STARsolo is occluded by the line for alevin-fry(splici, sla). **(b)** A visualization of the velocity estimation derived from alevin-fry counts in a UMAP-based embedding after assigning all ambiguous counts as spliced; the streamlines represent the direction of RNA velocity estimated by scVelo. Points (cells) are colored according to the cell-type annotation. **(c)** The t-SNE embedding plot of an alevin-fry processed mouse placenta snRNA-seq dataset. The color of each nucleus represents the inferred cell-type annotation, which was learned from a reference dataset. **(d) and (e)** show the running time and peak memory usage for all tools (run with 16 threads) on the different datasets evaluated in this manuscript. The x-axis of (d) and (e) represents the evaluated datasets. The y-axis of (d) represents the run time of each tool, measured in seconds. The y-axis of (e) denotes the peak memory usage, measured as the maximum resident set size (max rss), during the execution of each tool. Dashed horizontal lines in (d) denote 15, 30, 60 and 90 minutes, respectively. Dashed horizontal lines in (e) denote 4 GB, 8 GB, 16 GB and 32 GB, respectively.
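
The collapsing step mentioned for panel (b), assigning all ambiguous counts as spliced before velocity estimation, can be illustrated with a minimal sketch. The in-memory layout used here (a dictionary with hypothetical keys `"U"`, `"S"` and `"A"`, each holding a cells-by-genes list of lists) and the function names are assumptions made for illustration only; they are not alevin-fry's actual output format or API.

```python
from typing import Dict, List

# Hypothetical layout: rows are cells, columns are genes.
Matrix = List[List[float]]


def add_matrices(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices with identical shapes."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def velocity_layers(usa: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Assign every ambiguous (A) count to the spliced layer.

    The unspliced (U) layer is copied unchanged. The keys "U", "S" and "A"
    are assumed names for the three splicing states, not a real file schema.
    """
    return {
        "spliced": add_matrices(usa["S"], usa["A"]),
        "unspliced": [row[:] for row in usa["U"]],
    }


# Toy example: 2 cells x 2 genes with made-up counts.
usa_counts = {
    "U": [[1.0, 0.0], [2.0, 3.0]],
    "S": [[4.0, 1.0], [0.0, 5.0]],
    "A": [[0.5, 2.0], [1.0, 0.0]],
}
layers = velocity_layers(usa_counts)
print(layers["spliced"])
print(layers["unspliced"])
```

The two resulting layers correspond to the spliced and unspliced inputs that an RNA velocity tool expects; other ways of apportioning the ambiguous counts are possible and would only change the body of `velocity_layers`.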
