Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Feb 11;20(1):31.
doi: 10.1186/s13059-019-1639-x.

CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing

Affiliations

CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing

Kenta Sato et al. Genome Biol. .

Abstract

Recent technical improvements in single-cell RNA sequencing (scRNA-seq) have enabled massively parallel profiling of transcriptomes, thereby promoting large-scale studies encompassing a wide range of cell types of multicellular organisms. With this background, we propose CellFishing.jl, a new method for searching atlas-scale datasets for similar cells and detecting noteworthy genes of query cells with high accuracy and throughput. Using multiple scRNA-seq datasets, we validate that our method demonstrates comparable accuracy to and is markedly faster than the state-of-the-art software. Moreover, CellFishing.jl is scalable to more than one million cells, and the throughput of the search is approximately 1600 cells per second.

Keywords: Cell searching; Cell typing; Locality-sensitive hashing; scRNA-seq.

PubMed Disclaimer

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Fig. 1
Schematic workflow of CellFishing.jl. CellFishing.jl first builds a database (DB) object that stores data preprocessors, indexed bit vectors, and cell metadata, if provided. The metadata can store any information including cell names, cell types, and transcript expressions of marker genes. When building a database, the DGE matrix of reference cells is preprocessed to extract important signals and then hashed into bit vectors by LSH. The preprocessors and the indexed bit vectors are stored in the database object. M, D, and T on the left side of the figure refer to the number of genes, number of reduced dimensions, and length of the bit vectors, respectively. N and N above the two DGE matrices represent the number of cells within the reference and query data, respectively. While searching the database for similar cells, the prebuilt preprocessors stored in the database are reused in a similar workflow that is involved in database building up to the hashing phase. The database object can be saved onto a disk and can be loaded from there
Fig. 2
Fig. 2
Benchmarks of randomized SVD. a Elapsed time of different SVD algorithms. The blue, orange, and green points indicate the elapsed time of the full, truncated, and randomized SVD, respectively. b Relative errors of the randomized SVD. The error bars denote the standard deviation of ten trials. The relative error of the ith largest singular value σi is defined as 1σ^iσi, where σ^i is an approximated value of σi. The error bars denote the standard deviation of ten trials. The approximation error for a real matrix A with a low-rank matrix is bounded by a singular value as illustrated in the following formula: minrank(X)≤j||AX||=σj+1, where ||·|| denotes the operator norm of a matrix
Fig. 3
Fig. 3
Locality-sensitive hashing of expression profiles (Shekhar2016). a Distributions of estimation errors for angles. The blue lines show the distributions without orthogonalization, and the orange lines show the distributions with orthogonalization. Five hash values were generated independently, and their estimation errors in radian were computed for 100 cells randomly sampled from Shekhar2016. b Scatter plots of the Hamming distance versus the cosine distance. The rows are four different bit lengths (64, 128, 256, and 512 bits) and the columns are six cells randomly sampled from Shekhar2016. The Hamming distance is normalized to [0,1] for comparison across different bit lengths, and the cosine distance is truncated at 0.3 for brevity. c Two-dimensional embedding of expression profiles with UMAP. The upper-left plot was derived from the cosine distances following dimensionality reduction. The other three plots were derived from the Hamming distances after hashing with 64, 128, and 256 bits. Colors indicate the annotations (18 cell types and doublets/contaminants) of Shekhar2016
Fig. 4
Fig. 4
Comparison between CellFishing.jl and scmap-cell by self-mapping. a Consistency scores. b Cohen’s kappa scores. c Index times. d Query times. Each data point corresponds to a fivefold cross-validation. The n-bits and n-lshashes parameter of CellFishing.jl are the number of bits and the number of independent locality-sensitive hashes, respectively (n-bits=128 and n-lshashes=4 are the defaults). The n-centroids-factor and n-features parameter of scmap-cell are the multiplier of the default number of centroids (i.e., N, where N refers to the number of reference cells) parameter and the number of features parameter, respectively (n-centroids-factor=1 and n-features=500 correspond to the defaults)
Fig. 5
Fig. 5
Cluster-specific consistency scores. ad Cluster-specific consistency scores of Baron2016 (human), Shekhar2016, Plass2018, and TabulaMuris (Chromium). The blue and orange points denote the consistency scores of CellFishing.jl and scmap-cell, respectively. Cluster labels are ordered by the cluster size, from top to bottom and then left to right in decreasing order
Fig. 6
Fig. 6
Proportions of cluster assignments for CellFishing.jl and scmap-cell. The rows are labels of query cells and the columns are labels of their nearest neighbor. Cluster labels are ordered by the cluster size, from top to bottom (rows) or left to right (columns) in decreasing order. The values are derived from a cross-validation with the default parameters and normalized row-wise (i.e., the sum of values in a row is 1)
Fig. 7
Fig. 7
Distributions of similarities with or without a specific cell type. a Seven cell types from Shekhar2016. b Seven cell types from TabulaMuris. The x-axis (rank) refers to the rank of the nearest neighbors (e.g., rank=1 refers to the nearest cell of a query), and the y-axis (similarity) refers to the cosine similarity estimated from the Hamming distance. The blue or orange regions denote the distribution of cosine similarities with or without the cell type shown in the title, respectively. The lower dotted lines, dashed lines, and upper dotted lines denote the first, second, and third quartiles, respectively
Fig. 8
Fig. 8
Cell mapping across batches. a Consistency scores of cell mapping (Shekhar2016 and Plass2018). The red lines denote the mean score of the corresponding self-mapping experiment with the same parameters. b Distribution of cluster sizes (Shekhar2016)
Fig. 9
Fig. 9
Cluster assignments across different protocols (TabulaMuris). The rows are the query’s cluster labels (Smart-Seq2) and the columns are its neighbor’s cluster labels (Chromium)
Fig. 10
Fig. 10
Computational costs of different phases. The upper and lower plots show the costs of the linear and index searches, respectively. The elapsed time of each phase was measured using the time_ns function and divided by the number of query cells to compute the average time per cell
Fig. 11
Fig. 11
Data structure indexing subvectors. Given a subvector of the query bit vector, the subindex calculates the locations of the bit vectors in the database that contains the subvector in the same position. The subindex consists of three arrays: filled, offsets, and buckets. The filled array is a bit vector of length 2s, where s is the length of indexed subvectors (in this figure, T=12 and s=4), and supports bit counting in a specific range with a constant time, which is used to calculate the location of the offsets array for a given subvector (highlighted four bits of the query). The offsets and buckets arrays are jointly used to obtain the locations of the database array at which bit vectors with a given subvector are stored

References

    1. Islam S, Kjällquist U, Moliner A, Zajac P, Fan JB, Lönnerberg P, Linnarsson S. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011; 21(7):1160–7. 10.1101/gr.110882.110. - PMC - PubMed
    1. Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification. Cell Rep. 2012; 2(3):666–73. 10.1016/j.celrep.2012.08.003. - PubMed
    1. Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012; 9(1):72–4. 10.1038/nmeth.1778. - PubMed
    1. Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lönnerberg P, Linnarsson S. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014; 11(2):163–6. 10.1038/nmeth.2772. - PubMed
    1. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161(5):1187–201. 10.1016/j.cell.2015.04.044. - PMC - PubMed

Publication types

LinkOut - more resources