. 2019 Feb 11;20(1):31.

doi: 10.1186/s13059-019-1639-x.

CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing

Kenta Sato^{1

2}, Koki Tsuyuzaki³, Kentaro Shimizu⁴, Itoshi Nikaido^{5

6}

Affiliations

¹ Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo, 113-8657, Japan. bicycle1885@gmail.com.
² Laboratory for Bioinformatics Research RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan. bicycle1885@gmail.com.
³ Laboratory for Bioinformatics Research RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan.
⁴ Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo, 113-8657, Japan.
⁵ Laboratory for Bioinformatics Research RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan. itoshi.nikaido@riken.jp.
⁶ Bioinformatics Course, Master's/Doctoral Program in Life Science Innovation (T-LSI), School of Integrative and Global Majors (SIGMA), University of Tsukuba, Wako, Saitama, 351-0198, Japan. itoshi.nikaido@riken.jp.

PMID: 30744683
PMCID: PMC6371477
DOI: 10.1186/s13059-019-1639-x

CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing

Kenta Sato et al. Genome Biol. 2019.

. 2019 Feb 11;20(1):31.

doi: 10.1186/s13059-019-1639-x.

Authors

Kenta Sato^{1

2}, Koki Tsuyuzaki³, Kentaro Shimizu⁴, Itoshi Nikaido^{5

6}

Affiliations

¹ Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo, 113-8657, Japan. bicycle1885@gmail.com.
² Laboratory for Bioinformatics Research RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan. bicycle1885@gmail.com.
³ Laboratory for Bioinformatics Research RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan.
⁴ Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo, 113-8657, Japan.
⁵ Laboratory for Bioinformatics Research RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, 351-0198, Japan. itoshi.nikaido@riken.jp.
⁶ Bioinformatics Course, Master's/Doctoral Program in Life Science Innovation (T-LSI), School of Integrative and Global Majors (SIGMA), University of Tsukuba, Wako, Saitama, 351-0198, Japan. itoshi.nikaido@riken.jp.

PMID: 30744683
PMCID: PMC6371477
DOI: 10.1186/s13059-019-1639-x

Abstract

Recent technical improvements in single-cell RNA sequencing (scRNA-seq) have enabled massively parallel profiling of transcriptomes, thereby promoting large-scale studies encompassing a wide range of cell types of multicellular organisms. With this background, we propose CellFishing.jl, a new method for searching atlas-scale datasets for similar cells and detecting noteworthy genes of query cells with high accuracy and throughput. Using multiple scRNA-seq datasets, we validate that our method demonstrates comparable accuracy to and is markedly faster than the state-of-the-art software. Moreover, CellFishing.jl is scalable to more than one million cells, and the throughput of the search is approximately 1600 cells per second.

Keywords: Cell searching; Cell typing; Locality-sensitive hashing; scRNA-seq.

PubMed Disclaimer

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Schematic workflow of CellFishing.jl. CellFishing.jl first builds a database (DB) object that stores data preprocessors, indexed bit vectors, and cell metadata, if provided. The metadata can store any information including cell names, cell types, and transcript expressions of marker genes. When building a database, the DGE matrix of reference cells is preprocessed to extract important signals and then hashed into bit vectors by LSH. The preprocessors and the indexed bit vectors are stored in the database object. M, D, and T on the left side of the figure refer to the number of genes, number of reduced dimensions, and length of the bit vectors, respectively. N and N^′ above the two DGE matrices represent the number of cells within the reference and query data, respectively. While searching the database for similar cells, the prebuilt preprocessors stored in the database are reused in a similar workflow that is involved in database building up to the hashing phase. The database object can be saved onto a disk and can be loaded from there

**Fig. 2**
Benchmarks of randomized SVD. a Elapsed time of different SVD algorithms. The *blue*, *orange*, and *green points* indicate the elapsed time of the full, truncated, and randomized SVD, respectively. b Relative errors of the randomized SVD. The *error bars* denote the standard deviation of ten trials. The relative error of the ith largest singular value σ_i is defined as $|1 - \frac{{\hat{σ}}_{i}}{σ_{i}}|$ , where ${\hat{σ}}_{i}$ is an approximated value of σ_i. The error bars denote the standard deviation of ten trials. The approximation error for a real matrix A with a low-rank matrix is bounded by a singular value as illustrated in the following formula: minrank(X)≤j||A−X||=σ_j+1, where ||·|| denotes the operator norm of a matrix

**Fig. 3**
Locality-sensitive hashing of expression profiles (Shekhar2016). a Distributions of estimation errors for angles. The *blue lines* show the distributions without orthogonalization, and the *orange lines* show the distributions with orthogonalization. Five hash values were generated independently, and their estimation errors in radian were computed for 100 cells randomly sampled from Shekhar2016. b Scatter plots of the Hamming distance versus the cosine distance. The rows are four different bit lengths (64, 128, 256, and 512 bits) and the columns are six cells randomly sampled from Shekhar2016. The Hamming distance is normalized to [0,1] for comparison across different bit lengths, and the cosine distance is truncated at 0.3 for brevity. c Two-dimensional embedding of expression profiles with UMAP. The upper-left plot was derived from the cosine distances following dimensionality reduction. The other three plots were derived from the Hamming distances after hashing with 64, 128, and 256 bits. Colors indicate the annotations (18 cell types and doublets/contaminants) of Shekhar2016

**Fig. 4**
Comparison between CellFishing.jl and scmap-cell by self-mapping. a Consistency scores. b Cohen’s kappa scores. c Index times. d Query times. Each data point corresponds to a fivefold cross-validation. The n-bits and n-lshashes parameter of CellFishing.jl are the number of bits and the number of independent locality-sensitive hashes, respectively (n-bits=128 and n-lshashes=4 are the defaults). The n-centroids-factor and n-features parameter of scmap-cell are the multiplier of the default number of centroids (i.e., $⌊ \sqrt{N} ⌋$ , where N refers to the number of reference cells) parameter and the number of features parameter, respectively (n-centroids-factor=1 and n-features=500 correspond to the defaults)

**Fig. 5**
Cluster-specific consistency scores. a–d Cluster-specific consistency scores of Baron2016 (human), Shekhar2016, Plass2018, and TabulaMuris (Chromium). The *blue* and *orange points* denote the consistency scores of CellFishing.jl and scmap-cell, respectively. Cluster labels are ordered by the cluster size, from top to bottom and then left to right in decreasing order

**Fig. 6**
Proportions of cluster assignments for CellFishing.jl and scmap-cell. The rows are labels of query cells and the columns are labels of their nearest neighbor. Cluster labels are ordered by the cluster size, from top to bottom (rows) or left to right (columns) in decreasing order. The values are derived from a cross-validation with the default parameters and normalized row-wise (i.e., the sum of values in a row is 1)

**Fig. 7**
Distributions of similarities with or without a specific cell type. a Seven cell types from Shekhar2016. b Seven cell types from TabulaMuris. The x-axis (*rank*) refers to the rank of the nearest neighbors (e.g., rank=1 refers to the nearest cell of a query), and the y-axis (*similarity*) refers to the cosine similarity estimated from the Hamming distance. The *blue* or *orange* regions denote the distribution of cosine similarities with or without the cell type shown in the title, respectively. The *lower dotted lines*, *dashed lines*, and *upper dotted lines* denote the first, second, and third quartiles, respectively

**Fig. 8**
Cell mapping across batches. a Consistency scores of cell mapping (Shekhar2016 and Plass2018). The *red lines* denote the mean score of the corresponding self-mapping experiment with the same parameters. b Distribution of cluster sizes (Shekhar2016)

**Fig. 9**
Cluster assignments across different protocols (TabulaMuris). The rows are the query’s cluster labels (Smart-Seq2) and the columns are its neighbor’s cluster labels (Chromium)

**Fig. 10**
Computational costs of different phases. The upper and lower plots show the costs of the linear and index searches, respectively. The elapsed time of each phase was measured using the time_ns function and divided by the number of query cells to compute the average time per cell

**Fig. 11**
Data structure indexing subvectors. Given a subvector of the query bit vector, the subindex calculates the locations of the bit vectors in the database that contains the subvector in the same position. The subindex consists of three arrays: *filled*, *offsets*, and *buckets*. The *filled* array is a bit vector of length 2^s, where s is the length of indexed subvectors (in this figure, T=12 and s=4), and supports bit counting in a specific range with a constant time, which is used to calculate the location of the *offsets* array for a given subvector (highlighted four bits of the query). The *offsets* and *buckets* arrays are jointly used to obtain the locations of the *database* array at which bit vectors with a given subvector are stored

See this image and copyright information in PMC

References

1. Islam S, Kjällquist U, Moliner A, Zajac P, Fan JB, Lönnerberg P, Linnarsson S. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res. 2011; 21(7):1160–7. 10.1101/gr.110882.110. - PMC - PubMed
1. Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: Single-Cell RNA-Seq by Multiplexed Linear Amplification. Cell Rep. 2012; 2(3):666–73. 10.1016/j.celrep.2012.08.003. - PubMed
1. Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S, Taipale J. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods. 2012; 9(1):72–4. 10.1038/nmeth.1778. - PubMed
1. Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lönnerberg P, Linnarsson S. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat Methods. 2014; 11(2):163–6. 10.1038/nmeth.2772. - PubMed
1. Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015; 161(5):1187–201. 10.1016/j.cell.2015.04.044. - PMC - PubMed

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing

Affiliations

CellFishing.jl: an ultrafast and scalable cell search method for single-cell RNA sequencing

Authors

Affiliations

Abstract

Conflict of interest statement

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources