Nucleic Acids Res. 2018 Jul 2;46(W1):W141-W147.
doi: 10.1093/nar/gky421.

CellAtlasSearch: a scalable search engine for single cells

Divyanshu Srivastava et al. Nucleic Acids Res. 2018.

Abstract

Owing to the advent of high-throughput single-cell transcriptomics, the past few years have seen exponential growth in the production of gene expression data. Recently, various research groups have made efforts to homogenize and store single-cell expression data from a large number of studies. The true value of this ever-increasing data deluge can be unlocked by making it searchable. To this end, we propose CellAtlasSearch, a novel search architecture for high-dimensional expression data, which is massively parallel as well as lightweight, and thus infinitely scalable. In CellAtlasSearch, we use a Graphics Processing Unit (GPU) friendly version of Locality Sensitive Hashing (LSH) for unmatched speedup in data processing and querying. Currently, CellAtlasSearch features over 300 000 reference expression profiles, including both bulk and single-cell data. It enables the user to query individual single-cell transcriptomes and finds matching samples from the database along with the necessary meta-information. CellAtlasSearch aims to assist researchers and clinicians in characterizing unannotated single cells. It also facilitates noise-free, low-dimensional representation of single-cell expression profiles by projecting them on a wide variety of reference samples. The web server is accessible at: http://www.cellatlassearch.com.

Figures

Figure 1.
CellAtlasSearch pipeline. The entire web server is based on a GPU framework. Expression profiles are stored as hash codes obtained through LSH. Similar samples are archived in the same bucket. Query expression data are first converted into a hash code and then mapped to one of the buckets. The user can query one or more single-cell transcriptomes.
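The hash-and-bucket flow described in this caption can be illustrated with a minimal CPU sketch. It assumes a random-hyperplane (sign of random projection) LSH family, which this record does not confirm as the variant used by CellAtlasSearch; the function names, the toy profiles and the 8-bit code length are all hypothetical, and the GPU parallelization is not reproduced.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, n_genes, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_bits)]


def hash_code(profile, hyperplanes):
    """Pack one bit per hyperplane: 1 if the profile is on its non-negative side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, profile))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def build_index(references, hyperplanes):
    """Archive each reference profile in the bucket named by its hash code."""
    buckets = defaultdict(list)
    for name, profile in references.items():
        buckets[hash_code(profile, hyperplanes)].append(name)
    return buckets


def query(profile, buckets, hyperplanes):
    """Hash the query profile and return the references sharing its bucket."""
    return buckets.get(hash_code(profile, hyperplanes), [])


# Toy expression profiles over four genes (hypothetical data).
references = {
    "cell_a": [5.0, 0.0, 1.0, 3.0],
    "cell_b": [0.0, 4.0, 2.0, 0.0],
}
planes = make_hyperplanes(n_bits=8, n_genes=4)
buckets = build_index(references, planes)

# A rescaled copy of cell_a points in the same direction, so it hashes
# to the same bucket and cell_a is among the returned hits.
hits = query([2.0 * x for x in references["cell_a"]], buckets, planes)
print(hits)
```

Because each bit depends only on the sign of a projection, this family groups profiles by direction rather than magnitude, which fits the cosine similarity used elsewhere on the results page.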
Figure 2.
CellAtlasSearch web application interface. (A) Query submission form, where the user enters a database preference, uploads the query file and submits the processing request. (B) A custom URL is generated for the result page, even before the results are compiled. The user can bookmark it for future reference. (C) Result page, showing the top hits in tabular form, with the necessary meta-information. (D) The interactive summary shows a graphical view of the frequently occurring descriptions (or phenotypes) corresponding to each query transcriptome. Each big circle represents a query cell, whereas the small ones represent the corresponding frequently occurring descriptions (or phenotypes). The descriptions are displayed when the cursor hovers over the bubbles. (E) A heat map of cosine similarity values between pairs of query cells and reference samples. (F) Spectral-tSNE plot of the query cells made using cosine similarities as feature variables. Elements (E) and (F) are produced when the query has at least 5 samples.
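The cosine similarity values behind panels (E) and (F) can be sketched as follows. This is only an illustration of the standard cosine similarity formula applied to query and reference profiles; the function names and toy vectors are hypothetical, and the server's actual preprocessing is not described in this record.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two expression profiles (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(queries, references):
    """Rows are query cells, columns are reference samples."""
    return [[cosine_similarity(q, r) for r in references] for q in queries]


# Toy profiles over three genes (hypothetical data).
queries = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
references = [[2.0, 0.0, 4.0], [0.0, 1.0, 0.0]]

matrix = similarity_matrix(queries, references)
for row in matrix:
    print(["%.3f" % value for value in row])
```

Each row of this matrix is the kind of feature vector a query cell would contribute to a plot like panel (F), while the full matrix corresponds to the heat map in panel (E).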
Figure 3.
Speed-up and accuracy analysis of CellAtlasSearch. (A) Time taken to generate hash codes with varying sample sizes. The blue curve shows the time taken by our parallelized implementation, which is approximately 10 times less than that of the CPU-based serialized version. (B) Gene dropout analysis for randomly chosen cells from the dataset. A match was considered accurate when CellAtlasSearch recovered the exact same cell within the top five results. (C) Accuracy of CellAtlasSearch in finding samples of the same source cell line from independent studies, upon submission of query scRNA-seq data of the HCT116 cell line produced by one research group. A retrieval is deemed successful if expression data of the same cell line, contributed by an independent group, appears within the top five hits. (D) Similar analysis for the HEK293T cell line.
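The top-five recovery criterion of panel (B) can be mimicked with a small simulation. This sketch makes several assumptions that are not taken from the paper: dropout is modeled by zeroing a random fraction of genes, search is an exhaustive cosine ranking standing in for the LSH lookup, and the reference profiles are random toy data. All names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def drop_genes(profile, fraction, rng):
    """Zero out a random subset of genes to mimic dropout (assumed noise model)."""
    n_drop = int(round(fraction * len(profile)))
    dropped = set(rng.sample(range(len(profile)), n_drop))
    return [0.0 if i in dropped else x for i, x in enumerate(profile)]


def top_k(query, references, k=5):
    """Names of the k references most similar to the query (exhaustive search)."""
    ranked = sorted(references, key=lambda name: cosine(query, references[name]), reverse=True)
    return ranked[:k]


def recovery_rate(references, fraction, k=5, seed=0):
    """Fraction of cells recovered within the top k after simulated dropout."""
    rng = random.Random(seed)
    hits = 0
    for name, profile in references.items():
        noisy = drop_genes(profile, fraction, rng)
        if name in top_k(noisy, references, k):
            hits += 1
    return hits / len(references)


# Twenty random toy profiles over fifty genes (hypothetical data).
data_rng = random.Random(1)
references = {
    "cell_%d" % i: [data_rng.uniform(0.1, 10.0) for _ in range(50)]
    for i in range(20)
}

for fraction in (0.0, 0.3, 0.6):
    print(fraction, recovery_rate(references, fraction))
```

Sweeping the dropout fraction in this way gives an accuracy curve analogous in spirit to panel (B), although the numbers from random toy data say nothing about the real server.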
Figure 4.
Dealing with batch effects. (A) Spectral tSNE-based visualization of GM12878 and H1 cells using log-transformed scaled gene counts. For spectral tSNE, the top 10 principal components of the gene-count matrix were used. For both cell lines, batches are observed to form separate clusters. (B) Visualization based on cosine similarities between queries (GM12878 and H1 cells) and matching references from the single-cell dataset, as returned by CellAtlasSearch. Here, the top 10 principal components of the cosine similarity matrix were used with tSNE. Cells from different batches of the same cell type tend to come closer to each other, yet they do not intermix. (C) Batches of the same cell type become intermixed when the projection for cosine similarity calculation is done using only GTEx-selected features in the transcriptomes.
