[Preprint]. 2023 Jan 11:2023.01.10.523448.
doi: 10.1101/2023.01.10.523448.

Deconfounded Dimension Reduction via Partial Embeddings

Andrew A Chen et al. bioRxiv.

Abstract

Dimension reduction tools preserving similarity and graph structure such as t-SNE and UMAP can capture complex biological patterns in high-dimensional data. However, these tools typically are not designed to separate effects of interest from unwanted effects due to confounders. We introduce the partial embedding (PARE) framework, which enables removal of confounders from any distance-based dimension reduction method. We then develop partial t-SNE and partial UMAP and apply these methods to genomic and neuroimaging data. Our results show that the PARE framework can remove batch effects in single-cell sequencing data as well as separate clinical and technical variability in neuroimaging measures. We demonstrate that the PARE framework extends dimension reduction methods to highlight biological patterns of interest while effectively removing confounding effects.

Keywords: Dimension reduction; confounding effects; embeddings; genomics; neuroimaging.
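The abstract describes removing confounders from any distance-based dimension reduction method but does not spell out the procedure. The sketch below is one plausible reading, not the authors' implementation: it assumes the distance matrix is first converted to principal coordinates (classical multidimensional scaling), that confounders are removed from those coordinates by ordinary least-squares regression, and that distances are then recomputed from the residuals. The function names `principal_coordinates`, `residualize`, and `partial_distances`, as well as the simulated data, are hypothetical.

```python
import numpy as np

def principal_coordinates(D, n_components):
    """Classical MDS: coordinates whose Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    top = np.clip(eigvals[order], 0.0, None)       # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)

def residualize(Y, Z):
    """Subtract the least-squares linear fit of confounders Z from each column of Y."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])     # intercept plus confounders
    beta, *_ = np.linalg.lstsq(Z1, Y, rcond=None)
    return Y - Z1 @ beta

def partial_distances(D, Z, n_components):
    """Assumed pipeline: distances -> coordinates -> residuals -> distances."""
    Y_adj = residualize(principal_coordinates(D, n_components), Z)
    diff = Y_adj[:, None, :] - Y_adj[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Simulated example: 40 samples in two batches, with a batch shift on one feature.
rng = np.random.default_rng(0)
batch = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 5))
X[:, 0] += 5.0 * batch

diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
D_adj = partial_distances(D, batch, n_components=5)

print("mean between-batch distance before:", D[:20, 20:].mean())
print("mean between-batch distance after: ", D_adj[:20, 20:].mean())
```

Under these assumptions, the adjusted matrix `D_adj` could be handed to any method that accepts precomputed distances, for example scikit-learn's `TSNE(metric="precomputed", init="random")`.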

Figures

Figure 1. Embeddings and partial embeddings of single-cell RNA-sequencing measurements from 13,369 human pancreatic cells across four studies.
The original count data are log-normalized and reduced to 2,000 highly variable genes. The local inverse Simpson's index (LISI) is computed for each cell with respect to batch (bLISI) and cell type (cLISI); the median, 2.5% quantile, and 97.5% quantile are shown. Higher bLISI indicates greater integration across batches, and lower cLISI indicates greater separation between cell types. Partial t-SNE (p-t-SNE) and partial UMAP (p-UMAP) adjust for either batch or donor effects. We compare our new methodology to the existing projected t-SNE for batch correction (BC-t-SNE). All t-SNE embeddings use a perplexity of 10, and all UMAP embeddings use 15 nearest neighbors.
Figure 2. Local inverse Simpson's index for batch (bLISI) and cell type (cLISI) across multiple perplexity values.
LISI is computed using distances from the embeddings. The original embeddings and partial embeddings are compared across perplexity values, which capture different neighborhood sizes around each cell.
Figure 3. Single-cell RNA-sequencing data embeddings and partial embeddings with respect to batch across varying numbers of principal coordinates.
Each dimension reduction method takes a subset of adjusted or unadjusted principal coordinates. The dimension of this subset is varied across figure columns. Partial t-SNE (p-t-SNE) and partial UMAP (p-UMAP) adjust for batches.
Figure 4. Application of partial embeddings to brain cortical thickness measurements (a) and regional volumes (b) across two neuroimaging studies.
(a) visualizes cortical thickness data from the Alzheimer's Disease Neuroimaging Initiative, from which we include 505 participants. These participants are diagnosed as cognitively normal (CN), as having late mild cognitive impairment (LMCI), or as having Alzheimer's disease (AD). Participants were imaged on many scanners from three distinct manufacturers. (b) shows results from a traveling-subjects study of eleven multiple sclerosis (MS) patients, each imaged multiple times across four study sites. The Hopkins site uses a Philips scanner, while the three other sites use Siemens scanners.
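The bLISI and cLISI scores reported in Figures 1 and 2 rest on the inverse Simpson's index of labels in each cell's neighborhood. As a rough illustration only, the sketch below computes a simplified variant with uniform weights over the k nearest neighbors; LISI as commonly implemented instead weights neighbors with a perplexity-based kernel, so these numbers would not match the published scores. The function name `knn_inverse_simpson` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_inverse_simpson(points, labels, k):
    """Inverse Simpson's index of labels among each point's k nearest neighbors.

    Simplification: neighbors are weighted uniformly rather than with a
    perplexity-based kernel.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        counts = Counter(neighbor_labels)
        simpson = sum((c / len(neighbor_labels)) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return scores

# Toy embedding: two well-separated clusters, one per cell type,
# with two batches mixed inside each cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
cell_type = ["A", "A", "A", "A", "B", "B", "B", "B"]
batch = [1, 2, 1, 2, 1, 2, 1, 2]

c_scores = knn_inverse_simpson(points, cell_type, k=3)
b_scores = knn_inverse_simpson(points, batch, k=3)
print("cell-type scores:", c_scores)
print("batch scores:    ", b_scores)
```

In this toy case every neighborhood contains a single cell type, so the cell-type score sits at its minimum of 1, while batches are mixed within each neighborhood and the batch score rises above 1. This matches the direction of interpretation in the Figure 1 caption: higher for batch means better integration, lower for cell type means better separation.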

References

    1. Aliverti E., Tilson J. L., Filer D. L., Babcock B., Colaneri A., Ocasio J., Gershon T. R., Wilhelmsen K. C., and Dunson D. B. (2020). Projected t-SNE for batch correction. Bioinformatics (Oxford, England), 36(11):3522–3527.
    2. Amid E. and Warmuth M. K. (2022). TriMap: Large-scale Dimensionality Reduction Using Triplets. arXiv:1910.00204 [cs, stat].
    3. Baron M., Veres A., Wolock S. L., Faust A. L., Gaujoux R., Vetere A., Ryu J. H., Wagner B. K., Shen-Orr S. S., Klein A. M., Melton D. A., and Yanai I. (2016). A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Systems, 3(4):346–360.e4.
    4. Beer J. C., Tustison N. J., Cook P. A., Davatzikos C., Sheline Y. I., Shinohara R. T., and Linn K. A. (2020). Longitudinal ComBat: A method for harmonizing longitudinal multi-scanner imaging data. NeuroImage, 220:117129.
    5. Belkin M. and Niyogi P. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 15(6):1373–1396.
