Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing

Graham Heimberg et al. Cell Syst. 2016 Apr 27;2(4):239-250. doi: 10.1016/j.cels.2016.04.001. Epub 2016 Apr 27.

Abstract

A tradeoff between precision and throughput constrains all biological measurements, including sequencing-based technologies. Here, we develop a mathematical framework that defines this tradeoff between mRNA-sequencing depth and error in the extraction of biological information. We find that transcriptional programs can be reproducibly identified at 1% of conventional read depths. We demonstrate that this resilience to noise of "shallow" sequencing derives from a natural property, low dimensionality, which is a fundamental feature of gene expression data. Accordingly, our conclusions hold for ∼350 single-cell and bulk gene expression datasets across yeast, mouse, and human. In total, our approach provides quantitative guidelines for the choice of sequencing depth necessary to achieve a desired level of analytical resolution. We codify these guidelines in an open-source read depth calculator. This work demonstrates that the structure inherent in biological networks can be productively exploited to increase measurement throughput, an idea that is now common in many branches of science, such as image processing.
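The depth/accuracy tradeoff described in the abstract can be explored empirically: downsample a count matrix by multinomial sampling at decreasing read depths and measure how well the dominant principal component is preserved. The sketch below is an illustration only, not the paper's code; the expression matrix, module size, and depths are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" expression: 50 samples x 500 genes with one dominant
# module of 100 covarying genes (low-dimensional structure, as in the paper).
n_samples, n_genes = 50, 500
program = rng.normal(size=n_samples)             # activity of one program
loadings = np.zeros(n_genes)
loadings[:100] = 1.0                             # the 100-gene module
log_expr = np.outer(program, loadings) * 0.5
log_expr += rng.normal(scale=0.1, size=(n_samples, n_genes))
expression = np.exp(log_expr)
p = expression / expression.sum(axis=1, keepdims=True)  # mRNA fractions

def counts_at_depth(p, depth, rng):
    """Simulate sequencing: multinomial sampling of `depth` reads per sample."""
    return np.vstack([rng.multinomial(depth, row) for row in p])

def first_pc(X):
    """Leading principal component (gene loadings) via SVD of centered data."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, full_matrices=False)[2][0]

deep = first_pc(counts_at_depth(p, 1_000_000, rng))
shallow = first_pc(counts_at_depth(p, 10_000, rng))  # 1% of the deep depth

# Overlap of the dominant program at 1% depth (1.0 = identical direction;
# abs() removes the arbitrary sign of a principal component).
overlap = abs(deep @ shallow)
print(overlap)
```

Because the 100 module genes covary, the shallow-depth noise averages out across the module and the dominant component survives aggressive downsampling, which is the effect the paper quantifies.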


Figures

Figure 1. A mathematical model reveals factors determining the performance of shallow mRNA-seq
(A) mRNA-seq throughput as a function of sequencing depth per sample for a typical sequencing capacity of 200 million reads. (B) Unsupervised learning techniques are used to identify transcriptional programs. We ask when and why shallow mRNA-seq can accurately identify transcriptional programs. (C) Decreasing sequencing depth adds measurement noise to the transcriptional programs identified by principal component analysis. Our approach reveals that dominant programs, defined as those that explain relatively large variances in the data, are tolerant to measurement noise.
Figure 2. Transcriptional states of mouse tissues are distinguishable at low read coverage
(A) Principal component error as a function of read depth for selected principal components of the Shen et al. data. For the first three principal components, 1% of the traditional read depth is sufficient to achieve >80% accuracy. Improvements in error exhibit diminishing returns as read depth is increased. Less dominant transcriptional programs (principal components 8 and 15 shown) are more sensitive to sequencing noise. (B) Variance explained by transcriptional program (blue) and differences between principal values (green) for the Shen et al. data. The leading, dominant transcriptional programs have principal values that are well separated from later principal values, suggesting that they should be more robust to measurement noise. (C) Gene Set Enrichment Analysis (GSEA) significance for the top ten terms of principal components two (top) and three (bottom) as a function of read depth. 32,000 reads are sufficient to recover all top ten terms in the first three principal components. (Analysis for the first principal component is shown in Figures S1D and S1E.) (D) Projection of a subset of the Shen et al. tissue data onto principal components two and three. The ellipses represent uncertainty at specific read depths. Similar tissues lie close together. Transcriptional program two separates neural from non-neural tissues, while transcriptional program three distinguishes tissues involved in haematopoiesis from other tissues, consistent with the GSEA of these programs in (C).
Figure 3. Transcriptional states of single cells in the mouse brain are distinguishable at low transcript coverage
(A) Principal component error as a function of read depth for selected principal components of the Zeisel et al. data. (B) Accuracy of cell-type classification as a function of transcripts per cell. Accuracy plateaus with increasing transcript coverage. At 1,000 transcripts per cell, all three cell types can be distinguished with low error. At 100 transcripts per cell, pyramidal cells cannot be distinguished from each other, while oligodendrocytes remain distinct. (C, left) Covariance matrix of genes with high absolute loadings in the first principal component; the genes with the 100 highest positive and 100 lowest negative loadings are displayed. (C, middle) The first principal component is enriched for genes indicative of oligodendrocytes and neurons. (C, right) Gene significance as a function of transcript count for the first principal component. (D) True and false detection rates as a function of transcript count for genes significantly associated with the first three principal components. Below 100 transcripts per cell, false positives are common.
Figure 4. Modularity of gene expression enables accurate, low depth transcriptional program identification
(A) Variance explained and covariance matrix for increasing gene expression covariance in a model. (B, top) Variance explained by different principal components of the Zeisel et al. data set. (B, middle) The covariance matrix shows large modules of covarying genes. (B, bottom) Dominant transcriptional programs are robust to low-coverage profiling, as predicted by the model. Shuffling the dataset destroys the modular structure, resulting in noise-sensitive transcriptional programs: for the shuffled data, 4,250 transcripts are required for 80% accuracy on the first three principal components, whereas 340 transcripts suffice for the original dataset.
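The shuffling control used here is simple to reproduce in principle: permuting each gene's values independently across samples preserves every gene's marginal distribution but destroys gene-gene covariance, and with it the dominance of the leading program. A minimal sketch on a synthetic modular matrix (sizes and effect strengths are illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic modular expression matrix: one block of 100 covarying genes
# driven by a shared program, plus 300 independent genes.
n_samples, n_genes = 60, 400
program = rng.normal(size=n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, :100] += program[:, None] * 2.0

def var_explained_pc1(X):
    """Fraction of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return (s[0] ** 2) / (s ** 2).sum()

# Shuffle each gene independently across samples: marginal distributions
# are kept, but gene-gene covariance (the modular structure) is destroyed.
shuffled = np.column_stack([rng.permutation(col) for col in X.T])

original_ve = var_explained_pc1(X)
shuffled_ve = var_explained_pc1(shuffled)
```

In the modular matrix, the first component absorbs the shared variance of the whole 100-gene block; after shuffling, no component is dominant, which is why the shuffled data demand far more transcripts for the same accuracy.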
Figure 5. Gene expression survey of 352 public datasets reveals broad tolerance of bioinformatics analysis to shallow profiling
(A, left) Variance explained by the first five transcriptional programs of 352 published yeast, mouse, and human microarray datasets. Shuffling the microarray datasets removes gene-gene covariance and destroys the relative dominance of the leading transcriptional programs. (A, right) Read depth required to recover the first five principal components of the 352 datasets with 80% accuracy. After gene expression covariance is removed from the data, a median of ~10 times more reads is required to achieve the same accuracy. (B) Accuracy of Gene Set Enrichment Analysis of the human microarray datasets at low read depth (100,000 reads, i.e., 1% of conventional depth). Reactome pathway database gene sets are correctly identified (blue) or not identified (yellow) at low read depth (false positives in red). ~80% of gene sets are correctly recovered at 100,000 reads. (C) Accuracy of Gene Set Enrichment Analysis as a function of read depth.
Figure 6. Mathematical framework provides a Read Depth Calculator and guidelines for shallow mRNA-seq experimental design
(A) Error in the first principal component of the Zeisel et al. dataset for varying cell number and read depth. Black circles denote a fixed number of total transcripts (100,000). Error can be reduced by increasing either transcript coverage or the number of cells profiled. (B) Number of reads required (color) to achieve a desired error (y-axis) for a given principal value (x-axis). Typical principal values (dashed black vertical lines) are the medians across the 352 gene expression datasets. (C) Error of the Read Depth Calculator (Equation 2) across the 176 gene expression datasets used for validation (out of 352 total). The calculator predicts the number of reads needed to achieve 80% PCA accuracy in each dataset (colored dots). The predicted values closely agree with simulated results, with median error <10% for the first five transcriptional programs.
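The validation strategy described for the calculator — comparing predicted depth against simulation — can be mimicked by brute force: sweep candidate depths and report the smallest one at which the downsampled first principal component overlaps the deep-depth component by at least 80%. The sketch below is a simulation stand-in on synthetic data; the paper's actual calculator uses an analytical formula (its Equation 2, not reproduced here), and the matrix, depth grid, and overlap threshold are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic mRNA fractions with one dominant 60-gene program.
n_samples, n_genes = 40, 300
program = rng.normal(size=n_samples)
log_expr = rng.normal(scale=0.1, size=(n_samples, n_genes))
log_expr[:, :60] += program[:, None] * 0.5
expr = np.exp(log_expr)
p = expr / expr.sum(axis=1, keepdims=True)

def pc1(counts):
    """Leading principal component (gene loadings) of a count matrix."""
    c = counts - counts.mean(axis=0)
    return np.linalg.svd(c, full_matrices=False)[2][0]

# Deep-depth reference component (1 million reads per sample).
reference = pc1(np.vstack([rng.multinomial(1_000_000, row) for row in p]))

def reads_for_accuracy(p, target=0.8, depths=(100, 300, 1_000, 3_000, 10_000)):
    """Smallest simulated depth whose PC1 overlaps the deep-depth PC1
    by at least `target` (abs() removes the component's sign ambiguity)."""
    for depth in depths:
        counts = np.vstack([rng.multinomial(depth, row) for row in p])
        if abs(pc1(counts) @ reference) >= target:
            return depth
    return None

needed = reads_for_accuracy(p)
```

A sweep like this is what an empirical check of a depth calculator amounts to: predictions are useful exactly when the analytical answer lands near the depth the simulation finds.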
