Low Dimensionality in Gene Expression Data Enables the Accurate Extraction of Transcriptional Programs from Shallow Sequencing

Graham Heimberg et al. Cell Syst. 2016 Apr 27;2(4):239-250. doi: 10.1016/j.cels.2016.04.001. Epub 2016 Apr 27.

Abstract

A tradeoff between precision and throughput constrains all biological measurements, including sequencing-based technologies. Here, we develop a mathematical framework that defines this tradeoff between mRNA-sequencing depth and error in the extraction of biological information. We find that transcriptional programs can be reproducibly identified at 1% of conventional read depths. We demonstrate that this resilience to noise of "shallow" sequencing derives from a natural property, low dimensionality, which is a fundamental feature of gene expression data. Accordingly, our conclusions hold for ∼350 single-cell and bulk gene expression datasets across yeast, mouse, and human. In total, our approach provides quantitative guidelines for the choice of sequencing depth necessary to achieve a desired level of analytical resolution. We codify these guidelines in an open-source read depth calculator. This work demonstrates that the structure inherent in biological networks can be productively exploited to increase measurement throughput, an idea that is now common in many branches of science, such as image processing.
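The depth/accuracy tradeoff described in the abstract can be explored empirically: downsample a count matrix by multinomial sampling at decreasing read depths and measure how well the dominant principal component is preserved. The sketch below is an illustration only, not the paper's code; the expression matrix, module size, and depths are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" expression: 50 samples x 500 genes with one dominant
# module of 100 covarying genes (low-dimensional structure, as in the paper).
n_samples, n_genes = 50, 500
program = rng.normal(size=n_samples)             # activity of one program
loadings = np.zeros(n_genes)
loadings[:100] = 1.0                             # the 100-gene module
log_expr = np.outer(program, loadings) * 0.5
log_expr += rng.normal(scale=0.1, size=(n_samples, n_genes))
expression = np.exp(log_expr)
p = expression / expression.sum(axis=1, keepdims=True)  # mRNA fractions

def counts_at_depth(p, depth, rng):
    """Simulate sequencing: multinomial sampling of `depth` reads per sample."""
    return np.vstack([rng.multinomial(depth, row) for row in p])

def first_pc(X):
    """Leading principal component (gene loadings) via SVD of centered data."""
    Xc = X - X.mean(axis=0)
    return np.linalg.svd(Xc, full_matrices=False)[2][0]

deep = first_pc(counts_at_depth(p, 1_000_000, rng))
shallow = first_pc(counts_at_depth(p, 10_000, rng))  # 1% of the deep depth

# Overlap of the dominant program at 1% depth (1.0 = identical direction;
# abs() removes the arbitrary sign of a principal component).
overlap = abs(deep @ shallow)
print(overlap)
```

Because the 100 module genes covary, the shallow-depth noise averages out across the module and the dominant component survives aggressive downsampling, which is the effect the paper quantifies.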


Figures

Figure 1. A mathematical model reveals factors determining the performance of shallow mRNA-seq
(A) mRNA-seq throughput as a function of sequencing depth per sample for a typical sequencing capacity of 200 million reads. (B) Unsupervised learning techniques are used to identify transcriptional programs. We ask when and why shallow mRNA-seq can accurately identify transcriptional programs. (C) Decreasing sequencing depth adds measurement noise to the transcriptional programs identified by principal component analysis. Our approach reveals that dominant programs, defined as those that explain relatively large variances in the data, are tolerant to measurement noise.
Figure 2. Transcriptional states of mouse tissues are distinguishable at low read coverage
(A) Principal component error as a function of read depth for selected principal components of the Shen et al. data. For the first three principal components, 1% of the traditional read depth is sufficient to achieve >80% accuracy. Improvements in error exhibit diminishing returns as read depth is increased. Less dominant transcriptional programs (principal components 8 and 15 shown) are more sensitive to sequencing noise. (B) Variance explained by transcriptional program (blue) and differences between principal values (green) for the Shen et al. data. The leading, dominant transcriptional programs have principal values that are well separated from later principal values, suggesting that they should be more robust to measurement noise. (C) Gene Set Enrichment Analysis (GSEA) significance for the top ten terms of principal components two (top) and three (bottom) as a function of read depth. 32,000 reads are sufficient to recover all top ten terms in the first three principal components. (Analysis for the first principal component is shown in Figures S1D and S1E.) (D) Projection of a subset of the Shen et al. tissue data onto principal components two and three. The ellipses represent uncertainty at specific read depths. Similar tissues lie close together. Transcriptional program two separates neural from non-neural tissues, while transcriptional program three distinguishes tissues involved in haematopoiesis from other tissues, consistent with the GSEA of these programs in (C).
Figure 3. Transcriptional states of single cells in the mouse brain are distinguishable at low transcript coverage
(A) Principal component error as a function of read depth for selected principal components of the Zeisel et al. data. (B) Accuracy of cell-type classification as a function of transcripts per cell. Accuracy plateaus with increasing transcript coverage. At 1,000 transcripts per cell, all three cell types can be distinguished with low error. At 100 transcripts per cell, pyramidal cells cannot be distinguished from each other, while oligodendrocytes remain distinct. (C, left) Covariance matrix of genes with high absolute loadings in the first principal component; the genes with the 100 highest positive and 100 lowest negative loadings are displayed. (C, middle) The first principal component is enriched for genes indicative of oligodendrocytes and neurons. (C, right) Gene significance as a function of transcript count for the first principal component. (D) True and false detection rates as a function of transcript count for genes significantly associated with the first three principal components. Below 100 transcripts per cell, false positives are common.
Figure 4. Modularity of gene expression enables accurate, low depth transcriptional program identification
(A) Variance explained and covariance matrix for increasing gene expression covariance in a model. (B, top) Variance explained by different principal components of the Zeisel et al. data set. (B, middle) The covariance matrix shows large modules of covarying genes. (B, bottom) Dominant transcriptional programs are robust to low-coverage profiling, as predicted by the model. Shuffling the dataset destroys the modular structure, resulting in noise-sensitive transcriptional programs: for the shuffled data, 4,250 transcripts are required for 80% accuracy on the first three principal components, whereas 340 transcripts suffice for the original dataset.
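The shuffling control used here is simple to reproduce in principle: permuting each gene's values independently across samples preserves every gene's marginal distribution but destroys gene-gene covariance, and with it the dominance of the leading program. A minimal sketch on a synthetic modular matrix (sizes and effect strengths are illustrative assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic modular expression matrix: one block of 100 covarying genes
# driven by a shared program, plus 300 independent genes.
n_samples, n_genes = 60, 400
program = rng.normal(size=n_samples)
X = rng.normal(size=(n_samples, n_genes))
X[:, :100] += program[:, None] * 2.0

def var_explained_pc1(X):
    """Fraction of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return (s[0] ** 2) / (s ** 2).sum()

# Shuffle each gene independently across samples: marginal distributions
# are kept, but gene-gene covariance (the modular structure) is destroyed.
shuffled = np.column_stack([rng.permutation(col) for col in X.T])

original_ve = var_explained_pc1(X)
shuffled_ve = var_explained_pc1(shuffled)
```

In the modular matrix, the first component absorbs the shared variance of the whole 100-gene block; after shuffling, no component is dominant, which is why the shuffled data demand far more transcripts for the same accuracy.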
Figure 5. Gene expression survey of 352 public datasets reveals broad tolerance of bioinformatics analysis to shallow profiling
(A, left) Variance explained by the first five transcriptional programs of 352 published yeast, mouse, and human microarray datasets. Shuffling the microarray datasets removes gene-gene covariance and destroys the relative dominance of the leading transcriptional programs. (A, right) Read depth required to recover the first five principal components of the 352 datasets with 80% accuracy. After gene expression covariance is removed from the data, a median of ~10 times more reads is required to achieve the same accuracy. (B) Accuracy of Gene Set Enrichment Analysis of the human microarray datasets at low read depth (100,000 reads, i.e., 1% of conventional depth). Reactome pathway database gene sets are correctly identified (blue) or not identified (yellow) at low read depth (false positives in red). ~80% of gene sets are correctly recovered at 100,000 reads. (C) Accuracy of Gene Set Enrichment Analysis as a function of read depth.
Figure 6. Mathematical framework provides a Read Depth Calculator and guidelines for shallow mRNA-seq experimental design
(A) Error in the first principal component of the Zeisel et al. dataset for varying cell number and read depth. Black circles denote a fixed number of total transcripts (100,000). Error can be reduced by increasing either transcript coverage or the number of cells profiled. (B) Number of reads required (color) to achieve a desired error (y-axis) for a given principal value (x-axis). Typical principal values (dashed black vertical lines) are the medians across the 352 gene expression datasets. (C) Error of the Read Depth Calculator (Equation 2) across the 176 gene expression datasets used for validation (out of 352 total). The calculator predicts the number of reads needed to achieve 80% PCA accuracy in each dataset (colored dots). The predicted values closely agree with simulated results, with median error <10% for the first five transcriptional programs.
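The validation strategy described for the calculator — comparing predicted depth against simulation — can be mimicked by brute force: sweep candidate depths and report the smallest one at which the downsampled first principal component overlaps the deep-depth component by at least 80%. The sketch below is a simulation stand-in on synthetic data; the paper's actual calculator uses an analytical formula (its Equation 2, not reproduced here), and the matrix, depth grid, and overlap threshold are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic mRNA fractions with one dominant 60-gene program.
n_samples, n_genes = 40, 300
program = rng.normal(size=n_samples)
log_expr = rng.normal(scale=0.1, size=(n_samples, n_genes))
log_expr[:, :60] += program[:, None] * 0.5
expr = np.exp(log_expr)
p = expr / expr.sum(axis=1, keepdims=True)

def pc1(counts):
    """Leading principal component (gene loadings) of a count matrix."""
    c = counts - counts.mean(axis=0)
    return np.linalg.svd(c, full_matrices=False)[2][0]

# Deep-depth reference component (1 million reads per sample).
reference = pc1(np.vstack([rng.multinomial(1_000_000, row) for row in p]))

def reads_for_accuracy(p, target=0.8, depths=(100, 300, 1_000, 3_000, 10_000)):
    """Smallest simulated depth whose PC1 overlaps the deep-depth PC1
    by at least `target` (abs() removes the component's sign ambiguity)."""
    for depth in depths:
        counts = np.vstack([rng.multinomial(depth, row) for row in p])
        if abs(pc1(counts) @ reference) >= target:
            return depth
    return None

needed = reads_for_accuracy(p)
```

A sweep like this is what an empirical check of a depth calculator amounts to: predictions are useful exactly when the analytical answer lands near the depth the simulation finds.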
