Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Jan 27:9:e51254.
doi: 10.7554/eLife.51254.

Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments

Affiliations

Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments

Christopher A Jackson et al. Elife. .

Abstract

Understanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors and target genes. Gene regulatory networks are typically constructed from gene expression data acquired following genetic perturbation or environmental stimulus. Single-cell RNA sequencing (scRNAseq) captures the gene expression state of thousands of individual cells in a single experiment, offering advantages in combinatorial experimental design, large numbers of independent measurements, and accessing the interaction between the cell cycle and environmental responses that is hidden by population-level analysis of gene expression. To leverage these advantages, we developed a method for scRNAseq in budding yeast (Saccharomyces cerevisiae). We pooled diverse transcriptionally barcoded gene deletion mutants in 11 different environmental conditions and determined their expression state by sequencing 38,285 individual cells. We benchmarked a framework for learning gene regulatory networks from scRNAseq data that incorporates multitask learning and constructed a global gene regulatory network comprising 12,228 interactions.

Keywords: S. cerevisiae; computational biology; gene regulatory networks; single cell RNA sequencing; systems biology; transcription factors.

Plain language summary

Organisms switch their genes on and off to adapt to changing environments. This takes place thanks to complex networks of regulators that control which genes are actively ‘read’ by the cell to create the RNA molecules that are needed at the time. Piecing together these networks is key to fully understand the inner workings of living organisms, and how to potentially modify or artificially create them. Single-cell RNA sequencing is a powerful new tool that can measure which genes are turned on (or ‘expressed’) in an individual cell. Datasets with millions of gene expression profiles for individual cells now exist for organisms such as mice or humans. Yet, it is difficult to use these data to reconstruct networks of regulators; this is partly because scientists are not sure if the computational methods normally used to build these networks also work for single-cell RNA sequencing data. One way to check if this is the case is to use the methods on single-cell datasets from organisms where the networks of regulators are already known, and check whether the computational tools help to reach the same conclusion. Unfortunately, the regulatory networks in the organisms for which scientists have a lot of single-cell RNA sequencing data are still poorly known. There are living beings in which the networks are well characterised – such as yeast – but it has been difficult to do single-cell sequencing in them at the scale seen in other organisms. Jackson, Castro et al. first adapted a system for single-cell sequencing so that it would work in yeast. This generated a gene expression dataset of over 40,000 yeast cells. They then used a computational method (called the Inferelator) on these data to construct networks of regulators, and the results showed that the method performed well. This allowed Jackson, Castro et al. to start mapping how different networks connect, for example those that control the response to the environment and cell division. This is one of the benefits of single-cell RNA methods: cell division for example is not a process that can be examined at the level of a population, since the cells may all be at different life stages. In the future, the dataset will also be useful to scientists to benchmark a variety of single cell computational tools.

PubMed Disclaimer

Conflict of interest statement

CJ, DC, GS, RB, DG No competing interests declared

Figures

Figure 1.
Figure 1.. Single-Cell RNA-Seq Experimental Workflow in Saccharomyces Cerevisiae.
Schematic workflows for: (A) Growth of a transcriptionally-barcoded pool of 11 nitrogen metabolism transcription factor (TF) knockout strains and a wild-type control strain each analyzed with six biological replicates (B) Synthesis in microfluidic droplets of single-cell cDNA with a cell-specific index sequence (IDX) attached to the oligo-dT primer, and a common template switch oligo (TSO). cDNA is processed for whole-transcriptome libraries, to quantify gene expression. In parallel, PCR products are amplified containing the genotype-specific transcriptional barcode (BC) encoded on the KanR antibiotic resistance marker mRNA, to identify cell genotype. Expression DNA libraries and PCR products are separately indexed for multiplexed sequencing (C) Processing of single-cell sequencing data using Unique Molecular Identifiers (UMI) into a count matrix which is used to learn a gene regulatory network using multi-task network inference from several different growth conditions.
Figure 1—figure supplement 1.
Figure 1—figure supplement 1.. Strain Construction Workflow and Validation.
(A) Construction of KanMX[BC] deletion cassettes containing degenerate barcodes in the 3’ untranslated region, followed by transformation into yeast to create a transcription factor knockout array (B) Spotting of the deletion array onto YPD (control) and media containing different nitrogen sources plates. Each column is a separate TF deletion, and each spot is a uniquely barcoded biological replicate. There are six biological replicates for each of 12 genotypes, for a total of 72 unique strains in the array.
Figure 2.
Figure 2.. Gene expression of single Yeast Cells Cluster Based on Environmental Growth Condition.
(A) Normalized density histograms of raw UMI counts of the core glycolytic genes ENO2 and PDC1, and the alcohol respiration gene ADH2 in each environmental growth condition. Mean UMI count for each of the 12 different strain genotypes within each growth condition are plotted as dots on the X axis. (B–C) Uniform Manifold Approximation and Projection (UMAP) projection of log-transformed and batch-normalized scRNAseq data. Axes are dimensionless variables V1 and V2 with arbitrary units, here omitted. Individual cells are colored by environmental growth condition (B) or by strain genotype (C). Growth conditions are abbreviated as in Table 1.
Figure 2—figure supplement 1.
Figure 2—figure supplement 1.. Quality Control of Single-Cell RNA Sequencing Data.
(A) The number of cells that pass all quality control and preprocessing filters for each growth condition. Each genotype has multiple independently barcoded biological replicates that are plotted separately within each condition. The mean number of cells for each genotype within a condition is plotted as a horizontal line. (B) The mean count of unique transcripts, determined by UMI, for each growth condition. Biological replicates are plotted separately for each genotype within each condition. The mean number of transcripts per genotype is plotted as a horizontal line. (C) The distribution of unique transcript count per cell across all growth conditions. (D) The distribution of unique transcript counts per cell across all genotypes.
Figure 2—figure supplement 2.
Figure 2—figure supplement 2.. Single-Cell RNA Expression Comparison.
(A) Pairwise ranked gene expression plots of cells grown in YPD to mid-log phase. Data sets are bulk counts [TRIZOL] from FY4/FY5 diploids (n = 6), 10x-based 3’ end-labeled single-cell counts [10x-scRNA] from FY4/FY5 diploids (n = 976), 5’ end-labeled single-cell counts [yscRNA (2019)] from BY4741 haploids (n = 127), SCnorm-calculated normalized counts Gasch et al. (2017) from BY4741 haploids (n = 163), and bulk transcripts per kilobase million [GSE135430] from BY4741 haploids (n = 12). (B) Correlation heatmap showing spearman’s rank correlation between each sample. (C) Counts per sample after sequencing and de-artifacting with UMIs for the bulk experiment on FY4/FY5 wild-type RNA extracted with TRIZOL (n = 6).
Figure 2—figure supplement 3.
Figure 2—figure supplement 3.. Expression of Categories of Genes in Single Cells.
UMAP projection of log-transformed and batch-normalized scRNAseq data, colored by: (A) Condition, as Figure 2B (B) Total raw count of transcripts (C) Percentage of transcripts which are ribosomal genes (D) Percentage of transcripts which are ribosomal biogenesis genes (E) Percentage of transcripts which are induced environmental stress response (iESR) genes (F) Percentage of transcripts which are mitochondrially-encoded genes.
Figure 2—figure supplement 4.
Figure 2—figure supplement 4.. Measures of Gene Variance in Each Condition.
(A) Coefficient of variation (standard deviation over mean) plotted against mean of each gene for each growth condition. Both axes are plotted on a log scale. (B) The mean pearson residuals (residuals over expected standard deviation) of a regularized negative binomial regression model calculated for each gene by the R package sctransform. The mean gene expression is plotted on a log scale.
Figure 3.
Figure 3.. Cells Within Conditions Cluster According to Cell Cycle Genes.
(A) Cells from each growth condition were separately normalized and transformed into 2-dimensional space using UMAP. The log-transformed, normalized expression for each cell of (i) the G1-phase specific marker PIR1, (ii) the G1-phase daughter-cell specific marker DSE2, (iii) the S-phase specific marker histone 2B (HTB) is shown; (iv) the genotype and (v) the cluster membership of each cell. (B) Summary of clustered single cell expression within the YPD and RAPA growth conditions (i) Proportion of cells from a specific strain genotype within each cluster (ii) The mean log-transformed, normalized expression of the G1- and S-phase marker genes, as well as a hexokinase gene HXK2 for each cluster (C) Schematic of the mitotic cell cycle with expression of DSE2, PIR1, and HTB genes annotated.
Figure 3—figure supplement 1.
Figure 3—figure supplement 1.. Expression of Important Genes For Clustering.
(A) Gene expression heatmap of genes that are specific for clusters of cells in YPD. Genes name are colored green for G1 phase genes, yellow for S phase genes, blue for G2 phase genes, and purple for M phase genes (B) Summary of clustered single cells within growth conditions (i) Proportion of cells within each cluster that consist of a specific strain genotype (ii) The mean within each cluster of log-transformed, normalized expression of the G1- and S-phase marker genes, as well as a hexokinase gene HXK2 that is unlikely to be responding to cell cycle.
Figure 3—figure supplement 2.
Figure 3—figure supplement 2.. Some Conditions Have Stress Response Clusters Cells from each growth condition were separately normalized and transformed into 2-dimensional space using UMAP.
These single cells are colored by (A) Total raw UMI count (B) Percentage of transcripts which are ribosomal genes (C) Percentage of transcripts which are induced environmental stress response (iESR) genes (D) Percentage of transcripts which annotated as G1 phase genes (E) Percentage of genes which are annotated as S phase genes (F) Percentage of genes which are annotated as G2 phase genes (G) Percentage of genes which are annotated as M phase genes.
Figure 4.
Figure 4.. Impact of Deleting Transcription Factors on Gene Expression.
(A) Violin plots of the log2 batch-normalized expression of the general amino acid permease gene GAP1 in YPD, RAPA, ammonium-limited media, and urea-limited media. (B) Count of differentially expressed genes in each combination of growth condition and strain genotype. Data were transformed to pseudobulk values by summing all counts for each the six biological replicates for each genotype and then analyzed for differential gene expression using DESeq2 [1.5-fold change; p.adj <0.05]. (C) Log2(fold change) of genes differentially expressed in TF knockout strains compared to wildtype, when grown in YPD. Asterisks denote statistically significant differences in gene expression [1.5-fold change; p.adj <0.05].
Figure 4—figure supplement 1.
Figure 4—figure supplement 1.. Differential Gene Expression Varies by Condition.
(A) Distribution by strain genotype of the log2 batch-normalized expression of the ammonium permease gene MEP2 and the glutamine synthetase gene GLN1 in YPD, RAPA, and ammonium limitation. (B) Number of differentially expressed genes identified after bulking between wild-type cells in each growth condition (C) Log2(fold change) of genes differentially expressed in TF knockout strains compared to wildtype, when grown in YPD and then treated with rapamycin. Asterisks denote statistically significant differences in gene expression [1.5-fold change; p.adj <0.05].
Figure 5.
Figure 5.. Model Performance and Impact of Data Imputation, Prior Selection, and Multitask learning on Network Inference using the Inferelator.
(A) Model performance of Inferelator (TFA-BBSR) network inference after shuffling priors [Neg.Shuffled], on a simulated negative data set [Neg. Data], on the unaltered count matrix [No Imputation], and after imputing missing data from the count matrix using the MAGIC, ScImpute, and VIPER packages. Model performance is shown using area under the precision-recall curve [AUPR], as well as the number of network edges using a precision (>0.5) cutoff, and the number of network edges using a confidence (>0.95) cutoff. Each point plotted in gray is a separate cross-validation analysis, with mean +/- one standard deviation plotted in black (n = 10). (B) Median AUPR after cross-validation (n = 10) and resampling to different numbers of cells, for priors extracted from the gold standard [GS], the YEASTRACT database, Bussemaker et al, priors predicted from ATAC-seq data and motif searching, and no prior data. (C) AUPR of separate cross-validation network inference using cells from all growth conditions, or from individual conditions separately. Each cross-validation (n = 10) was downsampled to the same number of cells. (D) Cross-validation (n = 10) using the YEASTRACT prior data. Networks are learned for all conditions together [BBSR (ALL) ●], for all conditions individually with TFA-BBSR followed by combination [BBSR (BY TASK) ▲], and for all conditions together in multi-task learning followed by combination [AMuSR (MTL) ◆]. Models are evaluated by (i) AUPR on the aggregate, final network and (ii) AUPR for each task-specific subnetwork from BBSR (BY TASK) (▲) and AMuSR (MTL) (◆).
Figure 5—figure supplement 1.
Figure 5—figure supplement 1.. Low-Dimensional Clustering of Imputed Data Scatter plot after UMAP into 2-dimensional space.
Unmodified data (A) is compared to imputation methods to recover missing data using MAGIC (B), ScImpute (C), and VIPER (D).
Figure 6.
Figure 6.. Reconstruction of a Gene Regulatory Network Identifies New Regulatory Relationships.
A network inferred from the single-cell expression data using multi-task learning and the YEASTRACT TF-gene interaction prior, with a cutoff at precision >0.5. (A) Network graph with known interaction edges from the prior in gray and new inferred interaction edges in red (B) Network graph of the 11 nitrogen-responsive transcription factors with known edges from the prior in gray and new edges in red (C) The number of interactions for each TF; interaction edges present in the prior that are not in the final network are included in black. The nitrogen TFs knocked out in this work are labeled in blue, and TFs with gene ontology annotations for mitotic cell cycle are annotated in green (D) Gene ontology classification of network interactions by the GO slim biological process terms annotated for the target gene and the regulatory TF (the GO term transcription from RNA pol II is omitted from the annotations for regulatory TFs).
Figure 6—figure supplement 1.
Figure 6—figure supplement 1.. Summary of Learned GRN.
(A) Histogram of the number of regulators per target gene in the learned and prior network (B) Hexagonal heatmap of the ranked expression of a gene against the number of regulators for that gene; r2 is calculated using Spearman’s Rank correlation (C) The number of interactions for a TF that are activating (positive) or repressing (negative). Bars are colored by the number of separate conditions in which the interaction is identified, with condition-specific networks selected using a precision threshold of 0.5 (D) The number of unique TF-gene interactions identified in condition-specific networks (E) The distribution of learned and prior interactions from the final, aggregate network in condition-specific networks (F) Evidence for learned network edges which are not in the prior. Interactions where YEASTRACT has evidence for a change in gene expression when the TF is perturbed are in tan. Interactions where YEASTRACT has evidence that the TF physically localizes to the gene are in blue. Interactions where YEASTRACT has no annotated evidence are in red.
Figure 7.
Figure 7.. Coordinated regulation of Nitrogen Response and Cell Cycle.
(A) A gene regulatory network showing target genes that are regulated by at least one nitrogen TF (blue) and at least one cell cycle TF (green). Target gene nodes are colored by GO slim term. Newly inferred regulatory edges are red and known regulatory edges from the prior are in gray. Transcription factor activity (TFA) is calculated from the learned network and then scaled to a z-score over all cells which do not have that TF deleted (e.g. gcn4Δ cells are omitted from the calculation for GCN4 TFA). The mean TFA z-score for four selected conditions is inset for GAAC and NCR TFs (B) TFA for cell cycle TFs for each cell in the YPD growth condition.
Figure 7—figure supplement 1.
Figure 7—figure supplement 1.. Cell Cycle TF Activity Clusters within Growth Conditions.
Cells grown in YPD are plotted after UMAP with (A–B) z-Score of the calculated transcription factor activity (TFA) based on the learned network for (A) select nitrogen TFs and (B) cell cycle TFs (C) Log2 of the expression of cell cycle TFs (D) Log2 of the expression of cell cycle target genes.

References

    1. Adamson B, Norman TM, Jost M, Cho MY, Nuñez JK, Chen Y, Villalta JE, Gilbert LA, Horlbeck MA, Hein MY, Pak RA, Gray AN, Gross CA, Dixit A, Parnas O, Regev A, Weissman JS. A multiplexed Single-Cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell. 2016;167:1867–1882. doi: 10.1016/j.cell.2016.11.048. - DOI - PMC - PubMed
    1. Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine JC, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S. SCENIC: single-cell regulatory network inference and clustering. Nature Methods. 2017;14:1083–1086. doi: 10.1038/nmeth.4463. - DOI - PMC - PubMed
    1. Airoldi EM, Miller D, Athanasiadou R, Brandt N, Abdul-Rahman F, Neymotin B, Hashimoto T, Bahmani T, Gresham D. Steady-state and dynamic gene expression programs in Saccharomyces cerevisiae in response to variation in environmental nitrogen. Molecular Biology of the Cell. 2016;27:1383–1396. doi: 10.1091/mbc.E14-05-1013. - DOI - PMC - PubMed
    1. Andréasson C, Ljungdahl PO. Receptor-mediated endoproteolytic activation of two transcription factors in yeast. Genes & Development. 2002;16:3158–3172. doi: 10.1101/gad.239202. - DOI - PMC - PubMed
    1. Arrieta-Ortiz ML, Hafemeister C, Bate AR, Chu T, Greenfield A, Shuster B, Barry SN, Gallitto M, Liu B, Kacmarczyk T, Santoriello F, Chen J, Rodrigues CD, Sato T, Rudner DZ, Driks A, Bonneau R, Eichenberger P. An experimentally supported model of the Bacillus subtilis global transcriptional regulatory network. Molecular Systems Biology. 2015;11:839. doi: 10.15252/msb.20156236. - DOI - PMC - PubMed

Publication types

MeSH terms

Substances

Associated data