F1000Res. 2017 Apr 28:6:595.
doi: 10.12688/f1000research.11290.1. eCollection 2017.

Gene length and detection bias in single cell RNA sequencing protocols

Belinda Phipson et al. F1000Res. 2017.

Abstract

Background: Single cell RNA sequencing (scRNA-seq) has rapidly gained popularity for profiling transcriptomes of hundreds to thousands of single cells. This technology has led to the discovery of novel cell types and revealed insights into the development of complex tissues. However, many technical challenges need to be overcome during data generation. Due to minute amounts of starting material, samples undergo extensive amplification, increasing technical variability. A solution for mitigating amplification biases is to include unique molecular identifiers (UMIs), which tag individual molecules. Transcript abundances are then estimated from the number of unique UMIs aligning to a specific gene; PCR duplicates, which carry copies of the same UMI, are not included in expression estimates.

Methods: Here we investigate the effect of gene length bias in scRNA-seq across a variety of datasets that differ in terms of capture technology, library preparation, cell types and species.

Results: We find that scRNA-seq datasets that have been sequenced using a full-length transcript protocol exhibit gene length bias akin to bulk RNA-seq data. Specifically, shorter genes tend to have lower counts and a higher rate of dropout. In contrast, protocols that include UMIs do not exhibit gene length bias, with a mostly uniform rate of dropout across genes of varying length. Across four different scRNA-seq datasets profiling mouse embryonic stem cells (mESCs), we found the subset of genes that are only detected in the UMI datasets tended to be shorter, while the subset of genes detected only in the full-length datasets tended to be longer.

Conclusions: We find that the choice of scRNA-seq protocol influences the detection rate of genes, and that full-length datasets exhibit gene length bias. In addition, despite clear differences between UMI and full-length transcript data, we illustrate that full-length and UMI data can be combined to reveal the underlying biology influencing expression of mESCs.
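The UMI-based estimate described in the Background can be illustrated with a minimal sketch. It assumes reads have already been aligned and reduced to hypothetical `(cell, gene, umi)` records, and it ignores sequencing errors within UMIs, which real pipelines typically need to handle. The function and variable names are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def umi_counts(alignments):
    """Count unique UMIs per (cell, gene) pair.

    `alignments` is an iterable of (cell, gene, umi) tuples, one per read.
    Reads that are PCR duplicates share the same UMI, so they collapse
    into a single entry in the per-gene set and are counted once.
    """
    seen = defaultdict(set)
    for cell, gene, umi in alignments:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy input: the second read is a PCR duplicate of the first.
reads = [
    ("cell1", "GeneA", "AACG"),
    ("cell1", "GeneA", "AACG"),
    ("cell1", "GeneA", "TTGC"),
    ("cell1", "GeneB", "AACG"),
    ("cell2", "GeneA", "GGAT"),
]
counts = umi_counts(reads)
```

Because each molecule is counted once regardless of how many reads it yields, a count obtained this way tracks the number of tagged molecules rather than the number of reads.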

Keywords: differential expression; gene detection rate; gene length bias; single cell RNA sequencing; unique molecular identifiers.

Conflict of interest statement

Competing interests: No competing interests were disclosed.

Figures

Figure 1. Gene length bias is present in non-UMI protocols.
Three different datasets were analysed: (a–c) mouse embryonic stem cells, n=530 (Kolodziejczyk et al., 2015), (d–f) human primordial germ cells, n=226 (Guo et al., 2015), (g–i) human brain whole organoids, n=494 (Camp et al., 2015). For all plots (a–i), the x-axis shows 10 gene length bins all containing roughly equal numbers of genes. The left panel shows gene-wise average log counts, the middle panel shows proportion of zeroes in each gene (dropout rate per gene), and the right panel shows average log counts corrected for gene length (RPKM).
Figure 2. Gene length bias is absent in UMI-based protocols.
Three different datasets were analysed: (a–c) mouse embryonic stem cells, n=127 (Grün et al., 2014), (d–f) human induced pluripotent stem cells, n=671 (Tung et al., 2016), and (g–i) human leukemia cell line K562 cells, n=219 (Klein et al., 2015). For all plots (a–i), the x-axis shows 10 gene length bins all containing roughly equal numbers of genes. The left panel shows gene-wise average log counts, the middle panel shows proportion of zeroes in each gene (dropout rate per gene), and the right panel shows average log expression corrected for gene length (RPKM).
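The three quantities plotted in the panels of Figures 1 and 2 can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the log transform is assumed to be log2(count + 1), RPKM uses the standard reads-per-kilobase-per-million definition, and all function names are hypothetical.

```python
import math


def equal_size_bins(lengths, n_bins=10):
    """Assign each gene a bin index (0 = shortest genes) so that the bins
    contain roughly equal numbers of genes."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bins = [0] * len(lengths)
    for rank, gene in enumerate(order):
        bins[gene] = rank * n_bins // len(lengths)
    return bins


def dropout_rate(gene_counts):
    """Proportion of cells in which the gene has a zero count."""
    return sum(1 for c in gene_counts if c == 0) / len(gene_counts)


def mean_log_count(gene_counts, pseudo=1.0):
    """Gene-wise average of log2(count + pseudo) across cells (assumed transform)."""
    return sum(math.log2(c + pseudo) for c in gene_counts) / len(gene_counts)


def rpkm(count, length_bp, library_size):
    """Reads per kilobase of gene length per million reads in the library."""
    return count * 1e9 / (length_bp * library_size)


# Toy data: four genes, with two bins instead of ten to keep it small.
lengths = [8000, 500, 3000, 1500]
bins = equal_size_bins(lengths, n_bins=2)
rate = dropout_rate([0, 0, 3, 1])
value = rpkm(10, 2000, 1_000_000)
```

Plotting `mean_log_count` and `dropout_rate` per gene against the bin index gives the kind of per-bin comparison the captions describe.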
Figure 3. Combining four mouse embryonic stem cell datasets.
Four different mouse embryonic stem cell datasets were combined, two full-length transcript (Buettner et al., 2015; Kolodziejczyk et al., 2015) and two UMI datasets (Grün et al., 2014; Ziegenhain et al., 2016). (a) Principal component analysis plot (coloured by dataset) shows the major source of variation between the cells is the dataset, with the UMI datasets on the left and the full-length datasets on the right. (b) Examining principal components two and three reveals that the next major source of variation in the data is the media in which cells are grown. In particular three datasets (two full-length and one UMI) which have cells grown in standard media with 2i inhibitors all cluster together on the left. J1, Rex1 and G4 refer to the mESC cell line. The Ziegenhain dataset has single cells profiled in two batches. (c–d) Gene length bias is present in full-length mESC datasets; dotted grey line is the median log-count in the first gene length bin. (e–f) Gene length bias is absent in UMI mESC datasets; dotted grey line is the median log-count in the first gene length bin.
Figure 4. Detection differences in UMI and full-length mESC datasets.
(a) A Venn diagram comparing the number of genes detected in two UMI mESC datasets, with the number detected in the two full-length datasets. We find that while the majority of genes are detected in all datasets (n=8689), there are genes that are uniquely detected when using either a full-length or UMI protocol. (b) Density plots of gene length for the subsets of genes corresponding to the Venn diagram in (a). The uniquely detected genes for the UMI datasets (blue line) tend to be shorter than the uniquely detected genes in the full-length datasets (red line), p=0.000297. (c) A Venn diagram showing the number of enriched GO categories in the 188 genes unique to UMIs and the 2649 genes unique to the full-length protocols. This reveals that these genes interrogate different biology, with only 3 GO categories in common. (d) Density plots of average gene length for each GO category corresponding to the significantly enriched GO categories in (c). We assigned each GO category an average length by calculating the median of the lengths of all genes annotated to each GO category. While there is not a significant shift in location in the density plots we noted a much greater spread of median length in the enriched GO categories for the uniquely detected UMI genes, largely driven by the presence of GO categories that tend to have very short genes.
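The per-category length used in Figure 4(d) is the median length of the genes annotated to each GO category. A minimal sketch of that calculation follows; the category labels, gene names, and lengths are invented for illustration, and skipping genes without a known length is an assumption rather than something the caption states.

```python
from statistics import median


def go_median_lengths(go_to_genes, gene_length):
    """Map each GO category to the median length of its annotated genes.

    Genes with no entry in `gene_length` are skipped (an assumption), and
    categories left with no genes are omitted from the result.
    """
    result = {}
    for category, genes in go_to_genes.items():
        known = [gene_length[g] for g in genes if g in gene_length]
        if known:
            result[category] = median(known)
    return result


# Hypothetical annotation and gene lengths in base pairs.
go_to_genes = {
    "category_A": ["g1", "g2", "g3"],
    "category_B": ["g2", "g4"],
    "category_C": ["g9"],  # no known length, so it is omitted
}
gene_length = {"g1": 400, "g2": 1200, "g3": 900, "g4": 5000}
medians = go_median_lengths(go_to_genes, gene_length)
```

The median is less sensitive than the mean to a few very long genes within a category, which matters because gene lengths are strongly right-skewed.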

References

    1. Buettner F, Natarajan KN, Casale FP, et al.: Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60. doi: 10.1038/nbt.3102
    2. Camp JG, Badsha F, Florio M, et al.: Human cerebral organoids recapitulate gene expression programs of fetal neocortex development. Proc Natl Acad Sci U S A. 2015;112(51):15672–7. doi: 10.1073/pnas.1520760112
    3. Dobin A, Davis CA, Schlesinger F, et al.: STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. doi: 10.1093/bioinformatics/bts635
    4. Ewels P, Magnusson M, Lundin S, et al.: MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047–3048. doi: 10.1093/bioinformatics/btw354
    5. Gentleman RC, Carey VJ, Bates DM, et al.: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10):R80. doi: 10.1186/gb-2004-5-10-r80