Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2013 Jul 2;8(7):e67019.
doi: 10.1371/journal.pone.0067019. Print 2013.

ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq

Affiliations

ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq

Andrew D Fernandes et al. PLoS One. .

Abstract

Experimental variance is a major challenge when dealing with high-throughput sequencing data. This variance has several sources: sampling replication, technical replication, variability within biological conditions, and variability between biological conditions. The high per-sample cost of RNA-Seq often precludes the large number of experiments needed to partition observed variance into these categories as per standard ANOVA models. We show that the partitioning of within-condition to between-condition variation cannot reasonably be ignored, whether in single-organism RNA-Seq or in Meta-RNA-Seq experiments, and further find that commonly-used RNA-Seq analysis tools, as described in the literature, do not enforce the constraint that the sum of relative expression levels must be one, and thus report expression levels that are systematically distorted. These two factors lead to misleading inferences if not properly accommodated. As it is usually only the biological between-condition and within-condition differences that are of interest, we developed ALDEx, an ANOVA-like differential expression procedure, to identify genes with greater between- to within-condition differences. We show that the presence of differential expression and the magnitude of these comparative differences can be reasonably estimated with even very small sample sizes.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have the following interests: coauthor Andrew D. Fernandes is employed by YouKaryote Genomics. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Figures

Figure 1
Figure 1. Dirichlet-distributed proportions accurately account for the sampling variance.
This plot overlays the expected range between the 1–99% quantiles for formula image with observed range of formula image computed for the Liver library replicates in the L–K dataset. Marioni et al minimized technical error with an experimental design where the same Illumina library was run in two separate lanes. Monte Carlo Estimates of formula image are shown in red with the density of the values shown in orange, while 1–99% expected quantile ranges from the Dirichlet are shown in black. This demonstrates that the error inherent in high-throughput sequencing is greatest when the counts are small and least when the counts are large. The near-perfect overlay of actual and modelled values strongly support idea that modelling proportions through a Dirichlet-multinomial process accurately accounts for the sampling variance inherent in RNA-Seq, and by extension in other high-throughput sequencing analyses. The error in estimating the expected quantiles is observable by the size of the points plotted in black and becomes small when expression is non-trivial. Values on the formula image-axis were calculated with the given formula for formula image and were adjusted to remove the non-informative subspace as outlined in the text. Thus the formula image-axis value of zero corresponds to the expected per-gene logformula image-expression value.
Figure 2
Figure 2. Sampling and technical variance is distinct from within-condition variance.
A comparison of within-condition gene expression-difference to the median expression level for three different experiments. The left panel shows the sampling variance for comparison and the three experiments are shown in subsequent panels. The L-K RNA-Seq data set compares gene expression in two liver and two kidney samples. The Bacillus cereus RNA-Seq data set for samples grown at neutral pH and two samples 20 minutes after shift to grown at low pH. Meta-RNA-Seq data is for microbial gene expression analysis of four clinical vaginal samples from two women with a healthy microbiota and two women with a microbiota indicative of bacterial vaginosis. The RNA-Seq experiments in the L-K and B. cereus datasets were from controlled conditions with identical gene content per condition and show that the vast majority of highly-expressed genes have small within-condition estimates of formula image, and that estimate only becomes imprecise as formula image becomes very small. The Meta-RNA-Seq panel shows that when within-condition variance is high, there is no relationship between the expression level and the within-condition variance. Note that base-2 logarithms were used throughout.
Figure 3
Figure 3. Approximate ANOVA via Absolute and Relative fold differences.
The figure shows how the method explicitly accounts for the within-condition dispersion using as an example two genes with similar absolute fold differences (formula image) of -1.17 and -1.13 but very different relative fold differences (formula image) and formula image values in the L-K dataset. Dirichlet sampled distributions are generated from the raw read counts as described in the text. These distributions are log-transformed and the noninformative subspace is removed. Posterior distributions of formula image are shown for formula image and for replicates formula image. Both genes are abundantly expressed in this dataset, with median expression levels between formula image and formula image greater than the mean across-gene expression level. formula image is computed by randomly sampling one of the red distributions, randomly sampling one of the blue distributions, and subtracting for all pairs of between-condition distributions. formula image is computed by sampling a light and a dark from each red and blue distributions, subtracting light and dark, and selecting the difference with the greatest magnitude. formula image is computed as the ratio of a single realization of formula image and formula image, and is computed for each realization of formula image and formula image. The formula image distribution is narrower in A than in B implying a greater precision in estimating this value. This precision can be estimated by formula image, the quantile of zero in formula image, which is shown graphically as the black-filled area under the formula image distribution curves. The formula image values are 0.0001 and 0.035 in panels A and B respectively. The vertical arrows show the median values of formula image and formula image. Thus, the between-condition expression values for the gene in Panel A are scored as separable by ALDEx (formula image) but not for the gene in Panel B (formula image). These conclusions agree with inspection-based intuition from examining the initial adjusted log-expression distributions that are shown in the left panel.
Figure 4
Figure 4. Fold-change to variance (MW) plots of different ALDEx cutoff values.
Plotted here are formula image (i.e., the formula image-fold expression changes) vs. formula image (i.e., the maximum within-condition expression differences) for the L-K and Meta-RNA-Seq datasets. The grey background shows the density plot of the values. By construction, threshold formula image values of formula image should select genes for which the between-condition variation is reasonably likely to be at least twice the within-condition variation. This threshold is illustrated by the solid grey line, and the within/between-condition equivalence line is shown as the dashed grey line. The effect of altering the formula image and formula image cutoffs is shown. Large formula image values identify genes with a larger between-condition differential expression than within-condition variation. The formula image values were formula image (blue), formula image (cyan), formula image (yellow) and formula image (red). There are a number of points where the formula image values exceed the chosen threshold in both the L-K and Meta datasets. The formula image column shows the effect of identifying genes based on the estimated proportion of formula image. These are plotted for values of 0.1, 0.05, 0.01 and 0.001 in the two datasets, again coloured as blue, cyan, yellow or red respectively. Here we observed that the 0.01 and 0.001 cutoffs were very similar and the larger cutoffs admitted many more genes with high within-condition variation. The third panel shows the effect of enforcing the formula image along with each of the formula image cutoff values. In this case we observe that a formula image combined with an formula image is sufficient to ensure that genes identified as differentially expressed always have lower formula image than formula image values in both datasets. Using both values in combination allows arbitrarily small fold expression differences to be identified if they are supported by high within-condition concordance.
Figure 5
Figure 5. Comparison of four differential expression methods in the B. cereus dataset.
Transcript abundances identified as differential by the first three methods are highlighted in red on a background density plot. Default false discovery rates for each program were used, 0.05 for CuffDiff and 0.1 for both edgeR and DESeq, since these reflect the configurations in which most users will use these programs. In the case of ALDEx transcripts with formula image are highlighted in red and orange for formula image and formula image. Transcripts originating from genes contained on the plasmid that is found in one sample from each condition are circled. The top row shows typical Bland-Altman style (MA) plots where the median absolute fold change (formula image) is plotted vs. the mean expression value (Expression). The mean expression value on x-axis is 0 for the reasons outlined in the text. Notice that the edgeR method identifies differentially-expressed transcripts with much lower abundances than the other three methods. The plasmid-encoded genes are not differentiated on the Bland-Altman-style plots. The bottom row shows an MW plot of the median absolute fold change between-conditions (formula image) vs. the maximum within-condition difference (formula image) of the same data. Here it is clear that transcripts originating from the plasmid-encoded genes exhibit very large formula image values. Interestingly, there are a number of chromosomally-encoded genes in this dataset, and in the other two (see previous figure) that also show large formula image values, demonstrating that within-condition variation can be problematic even for samples derived from well-controlled conditions. Both CuffDiff and edgeR identified as differentially expressed a significant fraction of the plasmid-derived transcripts.
Figure 6
Figure 6. Venn diagram of the four differential expression methods in the B. cereus dataset.
Transcript abundances were identified as differential as in Figure 5. The overlap between the number of differentially expressed transcripts for each method is given in the individual cells of the diagram. The number of differentially regulated transcripts for each method is: ALDEx 1614, DESeq 1587, edgeR 1393, CuffDiff 1465. The diagram was prepared using the Venny web tool(Available: http://bioinfogp.cnb.csic.es/tools/venny. Accessed May 23, 2013) .
Figure 7
Figure 7. True and False positive identification in simulated data.
A set of eleven genes with simulated read counts between 1 and 1024 in two-fold increments were appended twice to a single sample of the B. cereus dataset. Two conditions were generated by multiplying the counts for a single set of simulated genes in each condition by the fold-difference values indicated in the True positive panel on the left, and two simulated technical replicates were generated for each condition by sampling from the Dirichlet distribution which accurately models technical variance in these datasets (Figure 1). The resulting four samples were examined by DESeq, edgeR and ALDEx for the ability of each method to identify the simulated differentially-expressed genes. The fold change varied between 1.1 and 10 and 100 simulations were run for each fold change. The fold change value is overlaid on the corresponding curve in the left panel. The line colors are black for edgeR, blue for DESeq and red for ALDEx, and the symbols are the same for each fold change value across each method. The ALDEx formula image cutoff of 1.5 is a solid and 2.0 is a dashed line. The right panel shows the per-gene false positive rate for each method at two cutoffs. False positive events in this model can only arise through outliers in the Dirichlet sampling procedure. The rate was calculated by dividing the number of false positive genes identified in each trial by the number of genes in the dataset (5358). The boxplot shows the range of false positive rates observed for each method across all trials and all expression levels. A rate of 0.0002 corresponds to approximately 1 false positive per trial in this dataset.
Figure 8
Figure 8. Comparison of three differential expression methods in the Meta dataset.
This dataset contains extreme transcript abundance variation within- and between-conditions. In this dataset the ALDEx method exhibits similar characteristics as in the B. cereus dataset, in that it identifies as differential those transcripts that exhibit high between-condition variation and low within-condition variation. This is illustrated by the left column that shows an MA-like plot and an MW plot. It is obvious that the high levels of variation in the Meta dataset is a poor fit to the negative binomial model used by both edgeR and DESeq. The edgeR package appeared to enforce a relatively high between-condition differential expression level regardless of the mean expression value. This leads to many poorly expressed transcripts being identified as differentially expressed. In contrast, the DESeq package identified as differentially expressed a small number of highly expressed genes. As before, transcripts with differential abundances are coloured red (and orange) as per the rules outlined in Figure 5.
Figure 9
Figure 9. The effect of organism abundance on transcript abundance in the Meta dataset.
In each panel, the dark colour indicates a gene that maps to that organism (or organism group) and the bright colour indicates a gene that is differentially expressed according to the rules enforced by ALDEx. The Meta dataset contains different mixtures of organisms in each sample as shown in Table 2. This leads to widely different distributions of transcript abundance in this dataset, which can be classified into three general patterns. Transcripts from organisms that are abundant (formula image) in three or four samples can exhibit both up and down regulation in the two conditions. For example, transcripts that were derived from L. iners are seen to be both up- and down-regulated in this dataset. Transcripts from organisms that are abundant in both samples of one condition (e.g. G. vaginalis), exhibit a change in one direction only, and show the full range of within-condition variation. In this case, many genes from the organism are expressed concordantly, and can be identified as differential. Transcripts from organisms that are abundant in only one sample of one condition, e.g. Megasphaera species, show a similar pattern as the previous one, but there is much more extreme variation within-conditions, and only those few genes that are expressed at extremely high levels can be reliably called as differentially expressed.

Similar articles

Cited by

References

    1. Roy NC, Altermann E, Park ZA, McNabb WC (2011) A comparison of analog and next-generation transcriptomic tools for mammalian studies. Brief Funct Genomics 10: 135–50. - PubMed
    1. Crawford JE, Guelbeogo WM, Sanou A, Traoré A, Vernick KD, et al. (2010) De novo transcriptome sequencing in Anopheles funestus using Illumina RNA-Seq technology. PLoS One 5: e14202. - PMC - PubMed
    1. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29: 644–52. - PMC - PubMed
    1. Trapnell C, Pachter L, Salzberg SL (2009) Tophat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105–11. - PMC - PubMed
    1. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, et al. (2010) Mapsplice: accurate mapping of RNA-Seq reads for splice junction discovery. Nucleic Acids Res 38: e178. - PMC - PubMed

Publication types

MeSH terms