. 2013 Jul 2;8(7):e67019.

doi: 10.1371/journal.pone.0067019. Print 2013.

ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq

Andrew D Fernandes¹, Jean M Macklaim, Thomas G Linn, Gregor Reid, Gregory B Gloor

Affiliations

PMID: 23843979
PMCID: PMC3699591
DOI: 10.1371/journal.pone.0067019

ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq

Andrew D Fernandes et al. PLoS One. 2013.

. 2013 Jul 2;8(7):e67019.

doi: 10.1371/journal.pone.0067019. Print 2013.

Authors

Andrew D Fernandes¹, Jean M Macklaim, Thomas G Linn, Gregor Reid, Gregory B Gloor

Affiliation

¹ YouKaryote Genomics, London, Ontario, Canada.

PMID: 23843979
PMCID: PMC3699591
DOI: 10.1371/journal.pone.0067019

Abstract

Experimental variance is a major challenge when dealing with high-throughput sequencing data. This variance has several sources: sampling replication, technical replication, variability within biological conditions, and variability between biological conditions. The high per-sample cost of RNA-Seq often precludes the large number of experiments needed to partition observed variance into these categories as per standard ANOVA models. We show that the partitioning of within-condition to between-condition variation cannot reasonably be ignored, whether in single-organism RNA-Seq or in Meta-RNA-Seq experiments, and further find that commonly-used RNA-Seq analysis tools, as described in the literature, do not enforce the constraint that the sum of relative expression levels must be one, and thus report expression levels that are systematically distorted. These two factors lead to misleading inferences if not properly accommodated. As it is usually only the biological between-condition and within-condition differences that are of interest, we developed ALDEx, an ANOVA-like differential expression procedure, to identify genes with greater between- to within-condition differences. We show that the presence of differential expression and the magnitude of these comparative differences can be reasonably estimated with even very small sample sizes.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have the following interests: coauthor Andrew D. Fernandes is employed by YouKaryote Genomics. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Figures

**Figure 1. Dirichlet-distributed proportions accurately account for the sampling variance.**
This plot overlays the expected range between the 1–99% quantiles for with observed range of computed for the Liver library replicates in the L–K dataset. Marioni et al minimized technical error with an experimental design where the same Illumina library was run in two separate lanes. Monte Carlo Estimates of are shown in red with the density of the values shown in orange, while 1–99% expected quantile ranges from the Dirichlet are shown in black. This demonstrates that the error inherent in high-throughput sequencing is greatest when the counts are small and least when the counts are large. The near-perfect overlay of actual and modelled values strongly support idea that modelling proportions through a Dirichlet-multinomial process accurately accounts for the sampling variance inherent in RNA-Seq, and by extension in other high-throughput sequencing analyses. The error in estimating the expected quantiles is observable by the size of the points plotted in black and becomes small when expression is non-trivial. Values on the -axis were calculated with the given formula for and were adjusted to remove the non-informative subspace as outlined in the text. Thus the -axis value of zero corresponds to the expected per-gene log-expression value.

formula image — **Figure 1. Dirichlet-distributed proportions accurately account for the sampling variance.**
This plot overlays the expected range between the 1–99% quantiles for with observed range of computed for the Liver library replicates in the L–K dataset. Marioni et al minimized technical error with an experimental design where the same Illumina library was run in two separate lanes. Monte Carlo Estimates of are shown in red with the density of the values shown in orange, while 1–99% expected quantile ranges from the Dirichlet are shown in black. This demonstrates that the error inherent in high-throughput sequencing is greatest when the counts are small and least when the counts are large. The near-perfect overlay of actual and modelled values strongly support idea that modelling proportions through a Dirichlet-multinomial process accurately accounts for the sampling variance inherent in RNA-Seq, and by extension in other high-throughput sequencing analyses. The error in estimating the expected quantiles is observable by the size of the points plotted in black and becomes small when expression is non-trivial. Values on the -axis were calculated with the given formula for and were adjusted to remove the non-informative subspace as outlined in the text. Thus the -axis value of zero corresponds to the expected per-gene log-expression value.

**Figure 2. Sampling and technical variance is distinct from within-condition variance.**
A comparison of within-condition gene expression-difference to the median expression level for three different experiments. The left panel shows the sampling variance for comparison and the three experiments are shown in subsequent panels. The L-K RNA-Seq data set compares gene expression in two liver and two kidney samples. The *Bacillus cereus* RNA-Seq data set for samples grown at neutral pH and two samples 20 minutes after shift to grown at low pH. Meta-RNA-Seq data is for microbial gene expression analysis of four clinical vaginal samples from two women with a healthy microbiota and two women with a microbiota indicative of bacterial vaginosis. The RNA-Seq experiments in the L-K and *B. cereus* datasets were from controlled conditions with identical gene content per condition and show that the vast majority of highly-expressed genes have small within-condition estimates of , and that estimate only becomes imprecise as becomes very small. The Meta-RNA-Seq panel shows that when within-condition variance is high, there is no relationship between the expression level and the within-condition variance. Note that base-2 logarithms were used throughout.

**Figure 3. Approximate ANOVA via Absolute and Relative fold differences.**
The figure shows how the method explicitly accounts for the within-condition dispersion using as an example two genes with similar absolute fold differences () of -1.17 and -1.13 but very different relative fold differences () and values in the L-K dataset. Dirichlet sampled distributions are generated from the raw read counts as described in the text. These distributions are log-transformed and the noninformative subspace is removed. Posterior distributions of are shown for and for replicates . Both genes are abundantly expressed in this dataset, with median expression levels between and greater than the mean across-gene expression level. is computed by randomly sampling one of the red distributions, randomly sampling one of the blue distributions, and subtracting for all pairs of between-condition distributions. is computed by sampling a light and a dark from each red and blue distributions, subtracting light and dark, and selecting the difference with the greatest magnitude. is computed as the ratio of a single realization of and , and is computed for each realization of and . The distribution is narrower in A than in B implying a greater precision in estimating this value. This precision can be estimated by , the quantile of zero in , which is shown graphically as the black-filled area under the distribution curves. The values are 0.0001 and 0.035 in panels A and B respectively. The vertical arrows show the median values of and . Thus, the between-condition expression values for the gene in Panel A are scored as separable by ALDEx () but not for the gene in Panel B (). These conclusions agree with inspection-based intuition from examining the initial adjusted log-expression distributions that are shown in the left panel.

**Figure 4. Fold-change to variance (MW) plots of different ALDEx cutoff values.**
Plotted here are (i.e., the -fold expression changes) vs. (i.e., the maximum within-condition expression differences) for the L-K and Meta-RNA-Seq datasets. The grey background shows the density plot of the values. By construction, threshold values of should select genes for which the between-condition variation is reasonably likely to be at least twice the within-condition variation. This threshold is illustrated by the solid grey line, and the within/between-condition equivalence line is shown as the dashed grey line. The effect of altering the and cutoffs is shown. Large values identify genes with a larger between-condition differential expression than within-condition variation. The values were (blue), (cyan), (yellow) and (red). There are a number of points where the values exceed the chosen threshold in both the L-K and Meta datasets. The column shows the effect of identifying genes based on the estimated proportion of . These are plotted for values of 0.1, 0.05, 0.01 and 0.001 in the two datasets, again coloured as blue, cyan, yellow or red respectively. Here we observed that the 0.01 and 0.001 cutoffs were very similar and the larger cutoffs admitted many more genes with high within-condition variation. The third panel shows the effect of enforcing the along with each of the cutoff values. In this case we observe that a combined with an is sufficient to ensure that genes identified as differentially expressed always have lower than values in both datasets. Using both values in combination allows arbitrarily small fold expression differences to be identified if they are supported by high within-condition concordance.

**Figure 5. Comparison of four differential expression methods in the *B. cereus* dataset.**
Transcript abundances identified as differential by the first three methods are highlighted in red on a background density plot. Default false discovery rates for each program were used, 0.05 for CuffDiff and 0.1 for both edgeR and DESeq, since these reflect the configurations in which most users will use these programs. In the case of ALDEx transcripts with are highlighted in red and orange for and . Transcripts originating from genes contained on the plasmid that is found in one sample from each condition are circled. The top row shows typical Bland-Altman style (MA) plots where the median absolute fold change () is plotted vs. the mean expression value (Expression). The mean expression value on x-axis is 0 for the reasons outlined in the text. Notice that the edgeR method identifies differentially-expressed transcripts with much lower abundances than the other three methods. The plasmid-encoded genes are not differentiated on the Bland-Altman-style plots. The bottom row shows an MW plot of the median absolute fold change between-conditions () vs. the maximum within-condition difference () of the same data. Here it is clear that transcripts originating from the plasmid-encoded genes exhibit very large values. Interestingly, there are a number of chromosomally-encoded genes in this dataset, and in the other two (see previous figure) that also show large values, demonstrating that within-condition variation can be problematic even for samples derived from well-controlled conditions. Both CuffDiff and edgeR identified as differentially expressed a significant fraction of the plasmid-derived transcripts.

**Figure 6. Venn diagram of the four differential expression methods in the *B. cereus* dataset.**
Transcript abundances were identified as differential as in Figure 5. The overlap between the number of differentially expressed transcripts for each method is given in the individual cells of the diagram. The number of differentially regulated transcripts for each method is: ALDEx 1614, DESeq 1587, edgeR 1393, CuffDiff 1465. The diagram was prepared using the Venny web tool(Available: http://bioinfogp.cnb.csic.es/tools/venny. Accessed May 23, 2013) .

**Figure 7. True and False positive identification in simulated data.**
A set of eleven genes with simulated read counts between 1 and 1024 in two-fold increments were appended twice to a single sample of the *B. cereus* dataset. Two conditions were generated by multiplying the counts for a single set of simulated genes in each condition by the fold-difference values indicated in the True positive panel on the left, and two simulated technical replicates were generated for each condition by sampling from the Dirichlet distribution which accurately models technical variance in these datasets (Figure 1). The resulting four samples were examined by DESeq, edgeR and ALDEx for the ability of each method to identify the simulated differentially-expressed genes. The fold change varied between 1.1 and 10 and 100 simulations were run for each fold change. The fold change value is overlaid on the corresponding curve in the left panel. The line colors are black for edgeR, blue for DESeq and red for ALDEx, and the symbols are the same for each fold change value across each method. The ALDEx cutoff of 1.5 is a solid and 2.0 is a dashed line. The right panel shows the per-gene false positive rate for each method at two cutoffs. False positive events in this model can only arise through outliers in the Dirichlet sampling procedure. The rate was calculated by dividing the number of false positive genes identified in each trial by the number of genes in the dataset (5358). The boxplot shows the range of false positive rates observed for each method across all trials and all expression levels. A rate of 0.0002 corresponds to approximately 1 false positive per trial in this dataset.

**Figure 8. Comparison of three differential expression methods in the Meta dataset.**
This dataset contains extreme transcript abundance variation within- and between-conditions. In this dataset the ALDEx method exhibits similar characteristics as in the *B. cereus* dataset, in that it identifies as differential those transcripts that exhibit high between-condition variation and low within-condition variation. This is illustrated by the left column that shows an MA-like plot and an MW plot. It is obvious that the high levels of variation in the Meta dataset is a poor fit to the negative binomial model used by both edgeR and DESeq. The edgeR package appeared to enforce a relatively high between-condition differential expression level regardless of the mean expression value. This leads to many poorly expressed transcripts being identified as differentially expressed. In contrast, the DESeq package identified as differentially expressed a small number of highly expressed genes. As before, transcripts with differential abundances are coloured red (and orange) as per the rules outlined in Figure 5.

**Figure 9. The effect of organism abundance on transcript abundance in the Meta dataset.**
In each panel, the dark colour indicates a gene that maps to that organism (or organism group) and the bright colour indicates a gene that is differentially expressed according to the rules enforced by ALDEx. The Meta dataset contains different mixtures of organisms in each sample as shown in Table 2. This leads to widely different distributions of transcript abundance in this dataset, which can be classified into three general patterns. Transcripts from organisms that are abundant () in three or four samples can exhibit both up and down regulation in the two conditions. For example, transcripts that were derived from *L. iners* are seen to be both up- and down-regulated in this dataset. Transcripts from organisms that are abundant in both samples of one condition (e.g. *G. vaginalis*), exhibit a change in one direction only, and show the full range of within-condition variation. In this case, many genes from the organism are expressed concordantly, and can be identified as differential. Transcripts from organisms that are abundant in only one sample of one condition, e.g. *Megasphaera* species, show a similar pattern as the previous one, but there is much more extreme variation within-conditions, and only those few genes that are expressed at extremely high levels can be reliably called as differentially expressed.

See this image and copyright information in PMC

References

1. Roy NC, Altermann E, Park ZA, McNabb WC (2011) A comparison of analog and next-generation transcriptomic tools for mammalian studies. Brief Funct Genomics 10: 135–50. - PubMed
1. Crawford JE, Guelbeogo WM, Sanou A, Traoré A, Vernick KD, et al. (2010) De novo transcriptome sequencing in Anopheles funestus using Illumina RNA-Seq technology. PLoS One 5: e14202. - PMC - PubMed
1. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29: 644–52. - PMC - PubMed
1. Trapnell C, Pachter L, Salzberg SL (2009) Tophat: discovering splice junctions with RNA-Seq. Bioinformatics 25: 1105–11. - PMC - PubMed
1. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, et al. (2010) Mapsplice: accurate mapping of RNA-Seq reads for splice junction discovery. Nucleic Acids Res 38: e178. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions
Actions

Grants and funding

Canadian Institutes of Health Research/Canada

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations
Research Materials
- NCI CPTC Antibody Characterization Program

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq

Affiliation

ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq

Authors

Affiliation

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Substances

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Research Materials