Mol Ecol. 2010 Mar;19 Suppl 1(Suppl 1):212-27.
doi: 10.1111/j.1365-294X.2010.04472.x.

Key considerations for measuring allelic expression on a genomic scale using high-throughput sequencing

Pierre Fontanillas et al. Mol Ecol. 2010 Mar.

Abstract

Differences in gene expression are thought to be an important source of phenotypic diversity, so dissecting the genetic components of natural variation in gene expression is important for understanding the evolutionary mechanisms that lead to adaptation. Gene expression is a complex trait that, in diploid organisms, results from transcription of both maternal and paternal alleles. Directly measuring allelic expression rather than total gene expression offers greater insight into regulatory variation. The recent emergence of high-throughput sequencing offers an unprecedented opportunity to study allelic transcription at a genomic scale for virtually any species. By sequencing transcript pools derived from heterozygous individuals, estimates of allelic expression can be directly obtained. The statistical power of this approach is influenced by the number of transcripts sequenced and the ability to unambiguously assign individual sequence fragments to specific alleles on the basis of transcribed nucleotide polymorphisms. Here, using mathematical modelling and computer simulations, we determine the minimum sequencing depth required to accurately measure relative allelic expression and detect allelic imbalance via high-throughput sequencing under a variety of conditions. We conclude that, within a species, a minimum of 500-1000 sequencing reads per gene is needed to test for allelic imbalance, and consequently, at least 5 to 10 million reads are required for studying a genome expressing 10 000 genes. Finally, using 454 sequencing, we illustrate an application of allelic expression by testing for cis-regulatory divergence between closely related Drosophila species.
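The genome-wide figure quoted in the abstract follows from multiplying the per-gene requirement by the number of expressed genes. The sketch below reproduces only that arithmetic. The function name `total_reads_needed` is hypothetical, and the calculation assumes reads are spread evenly across genes, which real transcript abundance distributions do not satisfy, so the result is best read as a lower bound.

```python
def total_reads_needed(reads_per_gene: int, expressed_genes: int) -> int:
    """Naive lower bound: every gene receives exactly the same number of reads."""
    return reads_per_gene * expressed_genes


# Per-gene range and gene count taken from the abstract.
low = total_reads_needed(500, 10_000)
high = total_reads_needed(1_000, 10_000)
print(f"{low:,} to {high:,} reads")
```

Because expression levels are typically skewed, weakly expressed genes would need a deeper run than this even-coverage estimate suggests.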

Conflict of interest statement

The authors have no conflict of interest to declare and note that the funders of this research had no role in the study design, data collection and analysis, decision to publish or preparation of the manuscript.

Figures

Fig. 1
High-throughput sequencing technology allows measurement of relative allelic expression genome wide. The schematic representation illustrates the steps required to collect allelic expression data. Key parameters associated with each step that ultimately affect the statistical power for detecting significant allelic imbalance (AI) are also shown.
Fig. 2
The expected proportion of informative reads increases with genetic divergence and read length. (A, B) Black lines show expected proportions of informative reads (i.e. sequence fragments that could be unambiguously assigned to one allele) predicted by eqn (3) for transcribed sequences containing 0.1, 0.5, 1 or 5% sequence divergence, as indicated. Predictions are shown in which either one single nucleotide polymorphism (SNP) (A) or two SNPs (B) were required for a sequencing read to be informative for measuring allelic expression. (C, D) Predictions based on 0.1% and 1% sequence divergence and requiring only one SNP to be informative are shown again, as they were in (A). Results from simulated data sets are also shown. Each simulation contained either 20 (C) or 200 (D) reads that were generated using a virtual 2000 bp mRNA sequence, 0.1% or 1% sequence divergence, and sequencing reads of 35, 150, 300 and 800 bp. Each scenario was simulated 500 times, and is summarized by boxplots showing the median, the lower and upper quartiles, and 1.5 times the interquartile range. The gray lines are the 95% confidence intervals of the expected proportions based on binomial sampling (Clopper-Pearson interval on eqn (1), Clopper & Pearson 1934).
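The trend described in this caption can be illustrated with a deliberately simplified model. The sketch below is not the paper's eqn (3), which is not reproduced on this page. It assumes that every site differs between alleles independently with probability equal to the sequence divergence, that a read is informative if it covers at least one such site, and that edge effects at transcript ends can be ignored. The function name `informative_fraction` is hypothetical.

```python
def informative_fraction(divergence: float, read_length: int) -> float:
    """P(read covers >= 1 SNP) under an independent-site assumption."""
    return 1.0 - (1.0 - divergence) ** read_length


# Divergence levels and read lengths mentioned in the caption.
for divergence in (0.001, 0.005, 0.01, 0.05):
    row = [informative_fraction(divergence, length) for length in (35, 150, 300, 800)]
    print(f"{divergence:.1%}: " + ", ".join(f"{value:.3f}" for value in row))
```

Under these assumptions the fraction rises with both divergence and read length, which matches the qualitative pattern in panels (A) and (C, D). Requiring two SNPs per read, as in panel (B), would need a different expression.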
Fig. 3
Predicted proportions of genes with more than 200 informative reads for a given sequencing depth are consistent with simulated data. Predicted values (lines) were obtained using eqn (14), assuming a mean read length of 150 bp and sequence divergence of 0.1%, 0.5%, 1% and 5%, as indicated. Simulated data (points) used distributions of transcript abundance, read length and sequence divergence, as shown in the insets. Two replicate simulations were performed and found to be highly correlated with each other (Spearman’s rho > 0.99).
Fig. 4
Detecting significant allelic imbalance (AI) for genes with small differences in allelic expression requires a large number of informative reads per gene. Statistical power for detecting significant AI at a type I error rate of α = 5% under different conditions is shown. Each line shows the power to detect significant AI, assuming that the true value of AI is 1, 1.25, 1.5, 2, 7, 10 or 100.
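A power curve of this kind can be approximated with an exact two-sided binomial test. The sketch below is an illustration under stated assumptions, not the authors' procedure: it treats AI as the ratio of reads from one allele to reads from the other, so the expected read proportion is AI / (1 + AI), and it uses a symmetric rejection region under the null hypothesis of equal expression. The names `binom_pmf` and `power_to_detect_ai` are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed in log space to avoid overflow."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1.0 - p))


def power_to_detect_ai(ai_ratio: float, n_reads: int, alpha: float = 0.05) -> float:
    """Power of a symmetric two-sided exact binomial test of H0: p = 0.5."""
    p_true = ai_ratio / (1.0 + ai_ratio)  # assumed meaning of the AI value
    # Largest c such that rejecting when k <= c or k >= n - c keeps size <= alpha.
    cumulative, c = 0.0, -1
    for k in range(n_reads // 2 + 1):
        cumulative += binom_pmf(k, n_reads, 0.5)
        if 2.0 * cumulative > alpha:
            break
        c = k
    if c < 0:
        return 0.0  # too few reads to ever reject at this alpha
    rejection = set(range(0, c + 1)) | set(range(n_reads - c, n_reads + 1))
    return sum(binom_pmf(k, n_reads, p_true) for k in rejection)


for reads in (50, 200, 500, 1000):
    print(reads, round(power_to_detect_ai(1.25, reads), 3))
```

With AI set to 1 the function returns the actual size of the test, which stays at or below α because the binomial distribution is discrete.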
Fig. 5
Read sampling strategy affects the proportion of informative reads per gene and thus the number of genes for which significant allelic imbalance (AI) can be detected. (A, B) Simulated proportions of genes with more than 200 informative reads using a random or a targeted read sampling strategy are shown for mean read lengths of 35 bp (A) and 150 bp (B), with individual reads sampled from a Poisson distribution. See Fig. 2 for a more detailed description of the simulation parameters. (C) The proportions of informative reads per gene using random (left), targeted (middle) and mixed (right) sampling strategies are shown. Each beanplot represents the distribution (500 replicates) of the proportion of informative reads among 500 sampled reads. The horizontal bar shown on each beanplot indicates the mean of these distributions. For the mixed strategy, fragments with sequence lengths drawn from a Poisson distribution with a mean of 500 bp were anchored to a fixed, predetermined location (the 3′ end), and sequences of either 18 or 75 bp were taken from each end to simulate paired-end sequencing.
Fig. 6
Analysis of allelic expression and allelic imbalance (AI) in Drosophila F1 hybrids. (A) The distribution of sequence fragment lengths for both informative and uninformative reads is shown. (B) The number of genes in different gene expression level classes (as measured by the abundance of informative reads) is shown along with the number of genes in each class that showed significant AI. (C) The top panel shows the proportion of genes with significant AI (see Table S1, Supporting information) for which the D. melanogaster allele is most abundant. The bottom panel shows the proportion of informative reads in a given expression level class that were assigned to D. melanogaster. In both panels, the dotted line corresponds to a balanced proportion (50%). (D) The relationship between relative allelic expression as measured by 454 sequencing and by pyrosequencing is shown. For pyrosequencing, the average of eight replicates is plotted and the 95% confidence intervals are indicated by the horizontal bars. For 454 sequencing, the relative number of informative reads is shown, with vertical bars indicating the Clopper-Pearson 95% confidence intervals derived from binomial sampling (see Supplementary Fig. S3). The dotted line indicates the slope of the nonparametric regression.
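The Clopper-Pearson interval mentioned in panel (D) is the exact binomial confidence interval for a proportion, here the share of informative reads assigned to one allele. The sketch below computes it by bisection on the binomial distribution function using only the standard library. It is one possible implementation rather than the one used in the paper, the function names are hypothetical, and the read counts in the usage line are made-up illustration values.

```python
from math import exp, lgamma, log


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(k + 1):
        log_coef = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        total += exp(log_coef + i * log(p) + (n - i) * log(1.0 - p))
    return min(total, 1.0)


def clopper_pearson(k: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Exact two-sided interval for a binomial proportion k / n."""
    alpha = 1.0 - conf

    def solve(k_cut: int, target: float) -> float:
        # binom_cdf decreases as p grows, so bisect on p in (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if binom_cdf(k_cut, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else solve(k - 1, 1.0 - alpha / 2.0)
    # Upper bound solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(k, alpha / 2.0)
    return lower, upper


# Hypothetical gene: 60 of 100 informative reads assigned to one allele.
print(clopper_pearson(60, 100))
```

An interval that excludes 0.5 would correspond to rejecting balanced allelic expression with the matching exact test at the same confidence level.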
