BMC Bioinformatics. 2009 Aug 4;10:237.

doi: 10.1186/1471-2105-10-237.

Resolving deconvolution ambiguity in gene alternative splicing

Yiyuan She¹, Earl Hubbell, Hui Wang

PMID: 19653895
PMCID: PMC2739860
DOI: 10.1186/1471-2105-10-237

Yiyuan She et al. BMC Bioinformatics. 2009.

Affiliation

¹ Affymetrix Inc, Santa Clara, CA 95051, USA. yshe@stat.fsu.edu

Abstract

Background: For many gene structures it is impossible to resolve intensity data uniquely to establish the abundances of splice variants. This was noted empirically by Wang et al., who called it a "degeneracy problem". The ambiguity arises because splice variant deconvolution is an ill-posed problem: additional information is needed to obtain a unique answer.

Results: In this paper, we analyze the situations under which the problem occurs and perform a rigorous mathematical study which gives necessary and sufficient conditions on how many and what type of constraints are needed to resolve all ambiguity. This analysis is generally applicable to matrix models of splice variants. We explore the proposal that probe sequence information may provide sufficient additional constraints to resolve real-world instances. However, probe behavior cannot be predicted with sufficient accuracy by any existing probe sequence model, and so we present a Bayesian framework for estimating variant abundances by incorporating the prediction uncertainty from the micro-model of probe responsiveness into the macro-model of probe intensities.

Conclusion: The matrix analysis of constraints provides a tool for detecting real-world instances in which additional constraints may be necessary to resolve splice variants. While purely mathematical constraints can be stated without error, real-world constraints may themselves be poorly resolved. Our Bayesian framework provides a generic solution to the problem of uniquely estimating transcript abundances given additional constraints that may themselves be uncertain, such as regression fits to probe sequence models. We demonstrate its efficacy through extensive simulations and on various biological data sets.
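The ambiguity described in the Background can be illustrated numerically. The following is a minimal sketch, not the authors' model: it assumes a simplified noiseless intensity model in which probe i in experiment j has intensity a_i * (G T)_ij, where `a` is an unknown per-probe responsiveness, `G` is a probe-by-variant incidence matrix, and `T` holds variant concentrations. The "subset" layout, the probe counts, and the helper names `intensities` and `rescaled_solution` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "subset" structure: exon 1 is in both variants, exon 2 only in variant 2.
# Rows are probes (3 on exon 1, then 3 on exon 2); columns are variants.
G = np.array([[1, 1]] * 3 + [[0, 1]] * 3, dtype=float)

a = rng.uniform(0.5, 2.0, size=6)       # per-probe responsiveness (unknown in practice)
T = rng.uniform(1.0, 5.0, size=(2, 4))  # variant concentrations in 4 experiments


def intensities(a, T):
    """Noiseless intensities under the assumed model y_ij = a_i * (G T)_ij."""
    return a[:, None] * (G @ T)


def rescaled_solution(a, T, c):
    """Build a different (a, T) pair that yields identical intensities, for c >= 1."""
    a2 = a.copy()
    a2[3:] *= c                        # exon-2 probes become c times more responsive
    T2 = T.copy()
    T2[1] = T[1] / c                   # variant 2 becomes c times less abundant
    T2[0] = T[0] + T[1] * (1 - 1 / c)  # variant 1 absorbs the difference on exon 1
    return a2, T2


Y = intensities(a, T)
a2, T2 = rescaled_solution(a, T, c=2.0)
Y2 = intensities(a2, T2)

print("same intensities:", np.allclose(Y, Y2))
print("same concentrations:", np.allclose(T, T2))
```

Under these assumptions every c >= 1 gives a non-negative solution with exactly the same fit, so intensity data alone cannot distinguish the candidates. An extra constraint that fixes the relative responsiveness of the two probe groups would remove this one-parameter freedom in the sketch.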

Figures

**Figure 1**
**Subset gene structure and the corresponding estimated concentrations**. (A) shows a 2-variant "subset" gene structure, where the genomic composition of variant 1 is a subset of variant 2. (B) shows the estimated concentrations of the two variants for 20 different initial values using Wang *et al*.'s method [36]. Each line indicates the concentrations estimated from one set of initial values; all are globally optimal solutions that give the same RSS. See [36] for details.

**Figure 2**
**A simple two-variant gene structure and its estimated concentrations using Wang et al.'s method** [36]. (A) shows a simple two-variant gene structure with three exons. The two variants have one common exon and each variant contains one unique exon. (B) shows the estimated concentrations with 20 different initial values. The solution is unique for the two variants.

**Figure 3**
**Predicted T on simulation data without and with group constraints**. In the upper panel, we plotted the estimated concentrations of the three transcripts generated by Wang *et al*.'s deconvolution procedure without group constraints [36] (denoted by the circles). Note that every solution achieves a global minimum of the log-likelihood function, but they are quite different and none of them approximates the true T (denoted by the triangles). The lower panel shows the *unique* estimate using (four) group constraints; it is very close to the true T, even though the noise is large and there is some unknown background signal.

**Figure 4**
**The concentration differences between variant 1 and variant 2 for MAPT (HG-SV data): predicted vs. true**. The predicted concentration differences and the true differences are marked by red crosses and green circles, respectively. For comparison, the estimates from the SPACE algorithm [39] are also plotted as gray dots.

**Figure 5**
**Across-experiment ratios (on HG-LS data)**. The estimated values are denoted by crosses, while the true concentration ratios are denoted by circles. For each of the seven variants, we compare the across-experiment quantities (defined in (a) in the subsection of *HG-LS Data*) between the estimated and the true T.

**Figure 6**
**Across-variant ratios (on HG-LS data)**. The estimated values are denoted by crosses, while the true concentration ratios are denoted by circles. For each of the nine experiments we compare the across-variant quantities (defined in (b) in the subsection of *HG-LS Data*) between the estimated and the true T.

**Figure 7**
**Across-variant-and-experiment ratios (on HG-LS data)**. The estimated values are denoted by crosses, while the true values are denoted by circles. We compare the diagonal and antidiagonal (defined in (c) in the subsection of *HG-LS Data*) between the estimated and the true T.

**Figure 8**
**A summary of all ratio comparisons (on HG-LS data)**. The x axis represents the true ratios of all three types (defined in the subsection of *HG-LS Data*); the y axis represents the results from our method denoted by red crosses, and from the SPACE algorithm by Anton *et al*. denoted by green triangles. The black dotted line is the identity line.

**Figure 9**
**Matrix representation of Wang et al.'s model** [36].

**Figure 10**
**Procedure outline for estimating transcript concentrations**. The probe sequence model is trained in a supervised manner on data for which T is available. The (grouped) probe responsiveness is then predicted on the new dataset to help recover all concentrations via the probe intensity model.

**Figure 11**
**Model training: supervised vs. unsupervised**. The figure compares several response estimates. The first two are both trained *without supervision*, with A and T unknown, from the PDNN model [43] and the probe selection model (PSM) [44], respectively. The WAM estimate is obtained by estimating A (with T known) from our model (7); a robust median estimate (the median of the ratios of Y to ***FGT***) is also plotted. Note that the first two deviate from the last two, which are trained in a supervised manner.

**Figure 12**
**Probe responsiveness curves**. Four position functions are shown to reflect the nucleotides' difference in responsiveness. The baseline is 'T' at each of the 25 positions of a probe. The three curves show the change in probe responsiveness when 'T' is replaced by 'A', 'C', or 'G', respectively. These functions are estimated using smoothing splines.

**Figure 13**
**Trained probe responsiveness on HG-LS data**. We present 50 probes chosen at random. The true probe responsiveness is shown as solid lines and the values fitted with supervised training as dashed lines. For most probes the fitted values deviate substantially from the true values, which indicates a poor fit of the probe responsiveness model. Most papers display similar quantities on a log scale, but for the error in T to be small, A must be well predicted on the original scale as well.

**Figure 14**
**A Bayesian framework for building transcript concentration estimation model**. The sequence based model is trained on the training data in a supervised way. Then for the validation data, group-constraints are constructed from the predicted probe responsiveness to remove deconvolution ambiguity as discussed earlier. 'Fuzzy' constraints can be considered in fitting a probe intensity model with standard errors included (see (11)). During the optimization, the exact group-constraints with no standard errors serve as an initial estimate.

References

1. Johnson J, Castle J, Garrett-Engele P, Kan Z, Loerch P, Armour C, Santos R, Schadt E, Stoughton R, Shoemaker D. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science. 2003;302:2141–2144. doi: 10.1126/science.1090100.
2. Coschigano K, Wensink P. Sex-specific transcriptional regulation by the male and female doublesex proteins of Drosophila. Genes Dev. 1993;7:42–45. doi: 10.1101/gad.7.1.42.
3. Jiang Z, Wu J. Alternative splicing and programmed cell death. Proceedings of the Society for Experimental Biology and Medicine. 1999;220:64–72. doi: 10.1046/j.1525-1373.1999.d01-11.x.
4. Black D. Protein Diversity from Alternative Splicing: A Challenge for Bioinformatics and Post-Genome Biology. Cell. 2000;103:367–370. doi: 10.1016/S0092-8674(00)00128-8.
5. Breitbart R, Andreadis A, Nadal-Ginard B. Alternative Splicing: a Ubiquitous Mechanism for the Generation of Multiple Protein Isoforms from Single Genes. Annual Review of Biochemistry. 1987;56:467–495. doi: 10.1146/annurev.bi.56.070187.002343.
