BMC Bioinformatics. 2009 Dec 29;10:449.

doi: 10.1186/1471-2105-10-449.

A white-box approach to microarray probe response characterization: the BaFL pipeline

Kevin J Thompson¹, Hrishikesh Deshmukh, Jeffrey L Solka, Jennifer W Weller

PMID: 20040098
PMCID: PMC2804686
DOI: 10.1186/1471-2105-10-449

Kevin J Thompson et al. BMC Bioinformatics. 2009.

Affiliation

¹ Computer Science Dept, University of North Carolina at Charlotte, Charlotte, NC 28223, USA. kthom110@uncc.edu

Abstract

Background: Microarrays depend on appropriate probe design to deliver on the promise of accurate genome-wide measurement. Probe design, ideally, produces a unique probe-target match with homogeneous duplex stability over the complete set of probes. Much of microarray pre-processing is concerned with adjusting for non-ideal probes that do not report target concentration accurately. Cross-hybridizing (non-unique) probes, probe composition and structure, and platform effects such as instrument limitations have all been shown to affect the interpretation of signal. Data cleansing pipelines seldom filter specifically for these constraints, relying instead on general statistical tests to remove the most variable probes from the samples in a study. As a result, the probes contributing to ProbeSet (gene) values are adjusted in a study-specific manner. We refer to the complete set of factors as biologically applied filter levels (BaFL) and have assembled an analysis pipeline for managing them consistently. The pipeline and associated experiments reported here examine how comprehensively excluding probes affected by known factors changes the consistency of target behavior between experiments.

Results: We present here a 'white box' probe filtering and intensity transformation protocol that incorporates currently understood factors affecting probe and target interactions; the method has been tested on data from the Affymetrix human GeneChip HG-U95Av2, using two independent datasets from studies of a complex lung adenocarcinoma phenotype. The protocol incorporates probe-specific effects from SNPs, cross-hybridization and low heteroduplex affinity, as well as effects from scanner sensitivity and sample batches, and it includes simple statistical tests for identifying unresolved biological factors leading to sample variability. Subsequent to filtering for these factors, the consistency and reliability of the remaining measurements are shown to be markedly improved.
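The probe-level exclusions listed in the Results can be pictured as a sequence of simple predicates. The following is a minimal sketch, not the authors' implementation: the `Probe` fields, the function names, and the linear-range bounds (`lower`, `upper`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    probe_id: str
    intensity: float        # raw fluorescence units (hypothetical values)
    has_snp: bool           # probe overlaps a known SNP
    cross_hybridizes: bool  # probe matches more than one target
    low_affinity: bool      # probe-target duplex is too unstable


def passes_filters(p: Probe, lower: float = 200.0, upper: float = 20000.0) -> bool:
    """Return True if the probe survives every filter.

    The lower/upper bounds stand in for the scanner's linear range;
    the defaults are illustrative assumptions, not published values.
    """
    if p.has_snp or p.cross_hybridizes or p.low_affinity:
        return False
    return lower <= p.intensity <= upper


def filter_probes(probes, lower: float = 200.0, upper: float = 20000.0):
    """Keep only the probes that pass all filters."""
    return [p for p in probes if passes_filters(p, lower, upper)]


probes = [
    Probe("p1", 850.0, False, False, False),    # clean probe, in range
    Probe("p2", 900.0, True, False, False),     # excluded: SNP
    Probe("p3", 1200.0, False, True, False),    # excluded: cross-hybridizing
    Probe("p4", 150.0, False, False, False),    # excluded: below linear range
    Probe("p5", 30000.0, False, False, False),  # excluded: above linear range
]
kept = filter_probes(probes)
print([p.probe_id for p in kept])
```

Because each exclusion is an explicit, named condition, the reason any given probe was dropped can be traced, which is the sense in which such a filter is "white box".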

Conclusions: The data cleansing protocol yields reproducible estimates of a given probe or ProbeSet's (gene's) relative expression that translates across datasets, allowing for credible cross-experiment comparisons. We provide supporting evidence for the validity of removing several large classes of probes, and for our approaches for removing outlying samples. The resulting expression profiles demonstrate consistency across the two independent datasets. Finally, we demonstrate that, given an appropriate sampling pool, the method enhances the t-test's statistical power to discriminate significantly different means over sample classes.

Figures

**Figure 1**
**Background Estimation**. The intensity (y-axis) of the set of low target-affinity probes (ΔG < -3.6 kcal/mol) is shown over the complete set of samples in each experiment (x-axis). The median value approximates the scanner's lower limit specification: it is 189 *f.u.* for the Bhattacharjee experiment arrays and 204 *f.u.* for the Stearman experiment arrays. The Stearman dataset has one obvious outlier (the final sample), which was detected and subsequently removed through our sample cleansing routines.
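The background estimate described in this caption is a median over low-affinity probe intensities. A minimal sketch follows, assuming the intensities are pooled across all samples before the median is taken; the pooling choice, the function name, and the example values are assumptions for illustration only.

```python
from statistics import median


def estimate_background(low_affinity_intensities):
    """Median intensity of low-affinity probes, pooled over all samples.

    `low_affinity_intensities` maps a sample name to the list of
    intensities measured for its low-affinity probes.
    """
    pooled = [
        value
        for sample_values in low_affinity_intensities.values()
        for value in sample_values
    ]
    return median(pooled)


# Made-up intensities in fluorescence units, for illustration only.
example = {
    "sample_1": [180.0, 190.0, 200.0],
    "sample_2": [185.0, 189.0, 195.0],
}
background = estimate_background(example)
print(background)
```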

**Figure 2**
**Response patterns for cross-hybridizing probes**. The x-axis is the index of a probe within a ProbeSet; the y-axis is the mean intensity of a probe across the samples in a class. A1 and A2 show only cross-hybridizing probes for the two classes in each experiment. For a different gene, B1 and B2 show a complete set of probes, with filled circles indicating the cross-hybridizing probes. Cross-hybridizing probes are much less consistent in direction and extent of change between disease classes and across sample sets than the filtered probes (see Figure 5 for comparison).

**Figure 3**
**BaFL Sample Batch Analysis**. Graphical depiction of array/batch characteristics for samples in the Bhattacharjee experiment, with array-wide mean fluorescence and mean number of contributing probes, after exclusion of probes that lie outside the scanner detection linear range. The top pair (A1 and A2) shows the complete data set while B1 and B2 show the effect of removing batch 3 and outliers. The y-axis is the mean intensity per probe (A1 and B1) or the number of probes in the linear response range (A2 and B2) and the x-axis is number of samples; batches have been clustered together and are indicated by the numeral on the graph (there are no batches 2 or 9). The mean value across the samples is shown (solid line) and the 1st and 2nd standard deviations from the mean are shown (dotted lines). The numeral shown indicates to which batch the sample belongs (10 is X), the color indicates disease class (blue = Adenocarcinoma, red = Normal, purple = Small Cell Carcinoma, green = Pulmonary, and orange = Squamous). The red circle emphasizes the divergent behaviour of batch 3 in both tests. The same analysis for the Stearman data is given as Additional file 2 [19].
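The caption plots each sample's array-wide mean against the overall mean with one and two standard deviation bands. A minimal sketch of flagging samples that fall outside the two standard deviation band is shown below; using that band as the cutoff is an assumption made here for illustration, and the sample names and values are invented.

```python
from statistics import mean, stdev


def flag_divergent_samples(sample_means, n_sd=2.0):
    """Return names of samples whose array-wide mean intensity lies
    more than `n_sd` sample standard deviations from the overall mean.

    The 2 SD default mirrors the outer dotted band in the figure;
    treating it as an exclusion rule is an illustrative assumption.
    """
    values = list(sample_means.values())
    centre = mean(values)
    spread = stdev(values)
    return [
        name
        for name, value in sample_means.items()
        if abs(value - centre) > n_sd * spread
    ]


# Invented array-wide mean intensities; "s10" imitates a divergent batch.
sample_means = {
    "s1": 100.0, "s2": 102.0, "s3": 98.0, "s4": 101.0, "s5": 99.0,
    "s6": 100.0, "s7": 103.0, "s8": 97.0, "s9": 100.0, "s10": 160.0,
}
flagged = flag_divergent_samples(sample_means)
print(flagged)
```

The same function could be applied to the per-sample count of probes in the linear range, the second quantity shown in the figure.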

**Figure 4**
**Batch Summary of Cleansing Process**. Batch characteristics resulting from the affy package analysis. Boxplots (top row) and densities (bottom row) of the Bhattacharjee data: summary of batch intensities. The color of batch results is consistent in all graphs, as indicated in the key in the B panels. The y-axes are the log₂ of the intensities of the probes in a batch. Panels A1 and B1 depict the completely unfiltered data set, including all probes and Batch 3: note the obvious offset in Batch 3 and the strong skew to the resulting distributions. Panels A2 and B2 show the effects of removing Batch 3 and additional outlying samples, but include all probes: the skew remains significant but no batches are outliers. Panels A3 and B3 show the output after both sample cleansing and BaFL probe filtering: the distribution is more normal but the tails have been truncated. Note that the total density scale in panel B3 is reduced relative to B1 and B2 because fewer probes are included and less variance is observed. Similar output for the Stearman experiment is provided as Additional file 3 [19].

**Figure 5**
**Measurement Profile Consistency**. Measurement profile consistency for a ProbeSet across experiments. One example is given for each of the two ProbeSet response classification categories: significantly differentially expressed (DE) or not. Panel A1 is for a non-DE ProbeSet in the Bhattacharjee experimental results and A2 is for the same ProbeSet in the Stearman experiment. B1 and B2 are for a DE ProbeSet in each experiment. Mean intensity for the probe in the sample class (fluorescent units) is on the y-axis and probe index within the given ProbeSet is on the x-axis. Intensities are not on the same scale for the two experiments since the labelling was done independently; it is the patterns and relative intensities that are conserved.

**Figure 6**
**Cross-dataset Latent ProbeSet Structure**. Cross-dataset latent ProbeSet structure using BaFL-produced values. Two-dimensional projection calculated with the spectral method of Higgs et al., as derived from the LaFon method. Sample correlation values using differential expression (DE) gene responses were used as input. Each symbol represents a ProbeSet; both color and direction of arrow indicate change: up (gray)- or down (black)-regulation, or not significantly different (white). Panel A: the 940 ProbeSets that are DE in both experiments. These ProbeSets pass BaFL pipeline criteria and are categorized as DE according to both RMA and dCHIP output (but not always in the same direction). Dark gray upward-pointing arrows indicate up-regulated genes in the adenocarcinoma samples relative to normal samples. Black downward-pointing arrows indicate down-regulation in adenocarcinoma samples relative to normal samples. Open white circles in Panel A indicate ProbeSets that BaFL does not interpret as significantly differentially expressed, while the other methods do. Bottom graph: the subset from panel A of the 325 ProbeSets predicted to be significantly DE by BaFL. The tables from which the graphs are made are provided in the Supplementary Materials Web site for the article, Latent Structure [19].

**Figure 7**
**T-test Power Analysis**. The power (sorted on the y-axis) for the three probe cleansing methodologies based on t-tests of 4200 ProbeSet values per array (on the x-axis), using the same set of arrays in all cases. Panel A shows the power achieved by the Bhattacharjee adenocarcinoma sample stratification (plotted for RMA, dCHIP, and BaFL); panel B shows the same analysis for the normal samples. Panel C presents the calculated sample size required (on the y-axis) to achieve a power of 0.8 (at α = 0.05), presuming the ProbeSet variation adequately reflects the true variation from the Stearman measurements. Although all 4200 ProbeSets were used in the simulation, the x-axis is truncated to show the beginning of the rise coming off the baseline.
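The sample-size calculation in Panel C can be approximated in closed form. The sketch below uses the standard normal approximation for a two-sided, two-sample comparison of means with equal group sizes and a common standard deviation; it is not the simulation procedure used for the figure, and the function name and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate samples per group needed to detect a mean
    difference `delta` given a common standard deviation `sigma`.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2.
    It slightly underestimates the exact t-test requirement for small n.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Hypothetical ProbeSet: difference in means equal to one standard deviation.
n_needed = required_n_per_group(sigma=1.0, delta=1.0)
print(n_needed)
```

Under this approximation, reducing a ProbeSet's variance lowers the required sample size quadratically, which is one way to read a claim that cleaner probe sets improve the power of a t-test.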

Cited by

ArrayInitiative - a tool that simplifies creating custom Affymetrix CDFs.
Overall CC, Carr DA, Tabari ES, Thompson KJ, Weller JW. BMC Bioinformatics. 2011 May 6;12:136. doi: 10.1186/1471-2105-12-136. PMID: 21548938. Free PMC article.
The LO-BaFL method and ALS microarray expression analysis.
Baciu C, Thompson KJ, Mougeot JL, Brooks BR, Weller JW. BMC Bioinformatics. 2012 Sep 24;13:244. doi: 10.1186/1471-2105-13-244. PMID: 23006766. Free PMC article.
AnyExpress: integrated toolkit for analysis of cross-platform gene expression data using a fast interval matching algorithm.
Kim J, Patel K, Jung H, Kuo WP, Ohno-Machado L. BMC Bioinformatics. 2011 Mar 17;12:75. doi: 10.1186/1471-2105-12-75. PMID: 21410990. Free PMC article.

References

1. Barash Y, Dehan E, Krupsky M, Franklin W, Geraci M, Friedman N, Kaminski N. Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays. Bioinformatics. 2004;20(6):839–846. doi: 10.1093/bioinformatics/btg487.
2. Dudoit S, Fridlyand J. Introduction to Classification in Microarray Experiments. In: Rampal JB, editor. DNA Arrays Methods and Protocols. Vol. 170. Totowa, NJ: Humana Press; pp. 132–149.
3. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. The Analysis of Gene Expression Data. New York: Springer; 2003.
4. Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat Genet. 1999;21(1 Suppl):20–24. doi: 10.1038/4447.
5. Southern EM. DNA microarrays. History and overview. Methods Mol Biol. 2001;170:1–15.
