Comparative Study
BMC Bioinformatics. 2006 Mar 2:7:105.
doi: 10.1186/1471-2105-7-105.

The effect of oligonucleotide microarray data pre-processing on the analysis of patient-cohort studies

Roel G W Verhaak et al. BMC Bioinformatics. 2006.

Abstract

Background: Intensity values measured by Affymetrix microarrays have to be both normalized, which removes non-biological variation so that different microarrays can be compared, and summarized, which generates the final probe set expression values. Various pre-processing techniques, such as dChip, GCRMA, RMA and MAS, have been developed for this purpose. This study assesses the effect of applying different pre-processing methods on the results of analyses of large Affymetrix datasets. By focusing on practical applications of microarray-based research, this study provides insight into the relevance of pre-processing procedures to biology-oriented researchers.
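
As an illustration of what the normalization step can look like, the following is a minimal sketch of quantile normalization, the normalization used within RMA. It is not the pipeline used in this study: the function name `quantile_normalize` and the probes-by-arrays matrix layout are assumptions made for the example, and ties are not treated specially.

```python
import numpy as np

def quantile_normalize(intensities):
    """Force every array (column) to share the same intensity distribution.

    `intensities` is assumed to be a probes x arrays matrix (hypothetical layout).
    """
    x = np.asarray(intensities, dtype=float)
    # Sort each column independently and remember where each value came from.
    order = np.argsort(x, axis=0)
    sorted_x = np.take_along_axis(x, order, axis=0)
    # The reference distribution is the mean of the k-th smallest values.
    mean_quantiles = sorted_x.mean(axis=1)
    # Write the reference values back at the original positions.
    out = np.empty_like(x)
    np.put_along_axis(out, order, mean_quantiles[:, None], axis=0)
    return out
```

After this step every column contains the same set of values, so remaining differences between arrays reflect only the rank of each probe within its array.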

Results: Using two publicly available datasets, gene-expression data of 285 patients with acute myeloid leukemia (AML, Affymetrix HG-U133A GeneChip) and 42 samples of tumor tissue of the embryonal central nervous system (CNS, Affymetrix HuGeneFL GeneChip), we tested the effect of the four pre-processing strategies mentioned above on (1) expression level measurements, (2) detection of differential expression, (3) cluster analysis and (4) classification of samples. For the AML dataset, the effect of pre-processing is in most cases relatively small compared to other choices made in an analysis; for the CNS dataset, it has a more profound effect on the outcome. Analyses on individual probe sets, such as testing for differential expression, are affected most; supervised, multivariate analyses such as classification are far less sensitive to pre-processing.

Conclusion: Using two experimental datasets, we show that the choice of pre-processing method has a relatively minor influence on the final analysis outcome of large microarray studies, whereas it can have important effects on the results of a smaller study. The data source (platform, tissue homogeneity, RNA quality) is potentially more important than the choice of pre-processing method.

Figures

Figure 1
Correlation of expression values pre-processed by two methods. Pearson correlation coefficients of expression measurements calculated by two pre-processing procedures are shown on the y-axis, probe sets ranked by average expression level over the four pre-processing methods are shown on the x-axis. Contours indicate equal density, as estimated using a Gaussian kernel density estimate, with kernel width optimised by leave-one-out maximum-likelihood. A. AML dataset. B. CNS dataset.
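
The per-probe-set correlation plotted in Figure 1 can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the function name and the probe-sets-by-samples matrix layout are assumptions, and the density contours are not reproduced.

```python
import numpy as np

def per_probe_set_correlation(expr_a, expr_b):
    """Pearson correlation across samples, computed separately for each probe set.

    Both inputs are assumed to be probe sets x samples matrices holding the
    expression values produced by two different pre-processing methods.
    """
    a = np.asarray(expr_a, dtype=float)
    b = np.asarray(expr_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both matrices must have the same shape")
    # Center each probe set (row) on its own mean.
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    # Probe sets that are constant under either method have no defined correlation.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, num / den, np.nan)
```

To order the x-axis as in the caption, the probe sets could then be ranked with `np.argsort` applied to the mean expression level over all pre-processed versions of the data.
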
Figure 2
Jaccard indices of clustering results. Results were obtained using correlation distance on a fixed number of probe sets, after different pre-processing procedures and by different clustering algorithms. A. AML dataset, k = 12 clusters, 3000 probe sets. B. CNS dataset, k = 5 clusters, 1000 probe sets.
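
For readers unfamiliar with the measure, here is a minimal sketch of a Jaccard index between two clusterings of the same samples. It assumes the common pair-counting definition (sample pairs grouped together in both clusterings, divided by pairs grouped together in at least one); the abstract does not spell out the exact definition used, and the function names are hypothetical.

```python
from collections import Counter

def _pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2

def jaccard_index(labels_a, labels_b):
    """Pair-counting Jaccard index between two cluster labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    # Pairs placed in the same cluster by both clusterings.
    both = sum(_pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed in the same cluster by each clustering on its own.
    in_a = sum(_pairs(c) for c in Counter(labels_a).values())
    in_b = sum(_pairs(c) for c in Counter(labels_b).values())
    union = in_a + in_b - both
    # With no co-clustered pairs at all, the labelings agree trivially.
    return both / union if union else 1.0
```

The index depends only on which samples are grouped together, so it is unaffected by how the cluster labels happen to be numbered.
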
Figure 3A
Stability normalization of Jaccard index. Illustration of stability normalization for the Jaccard index of a particular k-means clustering (k = 12), obtained on MAS- and RMA-pre-processed versions of the AML dataset (correlation distance, 3000 probe sets). The dotted line corresponds to the Jaccard index between these clusterings (0.55). For both MAS and RMA, the CDF can be used to arrive at a stability-normalized Jaccard index; in this case 0.90 and 0.16, respectively. The arrows indicate the Jaccard indices for which the normalized Jaccard index JSN = 0.5. The interpretation is that for MAS, the comparison to RMA falls well within what can be expected; for RMA, less so.
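
One possible reading of this normalization is sketched below. Two assumptions are made that the caption does not confirm: that the CDF is the empirical distribution of Jaccard indices between repeated clusterings of the same pre-processed data, and that JSN is that CDF evaluated at the observed between-method index. The function name and the example values are hypothetical.

```python
import numpy as np

def stability_normalized_jaccard(observed, reference_indices):
    """Empirical CDF of the reference Jaccard indices, evaluated at `observed`.

    ASSUMPTION: `reference_indices` holds Jaccard indices between repeated
    clusterings of one pre-processed dataset; this is not stated in the source.
    """
    ref = np.sort(np.asarray(reference_indices, dtype=float))
    if ref.size == 0:
        raise ValueError("at least one reference index is required")
    # Fraction of reference indices that are less than or equal to `observed`.
    return np.searchsorted(ref, observed, side="right") / ref.size
```

Under this reading, a value near 1 means the between-method agreement is at least as high as the agreement typically seen between repeated clusterings of one method, and a value near 0 means it is unusually low.
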
Figure 3B
B: AML dataset: stability-normalized pairwise Jaccard indices of cluster labels assigned by the various methods. Clusterings into k = 12 clusters obtained using correlation distance on 3000 probe sets. Legend is shown in Figure 3D. For k-means, the grey bars indicate standard deviation over 10 repeated experiments.
Figure 3C
CNS dataset: stability-normalized pairwise Jaccard indices of cluster labels assigned by the various methods. Clusterings into k = 5 clusters obtained using correlation distance on 1000 probe sets. Legend is shown in Figure 3D. For k-means, the grey bars indicate standard deviation over 10 repeated experiments.
Figure 3D
D: Legend to markers in Figures 3B-C.
