Comparative Study
BMC Bioinformatics. 2006 Mar 15:7:137.
doi: 10.1186/1471-2105-7-137.

How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results

Frank F Millenaar et al. BMC Bioinformatics.

Abstract

Background: Short oligonucleotide arrays for transcript profiling have been available for several years. Generally, raw data from these arrays are analysed with the aid of the Microarray Analysis Suite or GeneChip Operating Software (MAS or GCOS) from Affymetrix. Recently, more methods to analyse the raw data have become available. Ideally all these methods should come up with more or less the same results. We set out to evaluate the different methods and include work on our own data set, in order to test which method gives the most reliable results.

Results: Calculating gene expression with six different algorithms (MAS5, dChip PMMM, dChip PM, RMA, GC-RMA and PDNN) from the same Arabidopsis data gives different calculated gene expression levels. Consequently, depending on the method used, different genes will be identified as differentially regulated. Surprisingly, there was only 27 to 36% overlap between the different methods. Furthermore, 47.5% of the genes/probe sets showed good correlation between the mismatch and perfect match intensities.

Conclusion: After comparing six algorithms, RMA gave the most reproducible results and showed the highest correlation coefficients with Real Time RT-PCR data on genes identified as differentially expressed by all methods. However, we were not able to verify, by Real Time RT-PCR, the microarray results for most genes that were identified solely by RMA. Furthermore, we conclude that subtracting the mismatch intensity from the perfect match intensity most likely results in a significant underestimation for at least 47.5% of the expression values. None of the algorithms produced significant expression values for genes present in quantities below 1 pmol. If the only purpose of the microarray experiment is to find new candidate genes, and too many genes are found, then mutual exclusion of the genes predicted by contrasting methods can be used to narrow down the list of new candidate genes by 64 to 73%.


Figures

Figure 1
Venn diagram. Venn diagram of genes significantly (t-test, p < 0.05) up- or down-regulated after three hours of ethylene exposure, depending on the method used to calculate gene expression. The diagram shows the differences and similarities between all the methods. PDNN, MAS 5.0 (MAS, or GCOS), dChip PMMM (PMMM), dChip PM only (PM), RMA and GC-RMA were used. Only 790 genes were in common for all four algorithms. Comparable results were obtained from the low-light treatment. Areas with one letter show genes that are unique to one method, areas with two letters show genes that are shared only by those two methods, and so on.
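The regions of such a Venn diagram can be reproduced with plain set operations. The sketch below is illustrative only: the gene IDs are hypothetical, the function name `overlap_summary` is an assumption, and the "shared fraction" (genes called by every method divided by the size of each method's own list) is one plausible way to express overlap, not necessarily the definition used by the authors.

```python
def overlap_summary(calls):
    """Summarise agreement between expression methods.

    calls: dict mapping a method name to the set of gene IDs that the
    method calls significant. Must contain at least one method.
    """
    gene_sets = list(calls.values())
    common = set.intersection(*gene_sets)   # called by every method
    union = set().union(*gene_sets)         # called by at least one method

    # Genes called by exactly one method (the one-letter Venn areas).
    unique = {}
    for name, genes in calls.items():
        others = [g for other, g in calls.items() if other != name]
        unique[name] = genes - set().union(*others)

    # Assumed definition: share of each method's own list on which
    # all methods agree.
    shared_fraction = {
        name: len(common) / len(genes)
        for name, genes in calls.items()
        if genes
    }
    return common, union, unique, shared_fraction


# Hypothetical gene IDs, for illustration only.
calls = {
    "MAS5": {"g1", "g2", "g3", "g4"},
    "RMA": {"g1", "g2", "g5"},
    "PDNN": {"g1", "g2", "g3", "g6"},
}
common, union, unique, shared_fraction = overlap_summary(calls)
print(sorted(common))
print({name: sorted(genes) for name, genes in unique.items()})
```

Keeping only `common` corresponds to narrowing a candidate list to genes that every method agrees on.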
Figure 2
Gene signal intensity. Gene signal intensity from control plants of all genes (22747) calculated with six methods: MAS 5.0 (MAS), dChip PMMM (PMMM), dChip PM only (PM), RMA, GC-RMA and PDNN. The signal intensity is sorted from low to high. Similar results were observed with expression data from treated plants.
Figure 3
Relation between signal intensity from MAS 5.0 and RMA. Relation between the signal intensity calculated with MAS 5.0 and RMA software for all probe sets. In general there is a good correlation (r² = 0.9913); see also table 1. However, variation increased closer to unity. For example, a signal of 4 in MAS 5.0 results in a signal between 4 and 5.5 in RMA on a ln scale.
Figure 4
Spiked-in data. (A) Average observed ln intensity plotted against normalized ln concentration for 42 spiked-in genes of the Affymetrix spike-in experiment. The observed concentrations are adjusted so that all lines have the same intercept at a ln concentration of 2.8 (16 pmol). The solid line without symbols represents the ideal slope-1 line. (B) The accuracy of picking up the spiked-in genes. The significance of the difference between two successive spike-in concentrations (0–0.125; 0.125–0.25; etc.) was calculated for each gene. The number of genes that were significantly up-regulated was calculated per spike-in concentration and is presented on the y-axis as a percentage. This means that at "1" all 42 genes were significant at a given concentration.
Figure 5
Reproducibility. Reproducibility of expression data between three biological replicates (air), compared between MAS 5.0, dChip PM, dChip PMMM, RMA, GC-RMA and PDNN. Reproducibility is calculated as the standard deviation divided by the average signal, which is the coefficient of variation (CV). The CV values are sorted from low to high. The PM, RMA and PDNN algorithms give the most reproducible results and MAS 5.0 the least. Reproducibility of the two other replicated treatments, ethylene and low-light, gave similar results (data not shown).
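The CV described in this caption is straightforward to compute per gene. The following is a minimal sketch with hypothetical signal values; it assumes the sample standard deviation (n − 1 denominator), which the caption does not specify, and the function name `coefficient_of_variation` is illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """Return standard deviation / mean for one gene's replicate signals.

    Assumption: sample standard deviation (n - 1), since the source does
    not say which form was used. Needs at least two replicates.
    """
    avg = mean(replicates)
    if avg == 0:
        raise ValueError("CV is undefined when the mean signal is zero")
    return stdev(replicates) / avg


# Hypothetical signals from three biological replicates per gene.
expression = {
    "geneA": [100, 110, 90],
    "geneB": [50, 50, 50],
    "geneC": [10, 30, 20],
}

cv_by_gene = {gene: coefficient_of_variation(vals) for gene, vals in expression.items()}

# Sort the CV values from low to high, as done for the reproducibility curve.
sorted_cvs = sorted(cv_by_gene.values())
print(sorted_cvs)
```

Running this once per method on the same replicates gives one sorted CV curve per method, and a lower curve indicates better reproducibility.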
Figure 6
Examples of the relation between PM and MM signals. Relation between the PM and MM signals of four probe sets from all 9 arrays (A...D). Data points are plotted only when the MM signal intensity is smaller than the PM signal. In panels A and B there is no correlation between the PM and MM signals, as can be seen from the low slope and Pearson correlation coefficient. This is in contrast to the results in panels C and D, where the slope and Pearson correlation coefficient are large. These signals are obtained from the microarray scanner and are the input for the six calculation methods.
Figure 7
Relation between PM and MM signals. Slope and Pearson correlation coefficient calculated between the PM and MM signals from 200 randomly chosen probe sets. Only probe sets that represent a single gene are used. Both slope and correlation are sorted from low to high. See figure 6 for further explanation and individual examples.
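A slope and Pearson correlation of this kind can be sketched for a single probe set as follows. The intensities are hypothetical, the function name `slope_and_pearson` is illustrative, and regressing MM on PM (rather than the reverse) is an assumption; the filter that keeps only pairs with MM below PM follows the Figure 6 caption.

```python
from math import sqrt


def slope_and_pearson(pm, mm):
    """Least-squares slope of MM on PM and the Pearson correlation.

    Only pairs with MM < PM are used, as described for Figure 6.
    Assumption: MM is treated as the dependent variable.
    """
    pairs = [(p, m) for p, m in zip(pm, mm) if m < p]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs with MM < PM")

    mean_p = sum(p for p, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in pairs)
    syy = sum((m - mean_m) ** 2 for _, m in pairs)
    sxy = sum((p - mean_p) * (m - mean_m) for p, m in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("slope or correlation undefined for constant signals")

    return sxy / sxx, sxy / sqrt(sxx * syy)


# Hypothetical probe intensities; the last pair has MM > PM and is dropped.
pm = [100, 200, 300, 400, 50]
mm = [50, 100, 150, 200, 80]
slope, r = slope_and_pearson(pm, mm)
print(slope, r)
```

A slope and correlation near zero would mean that MM carries little gene-specific signal, whereas large values mean that subtracting MM from PM removes part of the true signal.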

