BMC Bioinformatics. 2007 Mar 29:8:109.
doi: 10.1186/1471-2105-8-109.

A meta-data based method for DNA microarray imputation

Rebecka Jörnsten et al. BMC Bioinformatics.

Abstract

Background: DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite the vast improvement of the microarray technology in recent years, missing values are prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates and thus reliable imputation within logical sets is difficult. However, it is in the case of few replicates that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting.

Results: We download all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error (RMSE) for significant genes is generally much lower than that for non-significant genes.

Conclusion: Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.

Figures

Figure 1
Yeast data: Cross-validation of imputation RMSE of the database matrix. Each point corresponds to a column c of the matrix D, imputed by using the 40 other columns of D that have the largest absolute Pearson correlation with c. The horizontal axis is the mean absolute Pearson correlation of those 40 columns with c, and the vertical axis is the mean RMSE over 30 independent runs.
Figure 2
A different view of the data in Figure 1, showing that the imputation RMSEs are stable for different proportions of missing values: Each point corresponds to a column. The (blue) line is x = y. The horizontal axis is the mean RMSE for 2% missing values, and the vertical axis is mean RMSE for 16% missing values.
Figure 3
Yeast data: Imputation RMSE when 5, 10, 20, 40, and 80 database columns with the largest absolute Pearson correlation to the column c are used to impute c with 16% missing values. The results of 15 randomly chosen c's are shown in the figure.
Figure 4
Yeast data: Imputation via the database is always better than imputation within logical sets. Each vertical line represents a logical set; the top of the line is mean RMSE of imputation within the set, and the bottom of the line is mean RMSE of imputation via the database. Some logical sets have the same numbers of columns (the horizontal axis), and their lines are drawn with slight offsets so that they do not overlap.
Figure 5
Yeast data: Imputation via the database matrix is always better than imputation via one external logical set. Each point is a column from the yeast database matrix. The horizontal axis is the RMSE of imputation via the H2O2 set, which has 23 columns, and the vertical axis is the RMSE of imputation via the 23 columns selected from the database matrix using our method. The (blue) line is x = y. All points are below the line, indicating that the database approach is always superior.
Figure 6
Yeast data: RMSE of significant and non-significant genes. The ten percent most up-regulated genes and the ten percent most down-regulated genes in a column are designated as significant. With 16% missing values imputed by the database matrix, the significant genes generally have lower RMSE than the non-significant ones. Each point in the figure corresponds to a column from the database matrix. The (blue) line is x = y; the (black) points below the line are samples where significant genes have lower RMSE than non-significant ones; the (red) points above the line are samples where significant genes have higher RMSE than non-significant ones.
Figure 7
Yeast data: Comparison of imputation RMSE of the meta-data approach and the logical-set approach. With 16% missing values, the meta-data approach generally has lower RMSE than the logical-set approach. The legend is the same as in Figure 6. The top panel is for significant genes, and the bottom panel is for non-significant ones. This figure offers detailed views of the data presented in Figure 4.
Figure 8
Worm and plant data: Comparison of imputation RMSE of significant genes by the meta-data approach and by the logical-set approach. With 16% missing values, the meta-data approach generally has lower RMSE than the logical-set approach. The legend is the same as in Figure 6.
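
The figure captions describe three computational steps: choosing the database columns with the largest absolute Pearson correlation to a target column c, imputing the missing entries of c from those columns, and scoring the result by RMSE, optionally split into significant and non-significant genes. The following Python sketch illustrates those steps under stated assumptions. The abstract does not specify the imputation rule, so an ordinary least-squares fit on the selected columns is used here purely as a stand-in. All function names are hypothetical, and the database matrix D is assumed to contain no missing values.

```python
import numpy as np

def top_correlated_columns(D, c, k):
    """Return indices of the k columns of D with the largest |Pearson r| to c.

    Correlations use only the rows where c is observed (not NaN).
    Assumption: D itself has no missing values.
    """
    obs = ~np.isnan(c)
    r = np.array([np.corrcoef(c[obs], D[obs, j])[0, 1]
                  for j in range(D.shape[1])])
    order = np.argsort(-np.abs(r))[:k]  # largest absolute correlation first
    return order, np.abs(r[order])

def impute_by_regression(D, c, cols):
    """Fill NaN entries of c from the selected columns of D.

    Assumption: least squares with an intercept stands in for the
    paper's imputation rule, which the abstract does not describe.
    """
    obs = ~np.isnan(c)
    A = np.column_stack([np.ones(D.shape[0]), D[:, cols]])
    beta, *_ = np.linalg.lstsq(A[obs], c[obs], rcond=None)
    filled = c.copy()
    filled[~obs] = A[~obs] @ beta
    return filled

def rmse(truth, estimate, mask):
    """Root mean square error over the entries selected by mask."""
    diff = truth[mask] - estimate[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def significant_mask(column, frac=0.10):
    """Mark the lowest and highest `frac` of values in a column as significant,
    mirroring the ten percent rule stated in the Figure 6 caption."""
    lo, hi = np.quantile(column, [frac, 1.0 - frac])
    return (column <= lo) | (column >= hi)
```

In a cross-validation run of the kind the captions describe, one would hide a known fraction of entries in c (for example 16%), call `top_correlated_columns` and `impute_by_regression`, and then compare `rmse(truth, filled, hidden & sig)` with `rmse(truth, filled, hidden & ~sig)`, where `sig = significant_mask(truth)`.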

