BMC Bioinformatics. 2007 Mar 29;8:109.

doi: 10.1186/1471-2105-8-109.

A meta-data based method for DNA microarray imputation

Rebecka Jörnsten¹, Ming Ouyang, Hui-Yu Wang

PMID: 17394658
PMCID: PMC1852325
DOI: 10.1186/1471-2105-8-109

Rebecka Jörnsten et al. BMC Bioinformatics. 2007.

Affiliation

¹ Department of Statistics, Rutgers, the State University of New Jersey, New Brunswick, NJ 08903, USA. rebecka@stat.rutgers.edu

Abstract

Background: DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite vast improvements in microarray technology in recent years, missing values remain prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates, and thus reliable imputation within logical sets is difficult. However, it is precisely when replicates are few that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting.

Results: We downloaded all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error (RMSE) for significant genes is generally much lower than that for non-significant ones.

Conclusion: Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.

Figures

**Figure 1**
Yeast data: Cross-validation of imputation RMSE of the database matrix. Each point corresponds to a column c of the matrix D, imputed by using the 40 other columns of D that have the largest absolute Pearson correlation with c. The horizontal axis is the mean absolute Pearson correlation of the 40 columns, and the vertical axis is the mean RMSE over 30 independent runs.
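
As a rough illustration of the procedure described in the Figure 1 caption, the sketch below selects the k database columns with the largest absolute Pearson correlation to a target column and scores the imputation by RMSE on held-out entries. The caption does not state which estimator fills in the missing values, so the `impute` function uses a simple stand-in (per-column linear regression, averaged with |r| weights). That estimator, the function names, and the list-of-columns data layout are all assumptions made for this example, not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_correlated(columns, target, k):
    """Indices of the k columns with the largest |r| to target.

    Correlation is computed only over rows where target is observed
    (missing entries are represented as None, an assumption of this sketch).
    """
    obs = [i for i, v in enumerate(target) if v is not None]
    t = [target[i] for i in obs]
    scored = [(abs(pearson([col[i] for i in obs], t)), j)
              for j, col in enumerate(columns)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by column index
    return [j for _, j in scored[:k]]


def impute(columns, target, k):
    """Fill None entries of target using the k most correlated columns.

    Stand-in estimator (hypothetical, not from the paper): fit target ~ a + b*column
    on the observed rows for each selected column, then average the
    predictions with weights |r|.
    """
    obs = [i for i, v in enumerate(target) if v is not None]
    t = [target[i] for i in obs]
    mt = sum(t) / len(t)
    fits = []
    for j in top_correlated(columns, target, k):
        x = [columns[j][i] for i in obs]
        mx = sum(x) / len(x)
        sxx = sum((a - mx) ** 2 for a in x)
        if sxx == 0:
            continue  # a constant column carries no information
        slope = sum((a - mx) * (b - mt) for a, b in zip(x, t)) / sxx
        fits.append((abs(pearson(x, t)), slope, mt - slope * mx, j))
    total = sum(w for w, _, _, _ in fits)
    filled = list(target)
    for i, v in enumerate(target):
        if v is None:
            if total == 0:
                filled[i] = mt  # fall back to the observed mean
            else:
                filled[i] = sum(w * (s * columns[j][i] + b)
                                for w, s, b, j in fits) / total
    return filled


def rmse(truth, estimate, rows):
    """Root mean square error over the given (held-out) row indices."""
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in rows) / len(rows))
```

In a cross-validation run of this kind, known entries of a column would be masked (for example 2% or 16% of them), passed through `impute` with `k=40`, and compared against the original values with `rmse`; repeating the masking and averaging gives a mean RMSE per column.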

**Figure 2**
A different view of the data in Figure 1, showing that the imputation RMSEs are stable for different proportions of missing values: Each point corresponds to a column. The (blue) line is x = y. The horizontal axis is the mean RMSE for 2% missing values, and the vertical axis is mean RMSE for 16% missing values.

**Figure 3**
Yeast data: Imputation RMSE when 5, 10, 20, 40, and 80 database columns with the largest absolute Pearson correlation to the column c are used to impute c with 16% missing values. The results of 15 randomly chosen c's are shown in the figure.

**Figure 4**
Yeast data: Imputation via the database is always better than imputation within logical sets. Each vertical line represents a logical set; the top of the line is the mean RMSE of imputation within the set, and the bottom of the line is the mean RMSE of imputation via the database. Some logical sets have the same number of columns (the horizontal axis), and their lines are drawn with slight offsets so that they do not overlap.

**Figure 5**
Yeast data: Imputation via the database matrix is always better than imputation via one external logical set. Each point is a column from the yeast database matrix. The horizontal axis is the RMSE of imputation via the H₂O₂ set, which has 23 columns, and the vertical axis is the RMSE of imputation via the 23 columns selected from the database matrix using our method. The (blue) line is x = y. All points are below the line, indicating that the database approach is always superior.

**Figure 6**
Yeast data: RMSE of significant and non-significant genes. The ten percent of genes that are most up-regulated and the ten percent that are most down-regulated in a column are designated as significant. With 16% missing values imputed by the database matrix, the significant genes generally have lower RMSE than the non-significant ones. Each point in the figure corresponds to a column from the database matrix. The (blue) line is x = y; the (black) points below the line are samples where significant genes have lower RMSE than non-significant ones; the (red) points above the line are samples where significant genes have higher RMSE than non-significant ones.
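
The significant/non-significant split in the Figure 6 caption can be sketched as follows. This is a minimal example under stated assumptions: expression values are treated as signed numbers (such as log ratios) so that the largest values count as most up-regulated and the smallest as most down-regulated, and the function names and arguments are hypothetical rather than taken from the paper.

```python
import math


def significant_rows(column, fraction=0.10):
    """Row indices of the most down- and most up-regulated genes in a column.

    Takes the lowest and highest `fraction` of values (ten percent each by
    default), following the definition in the Figure 6 caption.
    """
    n_each = int(len(column) * fraction)
    if n_each == 0:
        return set()
    order = sorted(range(len(column)), key=lambda i: column[i])
    return set(order[:n_each]) | set(order[-n_each:])


def rmse_by_group(truth, imputed, masked_rows, fraction=0.10):
    """RMSE over masked rows, reported separately for significant and
    non-significant genes. Returns None for a group with no masked rows."""
    sig = significant_rows(truth, fraction)

    def group_rmse(rows):
        if not rows:
            return None
        return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in rows) / len(rows))

    masked = list(masked_rows)
    return (group_rmse([i for i in masked if i in sig]),
            group_rmse([i for i in masked if i not in sig]))
```

Plotting the two returned values against each other for every database column would give one point per column, in the manner the caption describes.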

**Figure 7**
Yeast data: Comparison of imputation RMSE of the meta-data approach and the logical-set approach. With 16% missing values, the meta-data approach generally has lower RMSE than the logical-set approach. The legends are the same as in Figure 6. The top panel is for significant genes, and the bottom panel is for non-significant ones. This figure offers detailed views of the data presented in Figure 4.

**Figure 8**
Worm and plant data: Comparison of imputation RMSE of significant genes by the meta-data approach and by the logical-set approach. With 16% missing values, the meta-data approach generally has lower RMSE than the logical-set approach. The legends are the same as in Figure 6.

Cited by

How to improve postgenomic knowledge discovery using imputation.
Sehgal MS, Gondal I, Dooley LS, Coppel R. EURASIP J Bioinform Syst Biol. 2009;2009(1):717136. doi: 10.1155/2009/717136. Epub 2009 Feb 8. PMID: 19223972.
A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data.
Miecznikowski JC, Damodaran S, Sellers KF, Rabin RA. Proteome Sci. 2010 Dec 15;8:66. doi: 10.1186/1477-5956-8-66. PMID: 21159180.
An efficient ensemble method for missing value imputation in microarray gene expression data.
Zhu X, Wang J, Sun B, Ren C, Yang T, Ding J. BMC Bioinformatics. 2021 Apr 13;22(1):188. doi: 10.1186/s12859-021-04109-4. PMID: 33849444.
Latent class based multiple imputation approach for missing categorical data.
Gebregziabher M, DeSantis SM. J Stat Plan Inference. 2010 Nov;140(11):3252-3262. doi: 10.1016/j.jspi.2010.04.020. PMID: 30555206.
An integrative imputation method based on multi-omics datasets.
Lin D, Zhang J, Li J, Xu C, Deng HW, Wang YP. BMC Bioinformatics. 2016 Jun 21;17:247. doi: 10.1186/s12859-016-1122-6. PMID: 27329642.
