Comparative Study

BMC Bioinformatics. 2008 Jan 10;9:12.

doi: 10.1186/1471-2105-9-12.

Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

Guy N Brock¹, John R Shaffer, Richard E Blakesley, Meredith J Lotz, George C Tseng

Affiliation

¹ Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40292, USA. guy.brock@louisville.edu

PMID: 18186917
PMCID: PMC2253514

Guy N Brock et al. BMC Bioinformatics. 2008.

Abstract

Background: Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. Many methods have been proposed in the literature to estimate missing values using the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions under which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set.

Results: We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost.
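The abstract does not state the formula for the entropy measure. As a rough illustration only, one plausible reading is the normalized Shannon entropy of the eigenvalue spectrum of the expression matrix: values near 0 mean the data sit close to a low-dimensional subspace (low complexity), and values near 1 mean the variance is spread evenly across dimensions (high complexity). The function name `spectral_entropy`, the column centering, and the formula itself are assumptions made for this sketch, not details taken from the study.

```python
import numpy as np

def spectral_entropy(matrix):
    """Normalized entropy of the eigenvalue spectrum of a complete matrix.

    Assumed formulation (not taken from the abstract):
    e = -sum(p_i * log(p_i)) / log(k), with p_i = lambda_i / sum(lambda).
    """
    x = np.asarray(matrix, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)       # center each column
    singular = np.linalg.svd(x, compute_uv=False)
    eig = singular ** 2                         # eigenvalues of x.T @ x
    k = eig.size
    total = eig.sum()
    if k < 2 or total == 0:
        return 0.0                              # degenerate input: no spread to measure
    p = eig / total
    p = p[p > 0]                                # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum() / np.log(k))

rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=200), rng.normal(size=10))   # rank-1 structure
high = rng.normal(size=(200, 10))                           # unstructured noise
print(spectral_entropy(low), spectral_entropy(high))
```

The function expects a complete matrix, so in practice it would have to be applied to the complete rows of a data set or to an already imputed matrix; which of these the authors used is not stated here.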

Conclusion: Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
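The general idea behind a simulation-based self-training selection can be sketched as follows: hide a small fraction of the values that were actually observed, impute them with each candidate method, and keep the method that recovers the hidden values best. This is a minimal sketch under stated assumptions. The names `self_training_select`, `row_mean_impute`, and `knn_impute` are hypothetical, the two imputers are simplified stand-ins rather than any of the eight methods evaluated in the study, and the error measure is plain RMSE rather than the LRMSE reported in the figures.

```python
import numpy as np

def row_mean_impute(x):
    """Stand-in imputer: fill each gap with the mean of its row."""
    filled = x.copy()
    row_means = np.nanmean(x, axis=1)
    # Fall back to the global mean if a row has no observed values.
    row_means = np.where(np.isnan(row_means), np.nanmean(x), row_means)
    rows, cols = np.where(np.isnan(x))
    filled[rows, cols] = row_means[rows]
    return filled

def knn_impute(x, k=5):
    """Stand-in imputer: simplified k-nearest-rows averaging."""
    base = row_mean_impute(x)            # provisional fill so distances are defined
    filled = x.copy()
    for i in np.where(np.isnan(x).any(axis=1))[0]:
        dist = np.sqrt(((base - base[i]) ** 2).mean(axis=1))
        dist[i] = np.inf                 # a row is not its own neighbour
        nearest = np.argsort(dist)[:k]
        missing = np.isnan(x[i])
        filled[i, missing] = base[nearest][:, missing].mean(axis=0)
    return filled

def self_training_select(x, methods, frac=0.05, n_rounds=5, seed=0):
    """Pick the method with the lowest mean RMSE on artificially hidden values."""
    rng = np.random.default_rng(seed)
    observed = np.argwhere(~np.isnan(x))
    n_mask = max(1, int(frac * len(observed)))
    errors = {name: [] for name in methods}
    for _ in range(n_rounds):
        pick = observed[rng.choice(len(observed), size=n_mask, replace=False)]
        rows, cols = pick[:, 0], pick[:, 1]
        masked = x.copy()
        masked[rows, cols] = np.nan      # hide values whose truth is known
        for name, impute in methods.items():
            est = impute(masked)
            rmse = np.sqrt(np.mean((est[rows, cols] - x[rows, cols]) ** 2))
            errors[name].append(rmse)
    mean_err = {name: float(np.mean(v)) for name, v in errors.items()}
    return min(mean_err, key=mean_err.get), mean_err
```

Every candidate method is rerun in every round, so the running time grows with the number of methods and rounds, which is consistent with the increased computational cost noted for STS.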

Figures

**Figure 1**
Average LRMSE values for different percentages of missing values in different microarray data sets.

**Figure 2**
Average LRMSE values for all imputation methods and all data sets, using the optimized parameter values and with 5% missing.

**Figure 3**
Plot of entropy vs. adjusted LRMSE values ($\mathrm{LRMSE} - \hat{\gamma}_{J}$) for each imputation method and each data set using Simulation II, with fitted regression lines.
