Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes
- PMID: 18186917
- PMCID: PMC2253514
- DOI: 10.1186/1471-2105-9-12
Abstract
Background: Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. Many methods have been proposed in the literature to estimate missing values from the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions under which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set.
Results: We found that the top-performing imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior across the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty of mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost.
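The abstract does not state the entropy formula. As an illustration only, one plausible form (an assumption, not confirmed by the text above) is the normalized Shannon entropy of the eigenvalue spectrum of the expression matrix D, where the symbols λ_i (eigenvalues of the covariance matrix of D), k (number of positive eigenvalues), and e(D) are introduced here for the sketch:

```latex
p_i = \frac{\lambda_i}{\sum_{j=1}^{k} \lambda_j}, \qquad
e(D) = -\frac{\sum_{i=1}^{k} p_i \log p_i}{\log k}
```

Under this assumed definition, e(D) lies between 0 and 1. A value near 0 means a few components carry most of the variance, so the matrix maps easily to a lower-dimensional subspace (low complexity); a value near 1 means the variance is spread evenly across components (high complexity).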
Conclusion: Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better on data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
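The self-training idea behind the STS scheme can be sketched as follows: hide some entries whose values are known, impute them with each candidate method, and keep the method with the lowest error on the hidden entries. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`row_mean_impute`, `knn_impute`, `sts_select`), the parameters `hide_fraction` and `n_rounds`, and the use of RMSE as the error measure are all hypothetical, and the two simple imputers are stand-ins for the eight methods compared in the study.

```python
import math
import random


def row_mean_impute(matrix):
    """Fill each missing entry (None) with the mean of the observed values in its row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def knn_impute(matrix, k=2):
    """Fill each missing entry with the column average over the k nearest rows.

    Distance is the root mean squared difference over columns observed in both
    rows. Entries with no usable neighbour fall back to the row mean.
    """
    filled = row_mean_impute(matrix)
    for i, row in enumerate(matrix):
        if None not in row:
            continue
        dists = []
        for j, other in enumerate(matrix):
            if j == i:
                continue
            shared = [(a, b) for a, b in zip(row, other)
                      if a is not None and b is not None]
            if shared:
                d = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                dists.append((d, j))
        dists.sort()
        for c, v in enumerate(row):
            if v is None:
                donors = [matrix[j][c] for _, j in dists
                          if matrix[j][c] is not None][:k]
                if donors:
                    filled[i][c] = sum(donors) / len(donors)
    return filled


def sts_select(matrix, methods, hide_fraction=0.1, n_rounds=5, seed=0):
    """Pick the candidate method with the lowest average RMSE on hidden entries."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(1, int(hide_fraction * len(observed)))
    totals = {name: 0.0 for name in methods}
    for _ in range(n_rounds):
        # Hide a random subset of known entries to create a scored test set.
        hidden = rng.sample(observed, n_hide)
        masked = [list(row) for row in matrix]
        for i, j in hidden:
            masked[i][j] = None
        for name, impute in methods.items():
            filled = impute(masked)
            sq_err = sum((filled[i][j] - matrix[i][j]) ** 2 for i, j in hidden)
            totals[name] += math.sqrt(sq_err / n_hide)
    scores = {name: total / n_rounds for name, total in totals.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Toy expression matrix (rows = genes, columns = arrays); None marks a missing value.
toy = [
    [1.0, 1.2, None, 0.9],
    [1.1, 1.3, 1.0, 1.0],
    [5.0, None, 5.2, 4.9],
    [5.1, 5.3, 5.1, 5.0],
]
candidates = {"row_mean": row_mean_impute, "knn": knn_impute}
best_method, rmse_by_method = sts_select(toy, candidates)
```

The repeated mask-impute-score loop is what makes this kind of scheme more expensive than a single entropy calculation: every candidate method is run `n_rounds` times on the full matrix before the real imputation starts.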