Comparative Study

BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S8.

doi: 10.1186/1471-2105-7-S4-S8.

The impact of sample imbalance on identifying differentially expressed genes

Kun Yang¹, Jianzhong Li, Hong Gao

PMID: 17217526
PMCID: PMC1780111
DOI: 10.1186/1471-2105-7-S4-S8

Kun Yang et al. BMC Bioinformatics. 2006.

Affiliation

¹ Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, China. kunyang@hit.edu.cn

Abstract

Background: Several statistical methods have recently been proposed to identify genes that are differentially expressed between two conditions. However, very few studies consider the problem of sample imbalance, and none has investigated its impact on identifying differentially expressed genes. In addition, it is not clear which method is most suitable for unbalanced data.

Results: Based on random sampling, two evaluation models are proposed to investigate the impact of sample imbalance on identifying differentially expressed genes. Using these models, the performance of six widely used methods is compared on unbalanced data. The experimental results indicate that sample imbalance has a great influence on the selection of differentially expressed genes. Furthermore, the methods perform very differently on unbalanced data. Among the six methods, the Welch t-test appears to perform best when the group with the larger variance also has more samples than the group with the smaller variance, while the Regularized t-test and SAM outperform the others on unbalanced data in other cases.

Conclusion: The two proposed evaluation models are effective, and sample imbalance should be taken into account in microarray experiment design and gene expression data analysis. The results and the two evaluation models can help in selecting a suitable method for processing unbalanced data.
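The kind of comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' evaluation model: it assumes Gaussian expression values, a hypothetical gene count, effect size (`shift`), and top-k selection rule, and it compares only the Welch t statistic with the pooled-variance Student t statistic. The variances σ₁ = 0.5 and σ₂ = 1 and the total of 60 samples are borrowed from the figure captions.

```python
import math
import random
import statistics


def welch_t(x, y):
    """Welch t statistic: mean difference over sqrt(s1^2/n1 + s2^2/n2)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(vx / len(x) + vy / len(y))


def student_t(x, y):
    """Pooled-variance (Student) t statistic, which assumes equal variances."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def simulate(n1, n2, sigma1=0.5, sigma2=1.0, n_genes=1000, n_de=100, shift=1.0, seed=0):
    """Return (class1, class2, truth): per-gene sample lists and the set of DE gene indices.

    n_genes, n_de and shift are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    truth = set(range(n_de))  # assumption: the first n_de genes are truly differentially expressed
    c1 = [[rng.gauss(0.0, sigma1) for _ in range(n1)] for _ in range(n_genes)]
    c2 = [[rng.gauss(shift if g in truth else 0.0, sigma2) for _ in range(n2)]
          for g in range(n_genes)]
    return c1, c2, truth


def precision_recall(stat, c1, c2, truth, top_k):
    """Select the top_k genes by |statistic| and score the selection against the truth."""
    scores = [abs(stat(x, y)) for x, y in zip(c1, c2)]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    hits = len(set(ranked[:top_k]) & truth)
    return hits / top_k, hits / len(truth)


if __name__ == "__main__":
    # Balanced and unbalanced splits of 60 samples in total.
    for n1, n2 in [(30, 30), (50, 10), (10, 50)]:
        c1, c2, truth = simulate(n1, n2)
        for name, stat in [("Welch", welch_t), ("Student", student_t)]:
            p, r = precision_recall(stat, c1, c2, truth, top_k=100)
            print(f"n1={n1} n2={n2} {name}: precision={p:.2f} recall={r:.2f}")
```

With equal group sizes the two statistics coincide, so any difference between them in this sketch comes from the unbalanced splits. Averaging over many seeds, rather than a single run, would be needed before drawing conclusions.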

Figures

**Figure 1**
**The results on prostate and liver datasets under the evaluation model 1**. The expected Overlap Rates of six methods as well as their error limits on prostate and liver datasets under the evaluation model 1, where the sizes of samples in Class C₁ of the artificial data, which are created from the liver data and the prostate data, are all fixed at 60.

**Figure 2**
**The results on prostate and liver datasets under the evaluation model 2**. The expected Overlap Rates of six methods as well as their error limits on prostate and liver datasets under the evaluation model 2, where the number of overall samples in the artificial data from liver data is fixed at 120 and that from the prostate data is fixed at 60.

**Figure 3**
**The expected performances of six methods on the simulated data with equal variances, i.e. σ₁ = σ₂ = 0.5**. The expected Precision Rates and Recall Rates of six methods as well as their error limits on the simulated data with equal variances (σ₁ = σ₂ = 0.5), where the number of samples of class C₁ is fixed at 60 in the evaluation model 1 and the number of overall samples is fixed at 60 in the evaluation model 2.

**Figure 4**
**The expected performances of six methods under the evaluation model 1 on the simulated data with unequal variances, where σ₁ = 0.5, σ₂ = 1**. The expected Precision Rates and Recall Rates of six methods as well as their error limits on the simulated data with unequal variances (σ₁ = 0.5, σ₂ = 1) in the evaluation model 1, where the numbers of samples of class C₁ and class C₂ are fixed at 60, respectively.

**Figure 5**
**The expected performances of six methods under the evaluation model 2 on the simulated data with unequal variances, where σ₁ = 0.5, σ₂ = 1 and n₁ + n₂ ≡ 60**. The expected Precision Rates and Recall Rates of six methods as well as their error limits on the simulated data with unequal variances (σ₁ = 0.5, σ₂ = 1) in the evaluation model 2, where the number of overall samples is fixed at 60.

**Figure 6**
**The average performance of the Regularized t-test minus the corresponding performance of the Welch t-test on the simulated data with varied variance σ₂, where n₁ + n₂ ≡ 60 and σ₁ ≡ 0.5**. The average Precision Rate and Recall Rate of the Regularized t-test minus those of the Welch t-test on the simulated data with varied variance σ₂, where σ₁ ≡ 0.5 and n₁ + n₂ ≡ 60.

Cited by

Considerations for reproducible omics in aging research.
Singh PP, Benayoun BA. Nat Aging. 2023 Aug;3(8):921-930. doi: 10.1038/s43587-023-00448-4. Epub 2023 Jun 29. PMID: 37386258. Free PMC article. Review.
A robustness study of parametric and non-parametric tests in model-based multifactor dimensionality reduction for epistasis detection.
Mahachie John JM, Van Lishout F, Gusareva ES, Van Steen K. BioData Min. 2013 Apr 25;6(1):9. doi: 10.1186/1756-0381-6-9. PMID: 23618370. Free PMC article.
Modeling group heteroscedasticity in single-cell RNA-seq pseudo-bulk data.
You Y, Dong X, Wee YK, Maxwell MJ, Alhamdoosh M, Smyth GK, Hickey PF, Ritchie ME, Law CW. Genome Biol. 2023 May 5;24(1):107. doi: 10.1186/s13059-023-02949-2. PMID: 37147723. Free PMC article.
Diagnosing rare diseases after the exome.
Frésard L, Montgomery SB. Cold Spring Harb Mol Case Stud. 2018 Dec 17;4(6):a003392. doi: 10.1101/mcs.a003392. Print 2018 Dec. PMID: 30559314. Free PMC article. Review.
Coevolution of prostate cancer and bone stroma in three-dimensional coculture: implications for cancer growth and metastasis.
Sung SY, Hsieh CL, Law A, Zhau HE, Pathak S, Multani AS, Lim S, Coleman IM, Wu LC, Figg WD, Dahut WL, Nelson P, Lee JK, Amin MB, Lyles R, Johnstone PA, Marshall FF, Chung LW. Cancer Res. 2008 Dec 1;68(23):9996-10003. doi: 10.1158/0008-5472.CAN-08-2492. PMID: 19047182. Free PMC article.
