Comparative Study
BMC Bioinformatics. 2006 Dec 12;7 Suppl 4(Suppl 4):S8.
doi: 10.1186/1471-2105-7-S4-S8.

The impact of sample imbalance on identifying differentially expressed genes

Kun Yang et al. BMC Bioinformatics. 2006.

Abstract

Background: Several statistical methods have recently been proposed to identify genes that are differentially expressed between two conditions. However, very few studies consider sample imbalance, and none has investigated its impact on identifying differentially expressed genes. In addition, it is not clear which method is most suitable for unbalanced data.

Results: Two evaluation models based on random sampling are proposed to investigate the impact of sample imbalance on identifying differentially expressed genes. Using these models, the performance of six well-known methods is compared on unbalanced data. The experimental results indicate that sample imbalance strongly influences which genes are selected as differentially expressed. Furthermore, the methods perform very differently on unbalanced data. Among the six methods, the Welch t-test appears to perform best when the group with the larger variance also has the larger sample size, while the Regularized t-test and SAM outperform the others on unbalanced data in the remaining cases.

Conclusion: The two proposed evaluation models are effective, and sample imbalance should be taken into account in microarray experiment design and gene expression data analysis. The results and the two evaluation models can help in selecting a suitable method for processing unbalanced data.
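
To make the comparison in the Results concrete, the sketch below contrasts the Welch t statistic with the pooled-variance (Student) t statistic for a single gene. It is only an illustration: the function names and the expression values are hypothetical and are not taken from the paper, and the Regularized t-test and SAM are not shown.

```python
import math
import statistics


def welch_t(x, y):
    """Welch t statistic: each group keeps its own variance estimate."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = math.sqrt(vx / len(x) + vy / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def pooled_t(x, y):
    """Student t statistic: both groups share one pooled variance estimate."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    se = math.sqrt(pooled * (1 / nx + 1 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / se


# Hypothetical expression values for one gene: a small, low-variance group
# and a larger, high-variance group (illustrative numbers, not from the paper).
small_group = [1.0, 1.2, 0.8]
large_group = [2.0, 4.0, 0.0, 3.0, 1.0, 2.0]

print("Welch t: ", welch_t(small_group, large_group))
print("Pooled t:", pooled_t(small_group, large_group))
```

When the two groups have the same size, the two statistics coincide; they differ only when group sizes are unequal, which is why sample imbalance can change how a gene is ranked.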

Figures

Figure 1
Results on the prostate and liver datasets under evaluation model 1. The expected Overlap Rates of the six methods, with their error limits, on the prostate and liver datasets under evaluation model 1, where the sample size of Class C1 in the artificial data created from the liver data and from the prostate data is fixed at 60.
Figure 2
Results on the prostate and liver datasets under evaluation model 2. The expected Overlap Rates of the six methods, with their error limits, on the prostate and liver datasets under evaluation model 2, where the total number of samples is fixed at 120 in the artificial data created from the liver data and at 60 in the artificial data created from the prostate data.
Figure 3
The expected performance of the six methods on the simulated data with equal variances, i.e. σ1 = σ2 = 0.5. The expected Precision Rates and Recall Rates of the six methods, with their error limits, on the simulated data with equal variances (σ1 = σ2 = 0.5), where the number of samples in class C1 is fixed at 60 in evaluation model 1 and the total number of samples is fixed at 60 in evaluation model 2.
Figure 4
The expected performance of the six methods under evaluation model 1 on the simulated data with unequal variances, where σ1 = 0.5, σ2 = 1. The expected Precision Rates and Recall Rates of the six methods, with their error limits, on the simulated data with unequal variances (σ1 = 0.5, σ2 = 1) in evaluation model 1, where the number of samples in class C1 and in class C2, respectively, is fixed at 60.
Figure 5
The expected performance of the six methods under evaluation model 2 on the simulated data with unequal variances, where σ1 = 0.5, σ2 = 1 and n1 + n2 ≡ 60. The expected Precision Rates and Recall Rates of the six methods, with their error limits, on the simulated data with unequal variances (σ1 = 0.5, σ2 = 1) in evaluation model 2, where the total number of samples is fixed at 60.
Figure 6
The average performance of the Regularized t-test minus the corresponding performance of the Welch t-test on the simulated data with varied variance σ2, where n1 + n2 ≡ 60 and σ1 ≡ 0.5. The average Precision Rate and Recall Rate of the Regularized t-test minus those of the Welch t-test on the simulated data with varied variance σ2, where σ1 ≡ 0.5 and n1 + n2 ≡ 60.
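
The simulated setting named in the captions (σ1 = 0.5, σ2 = 1, n1 + n2 = 60) can be explored with a small sketch like the one below. It does not reproduce the paper's evaluation models. The number of genes, the number of truly differential genes, the mean shift, the top-k selection rule, and the random seed are all assumptions chosen for illustration, and only the Welch t statistic is used for ranking.

```python
import math
import random
import statistics


def welch_t(x, y):
    """Welch t statistic with a separate variance estimate per group."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def precision_recall(selected, truly_de):
    """Precision and recall of a selected gene set against the true set."""
    hits = len(selected & truly_de)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(truly_de) if truly_de else 0.0
    return precision, recall


def simulate(n1, n2, sigma1=0.5, sigma2=1.0, n_genes=500, n_de=50,
             shift=1.0, top_k=50, seed=0):
    """Rank genes by |Welch t| and score the top_k against the true set.

    n_genes, n_de, shift, top_k and seed are hypothetical settings.
    """
    rng = random.Random(seed)
    truly_de = set(range(n_de))  # the first n_de genes carry a mean shift
    scores = []
    for g in range(n_genes):
        mu2 = shift if g in truly_de else 0.0
        c1 = [rng.gauss(0.0, sigma1) for _ in range(n1)]
        c2 = [rng.gauss(mu2, sigma2) for _ in range(n2)]
        scores.append((abs(welch_t(c1, c2)), g))
    scores.sort(reverse=True)
    selected = {g for _, g in scores[:top_k]}
    return precision_recall(selected, truly_de)


# Vary the split between the two classes while keeping n1 + n2 = 60.
for n1 in (10, 30, 50):
    p, r = simulate(n1, 60 - n1)
    print(f"n1={n1:2d} n2={60 - n1:2d} precision={p:.2f} recall={r:.2f}")
```

Because `top_k` equals `n_de` by default, precision and recall take the same value in each run; choosing a different `top_k` separates them.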
