2015 Aug 1;8:46.
doi: 10.1186/s12920-015-0116-y.

Diagnostic biases in translational bioinformatics

Henry Han. BMC Med Genomics. 2015.

Abstract

Background: With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on molecular signature detection driven by massive omics data. However, how to detect and prevent diagnostic biases in translational bioinformatics remains an unsolved problem, despite its importance in the coming era of personalized medicine.

Methods: In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq, and miRNA-Seq data with support vector machines (SVM) under different model selection methods. We further categorize the diagnostic biases into different types through rigorous kernel matrix analysis and provide effective machine learning methods to overcome them.

Results: We found that diagnostic biases occur for data with different distributions and for SVM with different kernels. We identify a total of three types of diagnostic bias in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias, and we present the corresponding causes through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and overcome because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose derivative component analysis based support vector machines (DCA-SVM), which overcome the label skewness bias and achieve rivaling-clinical diagnostic results.
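The "deceptive accuracy" of the label skewness bias can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 90-versus-10 label split is hypothetical and not taken from the study's data sets, and the "diagnosis" is a stand-in that always calls the majority type rather than the SVM analyzed in the paper. The helper names (`accuracy`, `per_class_recall`, `balanced_accuracy`) are likewise hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions of that class / class size."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; not inflated by label skew."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical skewed cohort: 90 samples of type +1 and 10 of type -1.
y_true = [1] * 90 + [-1] * 10

# Stand-in for a biased diagnosis: every sample is called the majority type.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("per-class recall:", per_class_recall(y_true, y_pred))
print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
```

In this toy setting the overall accuracy looks respectable while the minority type is never detected, which is why a per-class measure is a more informative check when the training labels are skewed.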

Conclusions: Our studies demonstrate that the diagnostic biases are mainly caused by three major factors: kernel selection, the signal amplification mechanism in high-throughput profiling, and the training data label distribution. Moreover, the proposed DCA-SVM diagnosis provides a generic solution for overcoming the label skewness bias, owing to the powerful feature extraction capability of derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also contributes to machine learning by adding new results to kernel-based learning for omics data.


Figures

Fig. 1
The kernel matrices of the overfitting bias. The first row shows box plots of all pairwise squared sample distances in each data set. The second row shows the kernel matrices of the three data sets under the ‘rbf’ kernel (σ=1), where each data set is treated as the full population of training data; these kernel matrices are identity matrices
Fig. 2
The distributions of α values in each diagnostic trial of the 5-fold cross validation for three data sets. The skewness of the sample label distribution leads to skewed distributions of α values in the diagnoses of the BreastIBC and Kidney data sets. The sign of each α value indicates the group of the corresponding support vector. As such, more support vectors are found for the majority-count type, which increases the likelihood that an unknown sample is diagnosed as the majority-count type
Fig. 3
Comparison of the kernel matrices in the label skewness and underfitting biases. The kernel matrices of the underfitting bias (‘mlp’ kernels) are compared with those of the linear kernels for the three data sets. The linear kernel matrices appear normal even though the label skewness bias occurs for the BreastIBC and Kidney data
Fig. 4
ROC plots of the DCA-SVM, SVM, PCA-SVM, and ICA-SVM diagnoses under 5-fold cross validation for the BreastIBC and Kidney data
Fig. 5
The phenotype separation for four data sets, using the top three biomarkers: GliomaRNASeq (LGG RNA-Seq), GliomaMiRNASeq (LGG miRNA-Seq), Kidney (KIRC RNA-Seq), and HCC (HCC MALDI-TOF)
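
The Fig. 1 caption links the overfitting bias to ‘rbf’ kernel matrices (σ=1) that are identity matrices. A minimal sketch of how that can arise is given below, assuming the common Gaussian form K(x, y) = exp(-||x - y||² / (2σ²)); the source does not spell out the exact kernel formula, so that form is an assumption. The data are synthetic stand-ins (6 samples, 200 features, values up to 1000), not the study's data sets, and the division by 1e5 is only a device to shrink the distances, not the remedy proposed in the paper. The function names are hypothetical.

```python
import math
import random


def rbf_kernel_matrix(samples, sigma=1.0):
    """Gram matrix with K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(samples)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            gram[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return gram


def max_off_diagonal(gram):
    """Largest kernel value between two distinct samples."""
    n = len(gram)
    return max(gram[i][j] for i in range(n) for j in range(n) if i != j)


# Synthetic stand-in for unnormalized profiles (hypothetical sizes and scale):
# 6 samples, 200 features, values spread over a wide numeric range.
rng = random.Random(0)
raw = [[rng.uniform(0.0, 1000.0) for _ in range(200)] for _ in range(6)]

# The same samples rescaled so that pairwise squared distances become small.
scaled = [[v / 1e5 for v in row] for row in raw]

K_raw = rbf_kernel_matrix(raw, sigma=1.0)
K_scaled = rbf_kernel_matrix(scaled, sigma=1.0)

print("max off-diagonal, raw scale:", max_off_diagonal(K_raw))
print("max off-diagonal, rescaled: ", max_off_diagonal(K_scaled))
```

When the pairwise squared distances are very large relative to σ², every off-diagonal entry collapses toward zero and the kernel matrix is effectively the identity: each training sample is similar only to itself, so the classifier has nothing to generalize from. That is consistent with the box plots of pairwise squared distances shown alongside the kernel matrices in Fig. 1.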

