A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration
- PMID: 23855655
- PMCID: PMC3734162
- DOI: 10.1186/1471-2105-14-223
Abstract
Background: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from correctly identifying these molecular measurements with actual molecular species, but MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller but more accurate dataset. Identifier mapping (IDM) is needed when an analyst combines two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (for example, pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing these choices.
Results: We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.
Conclusions: The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
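The abstract names the ingredients (an association measure per ID pair, a mixture model, and a decision framework) without giving formulas, so the following is only a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the association measure is a Pearson correlation transformed with Fisher's z, that the mixture has two Gaussian components (mismapped pairs centered at zero, correctly mapped pairs with a free mean), and that the gain and loss values are chosen by the analyst. All function and parameter names are hypothetical.

```python
import numpy as np

def fisher_z(r):
    """Fisher z-transform of correlations (assumed association measure)."""
    return np.arctanh(np.clip(r, -0.999999, 0.999999))

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_mixture(z, n_iter=500, tol=1e-9):
    """EM for a two-component Gaussian mixture on z-scores.

    Component 0 (assumed "mismapped" pairs) has its mean fixed at 0.
    Component 1 (assumed "correctly mapped" pairs) has a free mean.
    """
    z = np.asarray(z, dtype=float)
    pi1, mu1 = 0.5, z.mean() + z.std()
    s0 = s1 = z.std()
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is correctly mapped
        d0 = (1 - pi1) * normal_pdf(z, 0.0, s0)
        d1 = pi1 * normal_pdf(z, mu1, s1)
        post = d1 / (d0 + d1)
        loglik = np.log(d0 + d1).sum()
        if loglik - prev < tol:
            break
        prev = loglik
        # M-step: weighted updates of proportion, mean, and spreads
        pi1 = post.mean()
        mu1 = (post * z).sum() / post.sum()
        s1 = max(np.sqrt((post * (z - mu1) ** 2).sum() / post.sum()), 1e-6)
        s0 = max(np.sqrt(((1 - post) * z ** 2).sum() / (1 - post).sum()), 1e-6)
    return {"pi1": pi1, "mu1": mu1, "s0": s0, "s1": s1}

def posterior_correct(z, fit):
    """Posterior probability that a pair with score z is correctly mapped."""
    z = np.asarray(z, dtype=float)
    d0 = (1 - fit["pi1"]) * normal_pdf(z, 0.0, fit["s0"])
    d1 = fit["pi1"] * normal_pdf(z, fit["mu1"], fit["s1"])
    return d1 / (d0 + d1)

def decision_threshold(gain_tp, loss_fp):
    """Keep a pair when posterior * gain_tp > (1 - posterior) * loss_fp."""
    return loss_fp / (gain_tp + loss_fp)

def expected_utility(post, gain_tp, loss_fp):
    """Expected utility of accepting every pair in a candidate ID set."""
    post = np.asarray(post, dtype=float)
    return float((post * gain_tp - (1 - post) * loss_fp).sum())

if __name__ == "__main__":
    # Simulated z-scores only; no real platform data is used here.
    rng = np.random.default_rng(0)
    z = np.concatenate([rng.normal(0.0, 0.2, 300), rng.normal(0.8, 0.2, 200)])
    fit = fit_mixture(z)
    post = posterior_correct(z, fit)
    keep = post > decision_threshold(gain_tp=1.0, loss_fp=2.0)
    print(fit, int(keep.sum()), expected_utility(post[keep], 1.0, 2.0))
```

In this sketch, two competing IDM resources or IDF rules could be compared by computing `expected_utility` on the pairs each one retains, with the same fitted mixture and the same gain and loss values.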