BMC Bioinformatics. 2013 Jul 15;14:223.

doi: 10.1186/1471-2105-14-223.

A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration

Roger S Day¹, Kevin K McDade

PMID: 23855655
PMCID: PMC3734162
DOI: 10.1186/1471-2105-14-223

Roger S Day et al. BMC Bioinformatics. 2013.

Affiliation

¹ Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA. day01@pitt.edu

Abstract

Background: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices.

Results: We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.

Conclusions: The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
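The decision step mentioned in the Results can be illustrated with a minimal sketch. It assumes that a fitted mixture model has already produced, for each ID pair, a posterior probability of being correctly mapped and biologically coupled. The utility values, the names `Utilities`, `decide`, and `mean_expected_utility`, and the keep/discard action set are all hypothetical placeholders chosen for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utilities:
    """Hypothetical utilities; placeholder values, not taken from the paper."""
    true_positive: float = 1.0    # keep a pair that is correctly mapped and coupled
    false_positive: float = -1.0  # keep a pair that is mis-identified or decoupled
    discard: float = 0.0          # drop the pair, whatever its true status


def expected_utility_keep(posterior_plus: float, u: Utilities) -> float:
    """Expected utility of keeping a pair, given its posterior probability."""
    return posterior_plus * u.true_positive + (1.0 - posterior_plus) * u.false_positive


def decide(posterior_plus: float, u: Utilities = Utilities()):
    """Pick the action (keep or discard) with the larger expected utility."""
    eu_keep = expected_utility_keep(posterior_plus, u)
    if eu_keep > u.discard:
        return ("keep", eu_keep)
    return ("discard", u.discard)


def mean_expected_utility(posteriors, u: Utilities = Utilities()) -> float:
    """Average, across ID pairs, of the expected utility of the chosen action."""
    return sum(decide(p, u)[1] for p in posteriors) / len(posteriors)
```

Under these assumptions, two competing IDM or IDF methods could be compared by computing `mean_expected_utility` over the pairs each method produces and preferring the method with the larger value.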

Figures

**Figure 1**
**Hypothetical mixture components for correlation.** Observed (black): marginal density of correlations. Mis-identified (red, dotted): density of correlations where either feature is mis-identified, or they are incorrectly mapped. Decoupled (green): density of correlations of pairs correctly mapped but biologically uncorrelated (“discordant”). Coupled (blue): density of correlations of pairs correctly mapped and biologically coupled.

**Figure 2**
**Distribution and mixture fit of observed correlations.** Black curve: a crude empirical density smooth for the observed correlations, which correspond to the “rug” whiskers along the bottom. Brown bimodal curve: the mixture fit to the underlying “true” correlation distribution, marginalized over the mixture component. Pink and green dotted curves: the mixture components, multiplied by their probabilities. Pink: decoupled (“0”) or mismatched (“x”). Green: coupled and correctly mapped (“+”).
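A two-component mixture fit of the kind shown in this figure can be sketched with a plain EM loop. This is a simplified illustration, not the authors' model: it assumes both components are Gaussian, ignores the measurement error in each observed correlation, and uses hypothetical names (`fit_mixture`, `posterior_plus`, `simulate_correlations`) and simulated data.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_plus(x, params):
    """Posterior probability that correlation x belongs to the "+" component."""
    null = (1.0 - params["weight_plus"]) * normal_pdf(x, params["means"][0], params["sds"][0])
    plus = params["weight_plus"] * normal_pdf(x, params["means"][1], params["sds"][1])
    total = null + plus
    return plus / total if total > 0.0 else 0.5


def fit_mixture(xs, n_iter=200, sd_floor=1e-3):
    """EM for a two-component Gaussian mixture (needs at least two values).

    Component 0 stands for decoupled or mismatched pairs ("0" or "x"),
    component 1 for coupled, correctly mapped pairs ("+").
    """
    xs = list(xs)
    ordered = sorted(xs)
    half = len(ordered) // 2
    # Initialise the two means from the lower and upper halves of the data.
    params = {
        "means": [sum(ordered[:half]) / half,
                  sum(ordered[half:]) / (len(ordered) - half)],
        "sds": [0.2, 0.2],
        "weight_plus": 0.5,
    }
    for _ in range(n_iter):
        # E-step: responsibility of the "+" component for each correlation.
        post = [posterior_plus(x, params) for x in xs]
        # M-step: weighted means and standard deviations for both components.
        for k, resp in enumerate(([1.0 - p for p in post], post)):
            total = sum(resp)
            if total <= 0.0:
                continue
            mean = sum(r * x for r, x in zip(resp, xs)) / total
            var = sum(r * (x - mean) ** 2 for r, x in zip(resp, xs)) / total
            params["means"][k] = mean
            params["sds"][k] = max(math.sqrt(var), sd_floor)
        params["weight_plus"] = sum(post) / len(xs)
    return params


def simulate_correlations(n_null=300, n_plus=200, seed=1):
    """Toy data: correlations drawn from two Gaussians (illustrative only)."""
    rng = random.Random(seed)
    null = [rng.gauss(0.0, 0.15) for _ in range(n_null)]
    plus = [rng.gauss(0.6, 0.15) for _ in range(n_plus)]
    return null + plus
```

Calling `fit_mixture(simulate_correlations())` returns the estimated means, standard deviations, and "+" weight; `posterior_plus` then gives the per-pair posterior probability that would feed a decision rule.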

**Figure 3**
**Bootstrap estimates of standard deviation of the correlations.** Each point shows the Pearson correlation between features of an ID pair, versus the square root of its bootstrap variance estimate (R=200 replications). The blue solid line is a loess smooth of these points. The red dotted line is from the normal theory expression for the variance $(1-\rho^{2})/(n-3)$ of a Pearson correlation coefficient estimate $\hat{\rho}$. The smooth fit for the relationship between the correlation and the bootstrap standard deviation follows the normal theory curve well except at large values, but the individual bootstrap estimates vary from the curve substantially.
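The bootstrap estimate described in this caption can be sketched as follows. This is a minimal illustration assuming a simple nonparametric bootstrap that resamples the biological samples with replacement; the function names are hypothetical, and `caption_variance` simply evaluates the expression as quoted in the caption rather than asserting where it comes from.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def bootstrap_sd(xs, ys, replications=200, seed=0):
    """Bootstrap standard deviation of the Pearson correlation.

    Samples are drawn with replacement; replicates with an undefined
    correlation (constant resample) are skipped.
    """
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(replications):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):
            estimates.append(r)
    mean = sum(estimates) / len(estimates)
    var = sum((r - mean) ** 2 for r in estimates) / (len(estimates) - 1)
    return math.sqrt(var)


def caption_variance(rho, n):
    """Variance expression as quoted in the caption: (1 - rho^2) / (n - 3)."""
    return (1.0 - rho ** 2) / (n - 3)
```

Plotting `bootstrap_sd` against the observed correlation for every ID pair, together with the square root of `caption_variance`, would give a comparison of the kind the figure describes.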

**Figure 4**
**Relationship between posterior probability and posterior standard deviation.** The sizes of the circles are proportional to the measurement standard deviation. The curve is a density estimate for the posterior probability.

**Figure 5**
**Scatterplots of spectral counts versus microarray probeset signals.** Two probesets selected out of five mapped to the annexin 2 UniProt accession P07355. Symbols: N=non-cancer, S=serous carcinoma, E=endometrioid carcinoma. Figures are adapted from Day et al. [6].

**Figure 6**
**Effect of spectral count filtering by threshold on average expected utility.** The horizontal axis shows the proportion to be excluded for the two spectral count criteria. The vertical axis shows the mean expected utility (averaged across pairs).

Cited by

Ma X, Zheng J, He K, Wang L, Wang Z, Wang K, Liu Z, San Z, Zhao L, Wang L. TGFA expression is associated with poor prognosis and promotes the development of cervical cancer. J Cell Mol Med. 2024 Feb;28(3):e18086. doi: 10.1111/jcmm.18086. Epub 2023 Dec 28. PMID: 38152044.
Ayna Duran G. Bioinformatics Based Drug Repurposing Approach for Breast and Gynecological Cancers: RECQL4/FAM13C Genes Address Common Hub Genes and Drugs. Eur J Breast Health. 2025 Jan 1;21(1):63-73. doi: 10.4274/ejbh.galenos.2024.2024-11-2. PMID: 39744927.
Tian L, Zeng H, Tian L, Wang H, Liu W. UNC93B1: a novel immune-related prognostic biomarker in breast cancer. Discov Oncol. 2025 Jul 17;16(1):1352. doi: 10.1007/s12672-025-03124-8. PMID: 40673974.
Dong H, Sun M, Li H, Yue Y. CXCR3 predicts the prognosis of endometrial adenocarcinoma. BMC Med Genomics. 2023 Feb 7;16(1):20. doi: 10.1186/s12920-023-01451-9. PMID: 36750966.
Li H, Zhou Q, Wu Z, Lu X. Identification of novel key genes associated with uterine corpus endometrial carcinoma progression and prognosis. Ann Transl Med. 2023 Jan 31;11(2):100. doi: 10.21037/atm-22-6461. PMID: 36819577.

References

1. Kahlem P, Clegg A, Reisinger F, Xenarios I, Hermjakob H, Orengo C, Birney E. ENFIN–A European network for integrative systems biology. Comptes Rendus Biol. 2009;332:1050–1058. doi: 10.1016/j.crvi.2009.09.003.
2. Pages H, Carlson M, Falcon S, Li N. AnnotationDbi: Annotation Database Interface. R package version 1.18.1. Bioconductor Release. 2012;2.11 http://www.bioconductor.org/packages/2.11/bioc/html/AnnotationDbi.html.
3. Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y. A computational method for assessing peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics. 2004;4:961–969. doi: 10.1002/pmic.200300656.
4. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
5. Lesk AM. Database annotation in molecular biology. Chichester, West Sussex: Hoboken, NJ; 2005.
