BMC Bioinformatics. 2013 Jul 15:14:223.
doi: 10.1186/1471-2105-14-223.

A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration

Roger S Day et al. BMC Bioinformatics.

Abstract

Background: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: "molecular identification" (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices.

Results: We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events.

Conclusions: The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
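As a minimal sketch of the ingredients listed in the Results, assuming a plain two-component Gaussian mixture on the observed correlations (the paper's own model family is not reproduced here), the following Python fits the mixture by EM and converts posterior probabilities into an expected-utility score. The function names, the simulated correlations, and the utility values u_true and loss_false are all hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_coupled(x, pi1, mu1, sd0, sd1):
    """Posterior probability that a pair with correlation x is in the
    'coupled and correctly mapped' component."""
    a = (1.0 - pi1) * normal_pdf(x, 0.0, sd0)  # decoupled or mismatched, mean fixed at 0
    b = pi1 * normal_pdf(x, mu1, sd1)          # coupled, free mean
    return b / (a + b + 1e-300)                # tiny constant guards against 0/0


def fit_mixture(r, n_iter=200, sd_floor=0.02):
    """EM for the two-component mixture; returns (pi1, mu1, sd0, sd1)."""
    pi1, mu1, sd0, sd1 = 0.5, max(r) / 2.0, 0.2, 0.2
    for _ in range(n_iter):
        # E-step: posterior membership of each correlation
        post = [posterior_coupled(x, pi1, mu1, sd0, sd1) for x in r]
        w1 = sum(post)
        w0 = len(r) - w1
        if w1 < 1e-9 or w0 < 1e-9:
            break  # one component has emptied out; stop updating
        # M-step: weighted updates of proportion, mean, and spreads
        pi1 = w1 / len(r)
        mu1 = sum(p * x for p, x in zip(post, r)) / w1
        var1 = sum(p * (x - mu1) ** 2 for p, x in zip(post, r)) / w1
        var0 = sum((1.0 - p) * x * x for p, x in zip(post, r)) / w0
        sd1 = max(sd_floor, math.sqrt(var1))
        sd0 = max(sd_floor, math.sqrt(var0))
    return pi1, mu1, sd0, sd1


def expected_utility(p, u_true=1.0, loss_false=1.0):
    """Expected utility of accepting an ID pair whose posterior is p
    (rejecting the pair is taken to have utility 0)."""
    return p * u_true - (1.0 - p) * loss_false


def mean_expected_utility(posteriors, u_true=1.0, loss_false=1.0):
    """Average expected utility over all pairs a mapping method proposes."""
    return sum(expected_utility(p, u_true, loss_false) for p in posteriors) / len(posteriors)


# Simulated correlations (hypothetical): 300 near zero, 200 centred on 0.6.
rng = random.Random(0)
corrs = [rng.gauss(0.0, 0.15) for _ in range(300)] + [rng.gauss(0.6, 0.15) for _ in range(200)]
corrs = [max(-1.0, min(1.0, c)) for c in corrs]

params = fit_mixture(corrs)
posteriors = [posterior_coupled(c, *params) for c in corrs]
print("fitted (pi1, mu1, sd0, sd1):", params)
print("mean expected utility:", mean_expected_utility(posteriors))
```

Under this sketch, two competing IDM or IDF methods could be compared by computing mean_expected_utility over the pairs each method retains; a pair is worth accepting when its posterior exceeds loss_false / (u_true + loss_false).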

Figures

Figure 1
Hypothetical mixture components for correlation. Observed (black): marginal density of correlations. Mis-identified (red, dotted): density of correlations where either feature is mis-identified, or they are incorrectly mapped. Decoupled (green): density of correlations of pairs correctly mapped but biologically uncorrelated (“discordant”). Coupled (blue): density of correlations of pairs correctly mapped and biologically coupled.
Figure 2
Distribution and mixture fit of observed correlations. Black curve: a crude empirical density smooth for the observed correlations, which correspond to the “rug” whiskers along the bottom. Brown bimodal curve: the mixture fit to the underlying “true” correlation distribution, marginalized over the mixture component. Pink and green dotted curves: the mixture components, multiplied by their probabilities. Pink: decoupled (“0”) or mismatched (“x”). Green: coupled and correctly mapped (“+”).
Figure 3
Bootstrap estimates of standard deviation of the correlations. Each point shows the Pearson correlation between features of an ID pair, versus the square root of its bootstrap variance estimate (R=200 replications). The blue solid line is a loess smooth of these points. The red dotted line is from the normal theory expression for the variance, (1 − ρ²)/(n − 3), of a Pearson correlation coefficient estimate ρ̂. The smooth fit for the relationship between the correlation and the bootstrap standard deviation follows the normal theory curve well except at large values, but the individual bootstrap estimates vary from the curve substantially.
Figure 4
Relationship between posterior probability and posterior standard deviation. The sizes of the circles are proportional to the measurement standard deviation. The curve is a density estimate for the posterior probability.
Figure 5
Scatterplots of spectral counts versus microarray probeset signals. Two probesets selected out of five mapped to the annexin 2 UniProt accession P07355. Symbols: N=non-cancer, S=serous carcinoma, E=endometrioid carcinoma. Figures are adapted from Day et al. [6].
Figure 6
Effect of spectral count filtering by threshold on average expected utility. Horizontal axis shows the proportion to be excluded for the two spectral count criteria. Vertical axis shows the mean expected utility (averaging across pairs).
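
The comparison described in the Figure 3 caption can be sketched as follows, assuming simulated bivariate normal data in place of the paper's measurements. The helper names are hypothetical, and caption_sd simply takes the square root of the variance expression as it is printed in the caption; the sketch does not verify that expression against the original typeset formula.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def bootstrap_sd(x, y, n_boot=200, rng=None):
    """Bootstrap standard deviation of the Pearson correlation,
    resampling (x, y) pairs with replacement n_boot times."""
    rng = rng or random.Random()
    n = len(x)
    reps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    mean = sum(reps) / n_boot
    return math.sqrt(sum((r - mean) ** 2 for r in reps) / (n_boot - 1))


def caption_sd(rho, n):
    """Square root of the variance expression as printed in the caption."""
    return math.sqrt((1.0 - rho ** 2) / (n - 3))


def simulate_pair(rho, n, rng):
    """Draw n points from a bivariate normal with correlation rho (illustrative data)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in x]
    return x, y


rng = random.Random(1)
for rho in (0.0, 0.5, 0.9):
    x, y = simulate_pair(rho, 50, rng)
    r_hat = pearson(x, y)
    print(f"rho={rho}: r_hat={r_hat:.3f}, "
          f"bootstrap SD={bootstrap_sd(x, y, rng=rng):.3f}, "
          f"caption expression={caption_sd(r_hat, 50):.3f}")
```

Plotting the bootstrap SD against r_hat for many simulated pairs would give a scatter analogous to the one the caption describes, with individual bootstrap estimates scattered around any smooth curve.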

References

    1. Kahlem P, Clegg A, Reisinger F, Xenarios I, Hermjakob H, Orengo C, Birney E. ENFIN–A European network for integrative systems biology. Comptes Rendus Biol. 2009;332:1050–1058. doi: 10.1016/j.crvi.2009.09.003.
    2. Pages H, Carlson M, Falcon S, Li N. AnnotationDbi: Annotation Database Interface. R package version 1.18.1. Bioconductor Release. 2012;2.11. http://www.bioconductor.org/packages/2.11/bioc/html/AnnotationDbi.html.
    3. Razumovskaya J, Olman V, Xu D, Uberbacher EC, VerBerkmoes NC, Hettich RL, Xu Y. A computational method for assessing peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST. Proteomics. 2004;4:961–969. doi: 10.1002/pmic.200300656.
    4. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
    5. Lesk AM. Database annotation in molecular biology. Chichester, West Sussex: Hoboken, NJ; 2005.