BMC Bioinformatics. 2012 Jan 13:13:10.

doi: 10.1186/1471-2105-13-10.

MIPHENO: data normalization for high throughput metabolite analysis

Shannon M Bell¹, Lyle D Burgoon, Robert L Last

PMID: 22244038
PMCID: PMC3278354
DOI: 10.1186/1471-2105-13-10

Shannon M Bell et al. BMC Bioinformatics. 2012.

Affiliation

¹ Quantitative Biology Program, Michigan State University, East Lansing, MI, USA.

Abstract

Background: High throughput methodologies such as microarrays, mass spectrometry and plate-based small molecule screens are increasingly used to facilitate discoveries from gene function to drug candidate identification. These large-scale experiments are typically carried out over the course of months and years, often without the controls needed to compare directly across the dataset. Few methods are available to facilitate comparisons of high throughput metabolic data generated in batches where explicit in-group controls for normalization are lacking.

Results: Here we describe MIPHENO (Mutant Identification by Probabilistic High throughput-Enabled Normalization), an approach for post-hoc normalization of quantitative first-pass screening data in the absence of explicit in-group controls. This approach includes a quality control step and facilitates cross-experiment comparisons that decrease the false non-discovery rate, while maintaining the high accuracy needed to limit false positives in first-pass screening. Simulation results show an improvement in both accuracy and false non-discovery rate over a range of population parameters (p < 2.2 × 10⁻¹⁶) and a modest but significant (p < 2.2 × 10⁻¹⁶) improvement in area under the receiver operating characteristic curve: 0.955 for MIPHENO vs 0.923 for a group-based statistic (z-score). Analysis of the high throughput phenotypic data from the Arabidopsis Chloroplast 2010 Project (http://www.plastid.msu.edu/) showed an approximately 4-fold increase in the ability to detect previously described or expected phenotypes over the group-based statistic.
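For orientation, the group-based comparison statistic named above can be sketched as a within-group z-score. This is a minimal illustration and not the authors' code: the function name `group_z_scores` and the dictionary layout of the input are assumptions, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev


def group_z_scores(groups):
    """Score each value against the mean and sample SD of its own group.

    `groups` maps a group label (for example a plate or batch, hypothetical
    here) to a list of measurements. Each group needs at least two values
    and a non-zero spread.
    """
    scores = {}
    for name, values in groups.items():
        m = mean(values)
        s = stdev(values)
        scores[name] = [(v - m) / s for v in values]
    return scores
```

Because each value is scored only against its own group, a group in which most samples are true hits gives small z-scores for those hits. That is one reason a cross-group approach can miss fewer of them.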

Conclusions: The results demonstrate that MIPHENO offers substantial benefit in improving the ability to detect putative mutant phenotypes from post-hoc analysis of large data sets. Additionally, it facilitates data interpretation and permits cross-dataset comparison where group-based controls are missing. MIPHENO is applicable to a wide range of high throughput screens, and the code is freely available as Additional file 1 as well as through an R package on CRAN.

Figures

**Figure 1**
**Flowchart of MIPHENO**. "Input Data" (1) contains data with identifiable parameters for grouping/processing the data. The data pass through a quality control (QC) removal step (2), where groups not meeting the cutoffs are identified and removed on an attribute-by-attribute basis. Data are normalized (3) using a scaling factor based on the data distribution. Putative hits are identified (4) using a cumulative distribution function (CDF) built from the data or from a user-defined null distribution, and an empirical p-value is assigned to each observation. Thresholds can be established based on follow-up capacity and prior knowledge (e.g. ability to detect known 'gold standard' mutant samples).
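The four steps in the flowchart can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the published implementation. The caption does not say which QC cutoffs or which scaling factor are used, so the group-size and median-deviation cutoffs, the median-based scaling, and the two-sided p-value are all assumptions, and every function and parameter name is hypothetical.

```python
from bisect import bisect_left, bisect_right
from statistics import median


def qc_filter(groups, min_size=3, max_rel_dev=0.75):
    """Step 2 (assumed cutoffs): drop groups that are too small or whose
    median lies too far, in relative terms, from the global median."""
    global_med = median([v for values in groups.values() for v in values])
    kept = {}
    for name, values in groups.items():
        if len(values) < min_size:
            continue
        if abs(median(values) - global_med) / global_med > max_rel_dev:
            continue
        kept[name] = list(values)
    return kept


def normalize(groups):
    """Step 3 (assumed scaling factor): rescale each group so that its
    median matches the median of all retained data."""
    global_med = median([v for values in groups.values() for v in values])
    return {
        name: [v * global_med / median(values) for v in values]
        for name, values in groups.items()
    }


def empirical_p_values(values, null=None):
    """Step 4: two-sided empirical p-value from the empirical CDF of the
    data themselves or of a user-supplied null sample (assumed form)."""
    ref = sorted(null if null is not None else values)
    n = len(ref)
    p_values = []
    for v in values:
        lower = bisect_right(ref, v) / n       # fraction of ref <= v
        upper = (n - bisect_left(ref, v)) / n  # fraction of ref >= v
        p_values.append(min(1.0, 2 * min(lower, upper)))
    return p_values
```

A threshold on the resulting p-values would then be chosen from follow-up capacity or from known 'gold standard' samples, as the caption describes.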

**Figure 2**
**Synthetic Populations used in Testing**. Synthetic data were generated to measure the performance of the three different methods in a case where 'ground truth' is known. Samples were randomly drawn from a low abundance population (Low, blue line), high abundance population (High, red line) or a WT population (WT, black line). Two population structures were sampled, one with a low probability of WT, P(WT) = 0.4, shown in the upper panels (A, C), and the other with a high probability of WT, P(WT) = 0.93, shown in the lower panels (B, D). To test the effect of population shape, equal relative standard deviation (RSD = 15%, A and B) or equal standard deviation (SD = 5, C and D) were independently tested.
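A synthetic population of this kind can be sketched as a three-component mixture. The caption gives P(WT), the RSD and the SD, but not the population means, the distribution family, or how non-WT samples are split, so the normal distributions, the equal Low/High split, and the example means below are assumptions made only for illustration.

```python
import random


def draw_sample(rng, p_wt, wt_mean, low_mean, high_mean, rsd=None, sd=None):
    """Draw one (label, value) pair from a WT/Low/High mixture.

    Exactly one of `rsd` (spread proportional to the mean) or `sd`
    (same spread for every population) must be given. Non-WT samples
    are split equally between Low and High (an assumption).
    """
    if (rsd is None) == (sd is None):
        raise ValueError("give exactly one of rsd or sd")
    u = rng.random()
    if u < p_wt:
        label, mean = "WT", wt_mean
    elif u < p_wt + (1 - p_wt) / 2:
        label, mean = "Low", low_mean
    else:
        label, mean = "High", high_mean
    spread = rsd * mean if rsd is not None else sd
    return label, rng.gauss(mean, spread)


# Example: high-WT structure with equal RSD; the means are hypothetical.
rng = random.Random(0)
population = [
    draw_sample(rng, p_wt=0.93, wt_mean=100, low_mean=60, high_mean=140, rsd=0.15)
    for _ in range(1000)
]
```

Because the labels are kept alongside the values, any classifier run on `population` can be scored against known ground truth.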

**Figure 3**
**Performance of Methods on Synthetic Data: AUC**. The AUC was used to evaluate classification performance of MIPHENO, the use of raw data followed by a CDF classifier (RAW), and a group-based metric (Z) on synthetic data described in Figure 2. MIPHENO (pink, first in set) outperforms both RAW (green, middle) and Z (blue, last in set) across the different population parameters.
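For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen true hit receives a higher score than a randomly chosen non-hit. A minimal sketch of that pairwise definition follows. The function name is hypothetical, and for p-value based methods the score is assumed to be something like 1 − p so that larger means "more likely a hit".

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

The double loop is quadratic in the sample size, which is acceptable for a sketch. A rank-based formulation gives the same value more efficiently on large data sets.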

**Figure 4**
**Performance of Methods on Synthetic Data: Accuracy**. Accuracy of classification was used to compare the performance of MIPHENO, the use of raw data followed by a CDF classifier (RAW), and a group-based metric (Z) on synthetic data from populations described in Figure 2. The percent accuracy is plotted along the y-axis while the false discovery rate (FDR) cutoff is along the x-axis. Each population distribution tested is shown in a separate panel. Note that MIPHENO (pink) achieved higher classification accuracy than Z (blue) (p < 2.2e-15, Wilcoxon signed-rank test) and both methods outperformed RAW (green) independent of the population parameters tested.

**Figure 5**
**Performance of Methods on Synthetic Data: False Non-Discovery Rate**. The false non-discovery rate (FNDR, or percent positive hits missed) was used to compare the performance of MIPHENO, the use of raw data followed by a CDF classifier (RAW), and a group-based metric (Z) on synthetic data from populations described in Figure 2. The FNDR is plotted along the y-axis with the different false discovery rate (FDR) cutoffs along the x-axis. Each population distribution is shown in a different panel. Note that across all populations tested, MIPHENO (pink) has a lower FNDR than the other two methods, suggesting that fewer putative hits will be missed with MIPHENO than with the Z-score (blue) or raw data (green).
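The accuracy and missed-hit measures plotted in Figures 4 and 5 can be sketched from a simple confusion count. This sketch follows the caption's parenthetical wording (the share of true hits that were not called) for the missed-hit measure. That reading is an assumption: another common definition of the false non-discovery rate divides the missed hits by all samples called negative, and the source does not state which formula was used. All names are hypothetical.

```python
def confusion_counts(truth, called):
    """Return (tp, fp, fn, tn) for parallel lists of booleans."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    return tp, fp, fn, tn


def accuracy(truth, called):
    """Fraction of samples classified correctly."""
    tp, fp, fn, tn = confusion_counts(truth, called)
    return (tp + tn) / (tp + fp + fn + tn)


def fraction_hits_missed(truth, called):
    """Share of true hits that were not called (assumed reading of
    'percent positive hits missed')."""
    tp, _, fn, _ = confusion_counts(truth, called)
    return fn / (tp + fn)
```

With synthetic data, `truth` would be derived from the known population labels and `called` from applying a chosen cutoff to each method's scores.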

**Figure 6**
**Flowchart of Performance Measures for Chloroplast 2010 Data**. Metabolite data from wild-type Col-0 ecotype samples were taken from the Chloroplast 2010 dataset. MIPHENO empirical p-values and z-scores were calculated separately for metabolite values reported as mol % and nmol/g fresh weight (nmol/gFW) and results filtered according to criteria. Publicly available annotation (AraCyc and GO, Additional file 1) for annotated genes provided a basis of comparison between the two metrics.

Cited by

Analysis of Loss-of-Function Mutants in Aspartate Kinase and Homoserine Dehydrogenase Genes Points to Complexity in the Regulation of Aspartate-Derived Amino Acid Contents.
Clark TJ, Lu Y. Plant Physiol. 2015 Aug;168(4):1512-26. doi: 10.1104/pp.15.00364. Epub 2015 Jun 10. PMID: 26063505.

BioHackathon 2015: Semantics of data for life sciences and reproducible research.
Vos RA, Katayama T, Mishima H, Kawano S, Kawashima S, Kim JD, Moriya Y, Tokimatsu T, Yamaguchi A, Yamamoto Y, Wu H, Amstutz P, Antezana E, Aoki NP, Arakawa K, Bolleman JT, Bolton E, Bonnal RJP, Bono H, Burger K, Chiba H, Cohen KB, Deutsch EW, Fernández-Breis JT, Fu G, Fujisawa T, Fukushima A, García A, Goto N, Groza T, Hercus C, Hoehndorf R, Itaya K, Juty N, Kawashima T, Kim JH, Kinjo AR, Kotera M, Kozaki K, Kumagai S, Kushida T, Lütteke T, Matsubara M, Miyamoto J, Mohsen A, Mori H, Naito Y, Nakazato T, Nguyen-Xuan J, Nishida K, Nishida N, Nishide H, Ogishima S, Ohta T, Okuda S, Paten B, Perret JL, Prathipati P, Prins P, Queralt-Rosinach N, Shinmachi D, Suzuki S, Tabata T, Takatsuki T, Taylor K, Thompson M, Uchiyama I, Vieira B, Wei CH, Wilkinson M, Yamada I, Yamanaka R, Yoshitake K, Yoshizawa AC, Dumontier M, Kosaki K, Takagi T. F1000Res. 2020 Feb 24;9:136. doi: 10.12688/f1000research.18236.1. eCollection 2020. PMID: 32308977.

Utility and Limitations of Using Gene Expression Data to Identify Functional Associations.
Uygun S, Peng C, Lehti-Shiu MD, Last RL, Shiu SH. PLoS Comput Biol. 2016 Dec 9;12(12):e1005244. doi: 10.1371/journal.pcbi.1005244. eCollection 2016 Dec. PMID: 27935950.

Integrated LC-MS/MS system for plant metabolomics.
Sawada Y, Hirai MY. Comput Struct Biotechnol J. 2013 May 23;4:e201301011. doi: 10.5936/csbj.201301011. eCollection 2013. PMID: 24688692. Review.

Functional Metabolomics Describes the Yeast Biosynthetic Regulome.
Mülleder M, Calvani E, Alam MT, Wang RK, Eckerstorfer F, Zelezniak A, Ralser M. Cell. 2016 Oct 6;167(2):553-565.e12. doi: 10.1016/j.cell.2016.09.007. Epub 2016 Sep 29. PMID: 27693354.
