Bioinformatics. 2014 May 15;30(10):1400-8.

doi: 10.1093/bioinformatics/btu039. Epub 2014 Jan 22.

The most informative spacing test effectively discovers biologically relevant outliers or multiple modes in expression

Iwona Pawlikowska¹, Gang Wu, Michael Edmonson, Zhifa Liu, Tanja Gruber, Jinghui Zhang, Stan Pounds

Affiliation

¹ Departments of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, USA, Institute of Mathematics, University of Silesia, Katowice, Poland, Department of Computational Biology and Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN, USA.

PMID: 24458951
PMCID: PMC4068004
DOI: 10.1093/bioinformatics/btu039

Iwona Pawlikowska et al. Bioinformatics. 2014.

Abstract

Summary: Several outlier and subgroup identification statistics (OASIS) have been proposed to discover transcriptomic features with outliers or multiple modes in expression that are indicative of distinct biological processes or subgroups. Here, we borrow ideas from the OASIS methods in the bioinformatics and statistics literature to develop the 'most informative spacing test' (MIST) for unsupervised detection of such transcriptomic features. In an example application involving 14 cases of pediatric acute megakaryoblastic leukemia, MIST more robustly identified features that perfectly discriminate subjects according to gender or the presence of a prognostically relevant fusion gene than did seven other OASIS methods in the analysis of RNA-seq exon expression, RNA-seq exon junction expression and microarray exon expression data. MIST was also effective at identifying features related to gender or molecular subtype in an example application involving 157 adult cases of acute myeloid leukemia.

Availability: MIST will be freely available in the OASIS R package at http://www.stjuderesearch.org/site/depts/biostats

Contact: stanley.pounds@stjude.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
OASIS methods. (A) The LOO, LMS and SIBER methods. The x-coordinates of the points at the bottom show the observed data values. LOO leaves out one point (highlighted in light blue) and fits a normal distribution to the remaining points (light blue curve). The left-out observation is then compared with that normal distribution. LOO repeats this process for every observation. LMS identifies the narrowest interval that covers 50% of the observations (points highlighted in purple) and determines the normal distribution whose central 50% matches this interval (purple curve). LMS then compares all points with this normal distribution. SIBER fits two-component mixtures of normal distributions (shown in brown) and then calculates the distance between the components in the form of a bimodality index (BI). (B) The maximum spacing test (MAST) and DT methods. The black points show the observed data values (x-coordinate) and their empirical distribution function, EDF (y-coordinate). MAST determines the largest difference between consecutive ordered data values (red line) and compares it with the distribution of the largest difference between consecutive ordered independent uniform(0,1) observations (data not shown). DT determines the cumulative distribution function of the best-fitting non-parametric unimodal distribution (gray curve) and then determines the largest difference between the fitted curve and the EDF (dark blue line).
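
The LOO, LMS and MAST steps described in this caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' OASIS R package: the function names, the choice of `ceil(n / 2)` points for the "50%" interval, and the example values are all assumptions made here, and the final comparison of the largest spacing with its uniform(0,1) reference distribution is omitted.

```python
import math
import statistics


def loo_z_scores(values):
    """Leave-one-out: standardize each point against a normal fit of the rest."""
    values = list(values)
    scores = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]
        mu = statistics.mean(rest)
        sd = statistics.stdev(rest)  # sample SD; assumes the rest are not all equal
        scores.append((x - mu) / sd)
    return scores


def shortest_half(values):
    """Narrowest interval covering ceil(n/2) of the ordered observations."""
    xs = sorted(values)
    n = len(xs)
    h = math.ceil(n / 2)  # assumed reading of "50% of the observations"
    start = min(range(n - h + 1), key=lambda i: xs[i + h - 1] - xs[i])
    return xs[start], xs[start + h - 1]


def lms_normal(values):
    """Mean and SD of the normal whose central 50% matches the shortest half."""
    lo, hi = shortest_half(values)
    mu = (lo + hi) / 2
    # The central 50% of N(mu, sigma) spans mu +/- z_0.75 * sigma.
    z75 = statistics.NormalDist().inv_cdf(0.75)
    sigma = (hi - lo) / (2 * z75)
    return mu, sigma


def largest_spacing(values):
    """Largest gap between consecutive ordered values: (gap, lower, upper)."""
    xs = sorted(values)
    return max((xs[i + 1] - xs[i], xs[i], xs[i + 1]) for i in range(len(xs) - 1))


# Made-up expression values with one apparent outlier.
expr = [1.0, 2.0, 1.5, 2.5, 2.0, 10.0]
print(loo_z_scores(expr))
print(lms_normal(expr))
print(largest_spacing(expr))
```

In this toy input the last value dominates all three summaries: it has the largest leave-one-out score, lies far outside the normal curve implied by the shortest half, and bounds the largest spacing.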

**Fig. 2.**
GLIS2 OASIS results for (A) RNA-seq exon read-count data, (B) RNA-seq junction read-count data, (C) exon array exon expression data and (D) exon array gene expression data. Each panel shows the results of DIP, MIST, MAST, SIBER, LMS and LOO for the indicated form of data. The DIP results are indicated by the vertical bar and two points in the right margin of the plot. The two dots indicate the vertical positions that define the dip statistic, the maximum vertical distance between the EDF (step function defined by black dots) and the best-fitting UDF (gray curve). The length of the bar corresponds to the critical value of the dip statistic at the chosen significance level, so that significance at that level is indicated by the dots falling beyond the endpoints of the bar. The results of MIST, MAST, SIBER, LMS and LOO are shown in the bottom margin. For each of these methods, the length of the bar indicates the distance between the two points that defines significance at the chosen level. For MIST and MAST, the two points correspond to the data values that define the spacing of interest. For SIBER, the points correspond to the estimated means of the two-component mixture model. The results of LMS and LOO are shown by 99% intervals, and points falling outside those intervals were identified as outliers.

**Fig. 3.**
Most significant RNA-seq exon by each OASIS method. (A) The data for the top hit according to DIP, LMS–SST, MAST and MIST. (B) The data for the top hit according to LMS–MOP. (C) The data for the top hit according to SIBER. (D) The data for the top hit according to LOO–MOP and LOO–SST. There is no result for SIBER in (D) because the variance cannot be estimated from a single data point (the one data point in the top right corner).
