Bioinformatics. 2014 May 15;30(10):1400-8.
doi: 10.1093/bioinformatics/btu039. Epub 2014 Jan 22.

The most informative spacing test effectively discovers biologically relevant outliers or multiple modes in expression

Iwona Pawlikowska et al. Bioinformatics. 2014.

Abstract

Summary: Several outlier and subgroup identification statistics (OASIS) have been proposed to discover transcriptomic features with outliers or multiple modes in expression that are indicative of distinct biological processes or subgroups. Here, we borrow ideas from the OASIS methods in the bioinformatics and statistics literature to develop the 'most informative spacing test' (MIST) for unsupervised detection of such transcriptomic features. In an example application involving 14 cases of pediatric acute megakaryoblastic leukemia, MIST identified features that perfectly discriminate subjects according to gender or the presence of a prognostically relevant fusion gene more robustly than seven other OASIS methods did in the analysis of RNA-seq exon expression, RNA-seq exon junction expression and microarray exon expression data. MIST was also effective at identifying features related to gender or molecular subtype in an example application involving 157 adult cases of acute myeloid leukemia.

Availability: MIST will be freely available in the OASIS R package at http://www.stjuderesearch.org/site/depts/biostats

Contact: stanley.pounds@stjude.org

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
OASIS methods. (A) The LOO, LMS and SIBER methods. The x-coordinates of the points at the bottom show the observed data values. LOO leaves out one point (highlighted in light blue) and then fits a normal distribution to the remaining points (light blue curve). The value of the left-out observation is then compared with that normal distribution. LOO repeats this process for every observation. LMS identifies the narrowest interval that covers 50% of the observations (shown by the points highlighted in purple) and determines the normal distribution whose central 50% matches this interval (shown by the purple curve). LMS then compares all points to this normal distribution. SIBER fits a two-component mixture of normal distributions (shown in brown) and then expresses the distance between the two components in the form of a bimodality index (BI). (B) The maximum spacing test (MAST) and DT methods. The black points show the observed data values (x-coordinate) and their EDF (y-coordinate). MAST determines the largest difference between consecutive ordered data values (shown by the red line) and compares its value to the distribution of the largest difference between consecutive ordered independent uniform(0,1) observations (data not shown). DT determines the cumulative distribution function of the best-fitting non-parametric unimodal distribution (shown by the gray curve) and then determines the largest difference between the fitted curve and the EDF (shown by the dark blue line).
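The LOO, LMS and MAST descriptions in this caption can be made concrete with a small sketch. The Python below is an illustration only, not the OASIS R package or the authors' implementation. All function names are hypothetical. Two choices are assumptions that the caption does not specify: LOO is summarized here as a z-score, and the MAST reference distribution is approximated by Monte Carlo simulation after rescaling each sample by its range.

```python
import random
from statistics import NormalDist, mean, stdev


def loo_z_scores(values):
    """Leave each point out, fit a normal to the rest, and score the left-out point."""
    scores = []
    for i, x in enumerate(values):
        rest = values[:i] + values[i + 1:]
        # Assumption: the comparison is summarized as a z-score.
        scores.append((x - mean(rest)) / stdev(rest))
    return scores


def lms_normal(values):
    """Normal distribution whose central 50% matches the narrowest half-sample interval."""
    xs = sorted(values)
    n = len(xs)
    h = (n + 1) // 2  # number of points the interval must cover (half, rounded up)
    start = min(range(n - h + 1), key=lambda i: xs[i + h - 1] - xs[i])
    lo, hi = xs[start], xs[start + h - 1]
    q75 = NormalDist().inv_cdf(0.75)  # about 0.6745
    # The central 50% of N(mu, sigma) is mu +/- q75 * sigma.
    return NormalDist(mu=(lo + hi) / 2, sigma=(hi - lo) / (2 * q75))


def max_spacing(values):
    """Largest difference between consecutive ordered values."""
    xs = sorted(values)
    return max(b - a for a, b in zip(xs, xs[1:]))


def mast_p_value(values, n_sim=2000, seed=0):
    """Monte Carlo p-value for the largest spacing against uniform(0,1) samples.

    Assumption: observed and simulated spacings are both divided by the
    sample range so that they are on a comparable scale.
    """
    xs = sorted(values)
    observed = max_spacing(xs) / (xs[-1] - xs[0])
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        u = sorted(rng.random() for _ in range(len(xs)))
        if max_spacing(u) / (u[-1] - u[0]) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    data = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 9.0]  # made-up values with one outlier
    print("LOO z-scores:", [round(z, 2) for z in loo_z_scores(data)])
    fit = lms_normal(data)
    print("LMS normal: mu=%.3f sigma=%.3f" % (fit.mean, fit.stdev))
    print("Largest spacing:", max_spacing(data))
    print("MAST Monte Carlo p-value:", mast_p_value(data))
```

The example data are invented for illustration. In this sketch the isolated value should receive a large LOO score, lie far outside the LMS-fitted normal, and produce a small Monte Carlo p-value for the largest spacing.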
Fig. 2.
GLIS2 OASIS results for (A) RNA-seq exon read-count data, (B) RNA-seq junction read-count data, (C) exon array exon expression data and (D) exon array gene expression data. Each panel shows the results of DIP, MIST, MAST, SIBER, LMS and LOO for the indicated form of data. The DIP results are indicated by the vertical bar and two points in the right margin of the plot. The two dots indicate the vertical positions that define the dip statistic, the maximum vertical distance between the EDF (step function defined by black dots) and the best-fitting UDF (shown by the gray curve). The length of the bar corresponds to the value of the dip statistic that has [formula image] so that significance at the [formula image] level is indicated by the dots falling beyond the endpoints of the bar. The results of MIST, MAST, SIBER, LMS and LOO are shown in the bottom margin. For each of these methods, the length of the bar indicates the distance between the two points that defines significance at the [formula image] level. For MIST and MAST, the two points correspond to the data values that define the spacing of interest. For SIBER, the points correspond to the estimated means of the two-component mixture model. The results of LMS and LOO are shown by 99% intervals, and points falling outside those intervals were identified as outliers.
Fig. 3.
Most significant RNA-seq exon according to each OASIS method. (A) The data for the top hit according to DIP, LMS–SST, MAST and MIST. (B) The data for the top hit according to LMS–MOP. (C) The data for the top hit according to SIBER. (D) The data for the top hit according to LOO–MOP and LOO–SST. There is no result for SIBER in (D) because the variance cannot be estimated from a single data point (the one data point in the top right corner).
