Bioinformatics. 2020 May 1;36(9):2787-2795.
doi: 10.1093/bioinformatics/btaa064.

A Bayesian approach to accurate and robust signature detection on LINCS L1000 data

Yue Qiu et al. Bioinformatics.

Abstract

Motivation: The LINCS L1000 dataset contains a large body of cellular expression data induced by large sets of perturbagens. Although it is an invaluable resource for drug discovery and for understanding disease mechanisms, the existing peak deconvolution algorithms cannot recover accurate gene expression levels in many cases, which introduces severe noise into the dataset and limits its applications in biomedical studies.

Results: Here, we present a novel Bayesian peak deconvolution algorithm that gives unbiased likelihood estimates for peak locations and characterizes the peaks with probability-based z-scores. Based on this algorithm, we build a pipeline that processes raw data from the L1000 assay into signatures that represent the features of each perturbagen. The performance of the proposed pipeline is evaluated using the similarity between the signatures of bio-replicates and between the signatures of drugs with shared targets, and the results show that signatures derived from our pipeline give a substantially more reliable and informative representation of perturbagens than existing methods. Thus, the new pipeline may significantly boost the performance of L1000 data in downstream applications such as drug repurposing, disease modeling and gene function prediction.
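
To make the idea of Bayesian peak deconvolution concrete, the following is a minimal sketch and not the authors' implementation (that lives in the repository cited below). It assumes the two genes sharing a bead color appear with abundances 2/3 and 1/3 and that each peak follows a Student's t-distribution with three degrees of freedom, both as stated in the figure captions. The constant peak width `SCALE`, the flat prior, the grid search, and the helper names `t_pdf`, `peak_posteriors` and `simulate_beads` are illustrative assumptions.

```python
import math
import random

DOF = 3.0                               # degrees of freedom named in the captions
SCALE = 0.15                            # ASSUMED constant peak width (log2 units)
W_HIGH, W_LOW = 2.0 / 3.0, 1.0 / 3.0    # bead abundances of the two genes


def t_pdf(x, loc, scale=SCALE, dof=DOF):
    """Density of a location-scale Student's t-distribution at x."""
    norm = math.gamma((dof + 1) / 2) / (
        math.sqrt(dof * math.pi) * math.gamma(dof / 2) * scale)
    z = (x - loc) / scale
    return norm * (1 + z * z / dof) ** (-(dof + 1) / 2)


def peak_posteriors(beads, grid):
    """Marginal posteriors of the high- and low-abundance peak locations.

    Uses a flat prior over all (high, low) pairs on `grid` (an assumption).
    """
    n = len(grid)
    # Density of every bead under a peak centered at every grid location.
    dens = [[t_pdf(x, g) for x in beads] for g in grid]
    # Log-likelihood of the two-component mixture for each location pair.
    loglik = [[sum(math.log(W_HIGH * a + W_LOW * b)
                   for a, b in zip(dens[i], dens[j]))
               for j in range(n)] for i in range(n)]
    top = max(max(row) for row in loglik)   # subtract the max to avoid underflow
    joint = [[math.exp(v - top) for v in row] for row in loglik]
    total = sum(sum(row) for row in joint)
    post_high = [sum(joint[i]) / total for i in range(n)]
    post_low = [sum(joint[i][j] for i in range(n)) / total for j in range(n)]
    return post_high, post_low


def simulate_beads(seed=0, high=8.0, low=10.0, n_high=40, n_low=20, noise=0.15):
    """Hypothetical test data: log2 intensities for two genes mixed 2:1."""
    rng = random.Random(seed)
    return ([rng.gauss(high, noise) for _ in range(n_high)] +
            [rng.gauss(low, noise) for _ in range(n_low)])


beads = simulate_beads()
grid = [6.0 + 0.1 * k for k in range(61)]   # candidate log2 peak locations
post_high, post_low = peak_posteriors(beads, grid)
best_high = grid[max(range(len(grid)), key=post_high.__getitem__)]
best_low = grid[max(range(len(grid)), key=post_low.__getitem__)]
print("high-abundance peak:", best_high, "low-abundance peak:", best_low)
```

Keeping the full posterior for each peak, rather than only its maximum, is what allows uncertainty about a peak location to be carried into later steps; the unequal weights are what let the model decide which peak belongs to which gene.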

Availability and implementation: The code and the precomputed data for LINCS L1000 Phase II (GSE70138) are available at https://github.com/njpipeorgan/L1000-bayesian.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Illustration of the pipeline for robust L1000 perturbagen signature detection
Fig. 2.
(a) The calibrated reference expression values of the invariant set compared with the original ones. The error bars show the ranges corresponding to ±1 SD. The data points are horizontally offset for better visibility. (b) The distribution of the slope and χ² of the fitted relationship between unscaled expression values and calibrated reference values
Fig. 3.
The median and 68% confidence interval of the best-fit degrees of freedom (DOF) of the Student’s t-distribution. Each data point corresponds to the DOFs of all peaks within its neighborhood (±0.01 in terms of log2 peak location). Our choice of three DOF is shown by a gray line
Fig. 4.
The relationship between the scale parameter of the Student’s t-distribution and the log2 expression level, i.e. the center of an isolated peak
Fig. 5.
Hexplots between the true peak and recovered peak positions from Bayesian MLE, k-means and AGMM. The darkness of the hexagon indicates the number of points in it. The Pearson correlation coefficient and MSE for each method are shown in each figure
Fig. 6.
Typical examples where discrepancies occur between Bayesian MLE, L1000 level 2 data and AGMM. In panels (a–d), the results from L1000 peak deconvolution and AGMM are shown as red and green arrows, where the thick and thin arrows indicate the peaks with high (2/3) and low (1/3) abundances, respectively. The results from our Bayesian method are shown as probability distributions in thick and thin blue curves, respectively. The peak locations used in Bayesian MLE are the fluorescence intensities at which the probability distributions reach their maxima. The examples are from well REP.A028_MCF7_24H_X2_B25_D11
Fig. 7.
(a) Receiver operating characteristic (ROC) curves for replicate identification. Expression profiles from our Bayesian pipeline, Bayesian MLE and the L1000 standard pipeline are tested with GSEA under different query sizes. The three methods are labeled as Bayesian, Bayesian MLE and L1000 level 4 in the figure. (b) The comparison of the median quantile (FPR at TPR = 0.5)
Fig. 8.
(a) ROC curves for similar perturbagen recognition based on combined z-scores from Bayesian, Bayesian MLE and L1000 level 5 data. The area under the curve is also shown for each method. (b) The comparison of the median quantile (FPR at TPR = 0.5)
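
The evaluation metrics named in the captions for Figs. 7 and 8 (ROC curve, area under the curve, and FPR at TPR = 0.5) can be illustrated with a short sketch. This is not the authors' evaluation code: the similarity scores below are made-up numbers, and the function names `roc_points`, `auc` and `fpr_at_tpr` are hypothetical. Positives stand for true pairs (for example, bio-replicates) and negatives for unrelated pairs.

```python
def roc_points(pos, neg):
    """(FPR, TPR) points obtained by lowering a score threshold step by step."""
    labeled = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                     key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(labeled):
        score = labeled[i][0]
        # Tied scores cross the threshold together.
        while i < len(labeled) and labeled[i][0] == score:
            if labeled[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / len(neg), tp / len(pos)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def fpr_at_tpr(points, target=0.5):
    """Smallest FPR at which the TPR first reaches `target`."""
    for fpr, tpr in points:
        if tpr >= target:
            return fpr
    return 1.0


# Made-up similarity scores, for illustration only.
positives = [0.9, 0.8, 0.7, 0.4]
negatives = [0.85, 0.5, 0.3, 0.2, 0.1]
curve = roc_points(positives, negatives)
print("AUC:", auc(curve))
print("FPR at TPR = 0.5:", fpr_at_tpr(curve))
```

With `target=0.5`, the returned value is the fraction of unrelated pairs that score at least as high as the median true pair, which is why a lower value indicates a more discriminative signature.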

