The sbv IMPROVER Systems Toxicology Computational Challenge: Identification of Human and Species-Independent Blood Response Markers as Predictors of Smoking Exposure and Cessation Status

Vincenzo Belcastro¹, Carine Poussin¹, Yang Xiang¹, Maurizio Giordano², Kumar Parijat Tripathi², Akash Boda¹, Stéphanie Boué¹, Mario Guarracino², Florian Martin¹, Manuel C Peitsch¹, Julia Hoeng¹, Roberto Romero^{3,4,5,6}, Adi L Tarca^{7,8}, Zhongqu Duan^{9,10}, Hao Yang^{9,11}, Xiaofeng Gong^{9,10}, Peixuan Wang^{9,10}, Chenfang Zhang^{9,10}, Wenxin Yang^{9,11}, Omer Sinan Sarac¹², Ismail Bilgen¹², Ali Tugrul Balci¹², Rahul Kumar¹³, Sandeep Kumar Dhanda¹⁴

Affiliations

¹ PMI R&D, Philip Morris Products S.A. Quai Jeanrenaud 5, 2000 Neuchatel, Switzerland (part of Philip Morris International group of companies).
² Istituto di Calcolo e Reti ad Alte Prestazioni CNR, Via P. Castellino, 111 80131 Napoli, Italy.
³ Perinatology Research Branch, NICHD/NIH/DHHS, Bethesda, MD, and Detroit, MI, 48201, USA.
⁴ Department of Obstetrics and Gynecology, University of Michigan, Ann Arbor, MI, 48109, USA.
⁵ Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, 48825, USA.
⁶ Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI, 48201, USA.
⁷ Department of Obstetrics and Gynecology, Wayne State University School of Medicine, Detroit, MI, USA.
⁸ Department of Computer Science, Wayne State University College of Engineering, Detroit, MI, USA.
⁹ SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China.
¹⁰ Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China.
¹¹ School of Mathematics Sciences, Shanghai Jiao Tong University, Shanghai, China.
¹² Istanbul Technical University, Istanbul, Turkey.
¹³ Institute of Microbial Technology, Sector 39A, Chandigarh, 160036, India.
¹⁴ La Jolla Institute for Allergy and Immunology, 9420, Athena Circle, La Jolla, CA, 92037, USA.

PMID: 30221212
PMCID: PMC6136260
DOI: 10.1016/j.comtox.2017.07.004

Comput Toxicol. 2018 Feb;5:38-51. Epub 2017 Jul 14.

Abstract

Cigarette smoking entails chronic exposure to a mixture of harmful chemicals that trigger molecular changes over time, and is known to increase the risk of developing diseases. Risk assessment in the context of 21st century toxicology relies on the elucidation of mechanisms of toxicity and the identification of exposure response markers, usually from high-throughput data, using advanced computational methodologies. The sbv IMPROVER Systems Toxicology computational challenge (Fall 2015-Spring 2016) aimed to evaluate whether robust and sparse (≤40 genes) human (sub-challenge 1, SC1) and species-independent (sub-challenge 2, SC2) exposure response markers (so-called gene signatures) could be extracted from human and mouse blood transcriptomics data of current (S), former (FS) and never (NS) smoke-exposed subjects as predictors of smoking and cessation status. Best-performing computational methods were identified by scoring anonymized participants' predictions. Worldwide participation resulted in 12 (SC1) and six (SC2) final submissions that qualified for scoring. The results showed that blood gene expression data were informative for predicting smoking exposure status (i.e., discriminating smokers from never or former smokers) in humans and across species with a high level of accuracy. By contrast, the prediction of cessation status (i.e., distinguishing FS from NS) remained challenging, as reflected by lower classification performance. Participants successfully developed inductive predictive models and extracted human and species-independent gene signatures, including genes with high consensus across teams. Post-challenge analyses highlighted "feature selection" as a key step in the process of building a classifier and confirmed the importance of testing a gene signature in independent cohorts to ensure the generalized applicability of a predictive model at a population-based level.
In conclusion, the Systems Toxicology challenge demonstrated the feasibility of extracting a consistent blood-based smoke exposure response gene signature and further stressed the importance of independent and unbiased data and method evaluations to provide confidence in systems toxicology-based scientific conclusions.

Keywords: Systems toxicology; blood biomarkers; computational challenge; gene signature; smoking biomarker.

Figures

**Figure 1. Overview of the Systems Toxicology computational challenge**
(a) Human and mouse blood samples were collected from smokers (S) and non-current smokers (NCS) (mouse: exposed and non-exposed) and gene expression was measured. Classification approaches were developed by the participants to identify exposed and non-exposed subjects. (b) Human and mouse samples were divided into training (H1 and M1a) and test (H2 and M2a) datasets. Training datasets and class labels were released to allow participants to train their models. Test datasets (including mock samples) were released in two subsets. Participants were asked to provide their predictions on the first subset before the second subset was released. Participants had to apply their models to predict the class labels for the samples in the test set.

**Figure 2. Participants’ prediction performances and final ranking**
(a, c) Participants’ scores (x-axis) relative to the null distribution (density curves) calculated from 10,000 random predictions. Dark blue and dark red (a-up, c-up) refer to the smoker (S) vs non-current smoker (NCS) task for area under the precision-recall curve (AUPR) and Matthews correlation coefficient (MCC), respectively. Blue and red (a-down, c-down) refer to the former smoker (FS) vs never smoker (NS) task for AUPR and MCC, respectively. Blue and red circles identify participants’ scores. Vertical dashed lines indicate the scores with P-values of 0.05 (smaller P-values are to the right of the dashed line). (b, d) Bar plots showing the sum of ranks (y-axis, left scale) and the average rank (y-axis, right scale) across all metrics and tasks for all teams for SC1 (b) and SC2 (d). A lower sum of ranks implies better performance.
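The scoring scheme described in this caption can be illustrated with a minimal Python sketch. It is not the challenge's scoring code: the way random predictions are drawn (uniform coin flips per sample), the add-one correction in the P-value, the tie handling in the ranking, and all function names are assumptions made for illustration. Only MCC is shown; AUPR would be handled the same way.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def null_scores(y_true, n_random=10_000, seed=0):
    """MCC scores of random predictions (assumption: fair coin flip per sample)."""
    rng = random.Random(seed)
    return [mcc(y_true, [rng.randint(0, 1) for _ in y_true]) for _ in range(n_random)]


def empirical_p_value(observed, null):
    """Fraction of null scores at least as high as the observed score (add-one corrected)."""
    return (1 + sum(1 for s in null if s >= observed)) / (1 + len(null))


def sum_of_ranks(score_table):
    """score_table maps team -> list of scores (higher is better), one per metric/task.

    Returns team -> sum of per-column ranks (rank 1 = best). Ties are broken by
    insertion order here, which is a simplification.
    """
    teams = list(score_table)
    n_cols = len(next(iter(score_table.values())))
    totals = {team: 0 for team in teams}
    for j in range(n_cols):
        ordered = sorted(teams, key=lambda t: score_table[t][j], reverse=True)
        for rank, team in enumerate(ordered, start=1):
            totals[team] += rank
    return totals
```

A team's score would be compared with `null_scores(...)` computed on the same gold-standard labels, and the per-metric ranks summed so that a lower total indicates better overall performance, as in panels (b) and (d).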

**Figure 3. Exposure class predictions by top performers and across all teams**
Box plot showing the distribution of confidence scores (and median confidence scores for “All teams”) for samples belonging to different exposure classes. The higher the value (closer to 1), the higher the confidence that a subject is a smoker. Low values imply high confidence that the subject is a non-current smoker (NCS; i.e., former smoker (FS) or never smoker (NS)). (a) SC1: Smoker (S) vs NCS confidence score distributions for the three best-performing teams in SC1, and average confidence score distribution for all teams. (b) SC1: FS vs NS confidence score distributions for the three best-performing teams, and average confidence score distribution for all teams. (c) SC2: S vs NCS (human) and 3R4F (exposed) vs NCS (non-exposed) (mouse) confidence score distributions for the three best-performing teams in SC2, and average confidence score distribution for all teams. (d) SC2: FS vs NS (human) and cessation (Cess) vs Sham (mouse) confidence score distributions for the three best-performing teams, and average confidence score distribution for all teams. [t-test P-value, ‘.’ <0.1, ‘*’ <0.05, ‘**’ <0.01, ‘***’ <0.001, (‘−’ ≥0.1)].

**Figure 4. Sample misclassifications**
Sub-challenge 1 (SC1) (a, c) and SC2 (b, d) misclassifications shown as heatmaps. Teams are in columns arranged in decreasing order of performance from left to right. Subjects are in rows with the class label color as sidebar (smoker (S), former smoker (FS), never smoker (NS), cessation (Cess)). Rows were clustered per class according to the binary distance between rows. Cells in green correspond to subjects correctly classified; cells in ochre correspond to misclassifications. White cells indicate the absence of a prediction. Horizontal bars show the number of teams that misclassified the subject in each row. (e) Number of years since an FS quit smoking (x-axis) vs number of times the FS was misclassified. Linear and quadratic model fits are reported.

**Figure 5. Expression fold changes in the test dataset and co-occurrences of genes from consensus smoking exposure and cessation signatures**
Differential gene expression heatmaps for the test datasets for (a, b) SC1 and (c, d) SC2. Subjects are in columns and grouped per class. Smokers (S) are in red (3R4F in light red), former smokers (FS) are in green (cessation (Cess) in light green), and never smokers (NS) are in blue (Sham in light blue). The respective control groups are annotated as ctr. (a) SC1: S vs NCS (ctr: FS+NS). (c) SC2: S vs NCS and 3R4F vs NCS (ctr: Cess+Sham). (b) SC1: FS vs NS. (d) SC2: FS vs NS and Cess vs Sham. Lengths of horizontal bars are proportional to the number of times a gene is selected as part of a signature. Gray bars denote genes for which the fold change (FC) is statistically significant (FDR <0.05).

**Figure 6. Performance versus signature size and gene similarity**
(a) Matthews correlation coefficient (MCC) score versus gene signature size for cross-validation (CV) and test datasets. Features were selected from the list of (i) “Top” genes (orange), i.e., genes selected frequently by participants as part of the signature; (ii) “DEGs” (green), the list of differentially expressed genes; (iii) “All Genes” (light blue), all measured genes. (b) MCC performance versus coefficient of similarity between genes in the signature. Seven different machine learning classifiers were tested: random forest (RF), support vector machine with linear kernel (svmLinear), partial least squares discriminant analysis (PLS), naive Bayes (NB), k-nearest neighbor (kNN), linear discriminant analysis (LDA), and logistic regression (LR). (c) Distributions of MCC scores in CV (orange) and test set (green) data, plus distribution of the differences (light blue), for “Top” (top), “DEGs” (middle), and “All genes” (bottom) selections.

**Figure 7. Score versus signature size (≥25 genes)**
Performance of the ensemble learner (top four methods from cross-validation). Accuracy (blue), area under the precision-recall curve (AUPR) (orange), and Matthews correlation coefficient (MCC) (gray) scores degraded when gene signature length increased above 40 genes.

Grants and funding

N01 HD023342/HD/NICHD NIH HHS/United States
