Nat Commun. 2020 Aug 3;11(1):3877.
doi: 10.1038/s41467-020-17678-4.

A deep learning model to predict RNA-Seq expression of tumours from whole slide images

Benoît Schmauch et al. Nat Commun. 2020.

Abstract

Deep learning methods for digital pathology analysis are an effective way to address multiple clinical questions, from diagnosis to prediction of treatment outcomes. These methods have also been used to predict gene mutations from pathology images, but no comprehensive evaluation of their potential for extracting molecular features from histology slides has yet been performed. We show that HE2RNA, a model based on the integration of multiple data modes, can be trained to systematically predict RNA-Seq profiles from whole-slide images alone, without expert annotation. Through its interpretable design, HE2RNA provides virtual spatialization of gene expression, as validated by CD3 and CD20 staining on an independent dataset. The transcriptomic representation learned by HE2RNA can also be transferred to other datasets, even small ones, to increase prediction performance for specific molecular phenotypes. We illustrate the use of this approach for clinical diagnosis purposes such as the identification of tumors with microsatellite instability.

Conflict of interest statement

The authors declare the following competing interests: Employment: B.S., A.R., E.P., C.S., A.K., M.S., S.T., M.Z., T.C, M.M., P.C., G.W. are employed by Owkin, Inc. Advisory: J.C. reports consulting fees at Owkin, Inc. The remaining authors declare no competing interests.

Figures

Fig. 1. Graphical abstract: Transcriptomic learning for digital pathology.
Hematoxylin & eosin (H&E)-stained histology slides and RNA-Seq data (FPKM-UQ values) for 28 different cancer types and 8725 patients were collected from The Cancer Genome Atlas (TCGA) and used to train the neural network HE2RNA to predict the transcriptomic profile from the corresponding high-definition whole-slide images (WSI). During this task, the neural network learned an internal representation encoding information from both tiled images and gene expression levels. This transcriptomic representation can be used for: (1) Transcriptome prediction from images without associated RNA sequencing. (2) The virtual spatialization of transcriptomic data. For each predicted coding or noncoding gene, a score is calculated for each tile of the corresponding WSI, which can be interpreted as the predicted gene expression for this tile (even though the real value is available only for the whole slide). These predictive scores can be used to generate heatmaps for each gene whose expression is significantly predicted. (3) Improving predictive performance on different tasks in a transfer learning framework, as shown here in a realistic setup for microsatellite instability (MSI) status prediction from non-annotated WSIs. Scale bar: 5 mm.
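The virtual spatialization step in the caption above amounts to placing one predicted score per tile back at that tile's position on the slide. The following is a minimal sketch of that idea only, not the authors' implementation: the function name `build_heatmap`, the (row, column) tile indexing, and the example scores are all hypothetical.

```python
def build_heatmap(tile_scores, n_rows, n_cols, background=None):
    """Arrange per-tile scores on a 2D grid matching the slide's tile layout.

    tile_scores maps (row, col) tile indices to a predicted score for one gene.
    Grid cells without a tile (e.g. no tissue) keep the background value.
    """
    grid = [[background] * n_cols for _ in range(n_rows)]
    for (row, col), score in tile_scores.items():
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            raise ValueError(f"tile ({row}, {col}) lies outside the grid")
        grid[row][col] = score
    return grid


# Hypothetical scores for three tissue tiles on a 2 x 3 grid.
scores = {(0, 0): 0.2, (0, 2): 1.5, (1, 1): 0.7}
heatmap = build_heatmap(scores, n_rows=2, n_cols=3)
```

The resulting nested list can be passed to any plotting routine that accepts a 2D array, with background cells masked out.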
Fig. 2. Gene expression prediction results.
a Distribution of Pearson correlation coefficients R averaged over the five folds of cross-validation (left axis, blue violin plots) and number of coding and noncoding genes (right axis, red squares) with Holm–Šidák corrected p-values < 0.05 (one-sided empirical p-value, as described in “Methods”), for 28 cancer types from the TCGA. Black triangles indicate the minimum correlation coefficient required for significance in any given dataset. b The number of coding and noncoding genes significantly well-predicted for a given number of cancer types, with Holm–Šidák corrected p-values < 0.05, as a function of the number of cancers. c Computational pathway analysis with ingenuity pathway analysis (IPA) software of the 156 best-predicted Pan TCGA genes, showing an enrichment in genes associated with immunity and tumor immune infiltration/activity. TCR: T-cell receptor, NK: natural killer. d IPA-based analysis of the more accurately predicted protein-coding genes in the LIHC dataset, showing an enrichment in genes associated with cell cycle and DNA damage response. LIHC: liver hepatocellular carcinoma. e IPA-based analysis of the more accurately predicted protein-coding genes in the BRCA dataset, showing an enrichment in genes associated with cell cycle and DNA damage response. Th cell differentiation: Th1 and Th2 cell differentiation. Th activation pathway: Th1 and Th2 activation pathway, 1ry: Primary. In c–e, red dashed line = −log(p-value = 0.05). BRCA: breast cancer. p-values were calculated using right-tailed Fisher’s exact test.
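For readers unfamiliar with the Holm–Šidák correction mentioned in this caption, here is a minimal sketch of the standard step-down procedure in plain Python. It illustrates the general technique rather than the paper's code; the function name `holm_sidak` and the example p-values are made up for illustration.

```python
def holm_sidak(p_values):
    """Return Holm-Sidak adjusted p-values, in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is corrected for the m - rank
        # hypotheses not yet rejected (Sidak correction at each step).
        corrected = 1.0 - (1.0 - p_values[idx]) ** (m - rank)
        # Enforce monotonicity: adjusted p-values never decrease with rank.
        running_max = max(running_max, corrected)
        adjusted[idx] = running_max
    return adjusted


# Illustrative raw p-values for three genes (not taken from the paper).
raw = [0.01, 0.04, 0.03]
adjusted = holm_sidak(raw)
significant = [p < 0.05 for p in adjusted]
```

In this toy example only the first gene stays below 0.05 after correction, although all three raw p-values were below that level.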
Fig. 3. Prediction of signatures for cancer hallmarks.
a Comparison of correlation scores Rp for each gene pathway defined in Supplementary Table 4 and involved in angiogenesis, hypoxia, DNA repair, cell cycle, and immune responses mediated by B and T cells, with the mean correlation coefficient R0 obtained for 10,000 random lists of the same number of genes, for all 28 cancer types from the TCGA dataset. The indicated statistical significance refers to the probability of obtaining a correlation R > Rp in the distribution of correlations for random lists, for each given cancer type. Insets show the percentages (%) of the different cases of statistical significance between cancer types. The dashed line is the identity line Rp = R0. b As in a, but in terms of percentage of genes considered well-predicted (as defined in the text and in Fig. 2). One-sided empirical p-values computed as described in “Methods” (circle: p = ns, star: p < 0.05, cross: p < 0.01, triangle: p < 0.001, square: p < 0.0001).
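The comparison against random gene lists described in panel a can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes that the pathway score Rp is the mean of per-gene correlations, and the function name `pathway_vs_random`, the gene names, and the correlation values are hypothetical.

```python
import random
from statistics import fmean


def pathway_vs_random(gene_r, pathway, n_random=10_000, seed=0):
    """Compare a pathway's mean correlation with same-size random gene lists.

    gene_r maps gene name -> per-gene correlation R.
    Returns (r_p, r_0, p_value), where p_value is the fraction of random
    lists whose mean correlation exceeds r_p (one-sided empirical p-value).
    """
    rng = random.Random(seed)
    genes = sorted(gene_r)
    r_p = fmean([gene_r[g] for g in pathway])
    random_means = [
        fmean([gene_r[g] for g in rng.sample(genes, len(pathway))])
        for _ in range(n_random)
    ]
    r_0 = fmean(random_means)
    p_value = sum(r > r_p for r in random_means) / n_random
    return r_p, r_0, p_value


# Toy data: 100 genes, three of which (the "pathway") are well predicted.
gene_r = {f"G{i}": 0.1 for i in range(100)}
pathway = ["G0", "G1", "G2"]
for g in pathway:
    gene_r[g] = 0.9
r_p, r_0, p_value = pathway_vs_random(gene_r, pathway, n_random=1000)
```

A point above the identity line Rp = R0 with a small p-value then indicates a pathway predicted better than chance selections of genes.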
Fig. 4. Virtual spatialization of CD3 and CD20 expression, confirmed by immunohistochemistry.
a Top left inset: H&E-stained slides were obtained from a LIHC patient. Main top image: The corresponding heatmap of the expression of CD3-encoding genes predicted by our model. Main bottom image: CD3 immunohistochemistry (IHC) results obtained by washing out H&E stain and staining the same slide for IHC. b Pearson’s coefficient (R = 0.51, p-value < 10⁻⁴, two-tailed Student’s t test) for the correlation between the CD3 expression predicted by our model and the percentage of CD3+ cells actually detected on the IHC slide. The red dashed line indicates the average predicted expression per tile as a function of the number of CD3+ cells; shaded area: s.d. The vertical dashed line indicates the median number of CD3+ cells per tile, and the dotted line the 3rd quartile. c, d Same as in a and b using CD19 and CD20 coding genes and one CD20 IHC (R = 0.23, p-value < 10⁻⁴, two-tailed Student’s t test). Red dashed line: average prediction per tile as a function of the number of CD20+ cells; shaded area: s.d. Vertical dotted line: 95th percentile of the number of CD20+ cells per tile (median = 0). e ROC curves for distinguishing tiles from the HE/CD3 slide with a number of T cells above a given threshold (with threshold values corresponding to the 75th, 90th, 95th, and 99th percentile of the number of T cells per tile), obtained by applying both the T-cell model and the B-cell model. The dashed line is the expected ROC curve from a random classifier. f Same as in e for tiles from the HE/CD20 slide with a number of B cells above a given threshold. Scale bars: 5 mm. One slide was double-stained for each IHC marker.
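The tile-level evaluation in panels e and f can be sketched as follows: tiles are labeled positive when their cell count exceeds a percentile threshold, and the predicted scores are evaluated by the area under the ROC curve. This is a minimal plain-Python illustration, not the authors' code; the nearest-rank percentile definition, the function names, and the synthetic data are assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def roc_auc(scores, labels):
    """AUC as the probability that a positive tile outscores a negative one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Ties between a positive and a negative tile count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_threshold(pred, counts, pcts=(75, 90, 95, 99)):
    """AUC for detecting tiles whose cell count exceeds each percentile."""
    result = {}
    for pct in pcts:
        threshold = percentile(counts, pct)
        labels = [c > threshold for c in counts]
        result[pct] = roc_auc(pred, labels)
    return result


# Synthetic tiles: predicted score grows with the cell count (ideal case).
counts = list(range(100))
pred = [0.01 * c for c in counts]
aucs = auc_by_threshold(pred, counts)
```

With real data the AUC would fall between 0.5 (the random classifier shown as a dashed line) and 1.0, and would be computed separately for each model.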
Fig. 5. Virtual spatialization of epithelium-associated genes (TP63, KRT8, and KRT18) and MKI67 expression.
a Representative H&E slide from the PESO dataset (n = 62 slides with segmentation mask). b Heatmap for the expression of TP63, KRT8, and KRT18 predicted by HE2RNA for the slide in a. c Tiles with highest predicted expression for those genes on this slide, with the segmentation mask of epithelium obtained from an IHC staining of the same slide. S = Score corresponding to the log expression score of each tile; %e = fraction of pixels marked as belonging to the epithelium (n = 21,714 tiles in total). d Spatialization of MKI67 predicted expression on a liver hepatocarcinoma sample from an early stage tumor (BCLC stage A) (n = 284 samples). Left panel: Representative H&E staining, with annotation of tumor (T) and nontumor (N) areas performed by a pathologist. Right panel: Heatmap for the expression of MKI67 predicted by the model. e Same as d, for a sample from an advanced tumor (BCLC stage C) (n = 65 samples). In a, d, e, scale bar: 5 mm. In c, scale bar: 100 µm. d, e are representative slides (n = 369 annotated slides). BCLC: Barcelona Clinic Liver Cancer.
Fig. 6. Prediction of microsatellite instability status using transfer learning from transcriptomic representation.
a Distribution of Pearson correlation coefficients on TCGA-COAD and TCGA-STAD, for microsatellite-stable (MSS) patients (green) or patients with high-level microsatellite instability (MSI-H) (orange). Black triangles (respectively grey squares): minimum correlation required for significance under Holm–Šidák (respectively Benjamini–Hochberg) correction. b Computational analysis of the most accurately predicted genes in MSI-H patients from TCGA-COAD. p-values were calculated using right-tailed Fisher’s exact test. c Setup: in hospital A, a neural network is trained to predict gene expression from WSIs. The internal transcriptomic representation is then used in hospital B to improve MSI status prediction. d Area under the ROC curve (AUC) for the model based on the transcriptomic representation (blue) or directly based on WSIs (red), as a function of the fraction of the TCGA-CRC-DX dataset used in the two hospitals (n = 50 data splits per fraction, averaged over ten different three-fold cross-validations (CVs); solid line and triangles: mean over splits; shaded area: s.d.). Boxplot: distribution of AUCs (500 three-fold CVs) over the whole dataset, for the model based on WSIs (box: interquartile range (IQR); horizontal line: median; whiskers: 1.5 times IQR; triangle: mean; circles: outliers). Star: result from Kather et al.; its location accounts for the different number of patients in the training set with respect to this manuscript; circle: result from the same method with 25% of the data in hospital B (see “Methods”). Dashed line: fractions of the dataset used in panel c. e Boxplots (defined as in panel d) of the distribution of AUCs for MSI status classifiers at hospital B, trained, respectively, on the 256-dimensional transcriptomic representation, WSIs, and 256-dimensional representations given by two autoencoders trained on hospital A and B subsets. Dashed line: result obtained by adapting the method of Kather et al. Circles: average over ten three-fold CVs for each split between the hospitals (n = 50). **p < 0.01, ***p < 0.001, and ****p < 0.0001, two-tailed Wilcoxon test. WSI: whole-slide image. Hospital illustration based on “hospital” by H Alberto Gongora, from thenounproject.com, used under CC-BY 3.0/colored.
