[Preprint]. 2024 Jan 19:2023.09.28.560068.
doi: 10.1101/2023.09.28.560068.

Digital profiling of cancer transcriptomes from histology images with grouped vision attention

Yuanning Zheng et al. bioRxiv.

Abstract

Cancer is a heterogeneous disease that demands precise molecular profiling for better understanding and management. Recently, deep learning has demonstrated potential for cost-efficient prediction of molecular alterations from histology images. While transformer-based deep learning architectures have enabled significant progress in non-medical domains, their application to histology images remains limited due to small dataset sizes coupled with the explosion of trainable parameters. Here, we develop SEQUOIA, a transformer model to predict cancer transcriptomes from whole-slide histology images. To realize the full potential of transformers, we first pre-train the model using data from 1,802 normal tissues. Then, we fine-tune and evaluate the model in 4,331 tumor samples across nine cancer types. The prediction performance is assessed at the individual gene level and the pathway level through Pearson correlation analysis and root mean square error. The generalization capacity is validated across two independent cohorts comprising 1,305 tumors. In predicting the expression levels of 25,749 genes, the highest performance is observed in cancers from breast, kidney and lung, where SEQUOIA accurately predicts the expression of 11,069, 10,086 and 8,759 genes, respectively. The accurately predicted genes are associated with the regulation of inflammatory response, cell cycle and metabolism. While the model is trained at the tissue level, we showcase its potential in predicting spatial gene expression patterns using spatial transcriptomics datasets. Leveraging the prediction performance, we develop a digital gene expression signature that predicts the risk of recurrence in breast cancer. SEQUOIA deciphers clinically relevant gene expression patterns from histology images, opening avenues for improved cancer management and personalized therapies.
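
The two per-gene metrics named in the abstract, Pearson correlation and root mean square error, can be illustrated with a minimal sketch. The function names and toy values below are hypothetical and are not taken from the SEQUOIA code; the sketch only shows what each metric computes for one gene across samples.

```python
import math

def pearson_r(truth, pred):
    """Pearson correlation between ground-truth and predicted expression of one gene."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_t == 0 or var_p == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_t * var_p)

def rmse(truth, pred):
    """Root mean square error between ground-truth and predicted expression."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

# Toy values for a single gene measured in four samples (illustrative only).
truth = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 3.5, 4.5]
print(pearson_r(truth, pred), rmse(truth, pred))
```

The toy example shows why both metrics are reported: a prediction that is offset by a constant is perfectly correlated with the truth yet still has a nonzero RMSE.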

Figures

Fig. 1: Overview of the workflow for the SEQUOIA model.
a) Cancer types on which the SEQUOIA model is developed and validated. The panel is created with BioRender.com. b) The model is trained and evaluated using matched WSIs and bulk RNA-Seq data from nine cancer types available in the TCGA database. To pretrain the transformer encoder, we use matched WSIs and gene expression data of normal tissues from the GTEx database. The model is independently validated using data from the CPTAC and Tempus cohorts. Apart from predicting tissue-level gene expression, we integrate a spatial prediction technique that elucidates region-level gene expression patterns within tumor tissues, validated using a spatial transcriptomics dataset [5]. Clinical utility is demonstrated by evaluating the model’s capacity to predict cancer recurrence. c) SEQUOIA architecture. First, N tiles are sampled from the WSI, and a feature vector is extracted from each tile using a ResNet-50 module pretrained on ImageNet. We then cluster the feature vectors into K clusters, and an average feature vector is obtained from each cluster, resulting in K aggregated feature vectors. Next, a transformer encoder and dense layers translate the obtained K feature vectors to predicted gene expression values. d) Performance of SEQUOIA compared to HE2RNA. For both architectures, we show the performance when trained from scratch and when finetuning from a model pretrained on normal tissues. Violin plots illustrate the distribution of Pearson correlation coefficients (left y axis) between the predicted and ground truth gene expression values in TCGA test sets. The top 1,000 genes with the highest correlation coefficients in each architecture are shown. Black squares indicate the absolute number (right y axis) of genes with significantly well-predicted expression levels. e) Distribution of RMSE values between the ground truth and predicted gene expression levels in TCGA test sets. WSI: whole-slide images; RMSE: root mean square error.
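
The aggregation step in panel c, where tile feature vectors are grouped into K clusters and averaged per cluster, can be sketched as follows. This is a minimal illustration that assumes the cluster labels have already been computed by some clustering method (the caption does not name one); `average_by_cluster` and the zero-vector handling of empty clusters are assumptions of this sketch, not the authors' implementation.

```python
def average_by_cluster(features, labels, k):
    """Average the tile feature vectors within each of k clusters.

    features: list of N equal-length feature vectors (one per tile)
    labels:   list of N cluster indices in range(k), assumed precomputed
    Returns k aggregated vectors; an empty cluster yields a zero vector
    (an assumption made here for simplicity).
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for vec, lab in zip(features, labels):
        counts[lab] += 1
        for j, value in enumerate(vec):
            sums[lab][j] += value
    return [
        [s / counts[c] if counts[c] else 0.0 for s in sums[c]]
        for c in range(k)
    ]

# Three toy 2-dimensional tile features assigned to two clusters.
tiles = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
aggregated = average_by_cluster(tiles, labels=[0, 0, 1], k=2)
print(aggregated)
```

Reducing N tiles to K averaged vectors fixes the input length seen by the transformer encoder regardless of how many tiles a slide contains.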
Fig. 2: Evaluation of gene expression predictions at the pathway level.
a) Heatmap showing significant P values obtained from hypergeometric tests in gene ontology analysis of the well-predicted genes. Color and size of the circles represent the negative log-transformed P values. Integers represent the absolute gene count in each category, and non-significant categories are left blank. b) Circos plot showing the enriched biological processes associated with the well-predicted genes in GBM. Gene names are displayed on the left and the corresponding biological processes are shown on the right. c) Heatmap showing the significant P values of the KEGG pathways across cancer types. Color and size of the circles represent the negative log-transformed P values. Integers represent the absolute gene count in each category, and non-significant categories are left blank. d) Circos plot showing the KEGG pathways associated with the well-predicted genes in COAD. Gene names are displayed on the left and the corresponding pathways on the right. e) Violin plots illustrating the distribution of Pearson correlation coefficients (left y axis) between the predicted and ground truth pathway enrichment scores in TCGA test sets. The top 100 pathways with the highest correlation coefficients in each model are shown. Black squares indicate the absolute number (right y axis) of pathways with significantly well-predicted enrichment scores in each cancer. f) Violin plots illustrating the distribution of RMSE values between the predicted and ground truth pathway enrichment scores in TCGA test sets. The top 100 pathways with the lowest RMSE values in each model are shown. RMSE: root mean square error.
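
The hypergeometric test mentioned in panel a is the standard over-representation test: given a background of N genes of which K belong to a category, it asks how likely it is to see at least k category members among n well-predicted genes. The sketch below is a generic illustration with toy numbers; the function name is hypothetical, and the background set and software used by the authors are not stated in the caption.

```python
from math import comb

def hypergeom_pvalue(k, N, K, n):
    """Upper-tail probability P(X >= k) for a hypergeometric draw.

    N: number of background genes
    K: background genes annotated to the category
    n: number of well-predicted genes (the draw)
    k: well-predicted genes annotated to the category (the overlap)
    """
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Toy example: 10 background genes, 5 in the category, 5 drawn, all 5 overlap.
print(hypergeom_pvalue(k=5, N=10, K=5, n=5))
```

A small P value indicates that the category is over-represented among the well-predicted genes relative to chance.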
Fig. 3: Characterization of the genes validated in external cancer cohorts.
a) The number of genes validated in the CPTAC cohort. Percentages in parentheses indicate the proportion of significant genes discovered in the TCGA cohort that were validated in the CPTAC cohort. b) Heatmap showing the significant P values from the gene ontology analysis of the validated genes. Color and size of the circles represent the negative log-transformed P values. Integers represent the absolute gene count in each category, and non-significant categories are left blank. c) Heatmap showing the significant P values from the KEGG analysis of the validated genes. Color and size of the circles represent the negative log-transformed P values. Integers represent the absolute gene count in each category, and non-significant categories are left blank. d) Circos plot showing the enriched biological processes associated with the validated genes in lung adenocarcinoma. P values were adjusted for multiple testing using the Benjamini–Hochberg method.
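
The Benjamini–Hochberg adjustment named in this caption can be sketched in a few lines. This is the textbook step-up procedure written from scratch for illustration; the function name is hypothetical and the authors' actual tooling is not stated here.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy P values for four categories (illustrative only).
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Each raw P value is scaled by m / rank, and the running minimum guarantees that adjusted values never decrease as raw P values increase.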
Fig. 4: Development and validation of a digital signature for predicting breast cancer recurrence.
a) Kaplan-Meier curves of recurrence-free survival obtained from the TCGA discovery dataset. Patients were split by the median risk score. b) Kaplan-Meier curves of recurrence-free survival in the SCANB validation dataset. c) Circos plot showing the biological processes associated with the prognostic gene signature. Gene names and the associated risk coefficients are shown on the left and the corresponding biological processes are shown on the right. d) Kaplan-Meier curves of recurrence-free survival obtained from the predicted gene expression values in the TCGA dataset. Patients were split by the median risk score. HR: hazard ratio.
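
Splitting patients by the median risk score, as described for the Kaplan-Meier curves, can be sketched as below. The function name and the convention that scores equal to the median fall into the low-risk group are assumptions of this sketch; the caption does not specify how ties are handled.

```python
from statistics import median

def split_by_median(risk_scores):
    """Label each patient 'high' or 'low' risk relative to the median score.

    Scores equal to the median are labelled 'low' (an assumption of this sketch).
    """
    cutoff = median(risk_scores)
    return ["high" if score > cutoff else "low" for score in risk_scores]

# Toy risk scores for four patients (illustrative only).
groups = split_by_median([0.1, 0.5, 0.9, 0.3])
print(groups)
```

The two resulting groups are what a survival analysis would then compare, for example with a log-rank test or a hazard ratio.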
Fig. 5: Spatial visualization of gene expression predicted at the tile level.
a) Whole-slide image thumbnails from the validation cohort. b) Examples of genes that are well-predicted spatially within slides, with predicted spatial gene expression on the left and ground truth on the right. The prediction and ground truth maps were normalized to percentile scores between 0–100. c) Examples of genes that are spatially well-predicted across several slides. Each row shows the prediction map (on the left) and ground truth (on the right) for a particular gene across four slides. d) Heatmap showing the correlation coefficients of meta-gene modules that define the transcriptional subtype and proliferation state of GBM cells. e) Spatial organization of the predicted transcriptional subtypes within different slides. Transcriptional subtypes were assigned based on the meta-gene module showing the highest prediction values.
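
The subtype assignment rule in panel e, picking the meta-gene module with the highest predicted value, amounts to an argmax per tile. The sketch below uses placeholder module names (`module_A`, `module_B`) and toy scores; the actual module names and score scale are not given in the caption.

```python
def assign_subtype(module_scores):
    """Return the name of the meta-gene module with the highest predicted value."""
    return max(module_scores, key=module_scores.get)

# Toy per-tile predictions for two hypothetical meta-gene modules.
tiles = [
    {"module_A": 0.2, "module_B": 0.7},
    {"module_A": 0.9, "module_B": 0.1},
]
subtypes = [assign_subtype(scores) for scores in tiles]
print(subtypes)
```

Applying the rule tile by tile yields one label per tile, which can then be drawn back onto the slide as a spatial subtype map.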

References

    1. Hausser J., Alon U.: Tumour heterogeneity and the evolutionary trade-offs of cancer. Nature Reviews Cancer 20(4), 247–257 (2020)
    2. Cancer Genome Atlas Research Network: Comprehensive molecular profiling of lung adenocarcinoma. Nature 511(7511), 543 (2014)
    3. Vasaikar S., Huang C., Wang X., Petyuk V.A., Savage S.R., Wen B., Dou Y., Zhang Y., Shi Z., Arshad O.A., et al.: Proteogenomic analysis of human colon cancer reveals new therapeutic opportunities. Cell 177(4), 1035–1049 (2019)
    4. Zheng Y., Luo L., Lambertz I.U., Conti C.J., Fuchs-Young R.: Early dietary exposures epigenetically program mammary cancer susceptibility through igf1-mediated expansion of the mammary stem cell compartment. Cells 11(16), 2558 (2022)
    5. Ravi V.M., Will P., Kueckelhaus J., Sun N., Joseph K., Salié H., Vollmer L., Kuliesiute U., Ehr J., Benotmane J.K., et al.: Spatially resolved multi-omics deciphers bidirectional tumor-host interdependence in glioblastoma. Cancer Cell 40(6), 639–655 (2022)
