Nat Commun. 2024 Jun 11;15(1):4596.
doi: 10.1038/s41467-024-48666-7.

Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unannotated pathology slides

Adalberto Claudio Quiros et al. Nat Commun. 2024.

Abstract

Cancer diagnosis and management depend upon the extraction of complex information from microscopy images by pathologists, which requires time-consuming expert interpretation prone to human bias. Supervised deep learning approaches have proven powerful, but are inherently limited by the cost and quality of annotations used for training. Therefore, we present Histomorphological Phenotype Learning, a self-supervised methodology requiring no labels and operating via the automatic discovery of discriminatory features in image tiles. Tiles are grouped into morphologically similar clusters which constitute an atlas of histomorphological phenotypes (HP-Atlas), revealing trajectories from benign to malignant tissue via inflammatory and reactive phenotypes. These clusters have distinct features which can be identified using orthogonal methods, linking histologic, molecular and clinical phenotypes. Applied to lung cancer, we show that they align closely with patient survival, with histopathologically recognised tumor types and growth patterns, and with transcriptomic measures of immunophenotype. These properties are maintained in a multi-cancer study.

Conflict of interest statement

The authors declare the following competing interests: A.T. is a co-founder of Imagenomix; N.C. is a scientific advisor for Imagenomix. The other authors declare that they have no competing interests.

Figures

Fig. 1. Overview of Histomorphological Phenotype Learning (HPL) framework architecture.
A Whole slide images (WSIs) are processed for tile extraction and stain normalization. B The self-supervised training of backbone network fθ creates tile vector representations. C Tiles are projected into z vector representations using the frozen backbone network fθ. Subsequently, Histomorphological Phenotype Clusters (HPCs) are defined using Leiden community detection over a nearest neighbor graph of z tile vector representations. D WSIs or patients (one or more WSIs per patient) are each represented by a compositional vector whose dimensionality equals the number of HPCs and whose entries give the percentage of each HPC with respect to the total tissue area. HPL creates WSI and patient compositional vector representations that can be easily used in interpretable models such as logistic regression or Cox regression, relating tissue phenotypes with clinical annotations. Source data are provided as a Source Data file.
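The compositional vector in panel D can be illustrated with a minimal sketch. It assumes that every tile covers the same tissue area, so that tile counts stand in for area, and that HPCs are numbered from 0; the function name `compositional_vector` and the example slide are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def compositional_vector(tile_hpcs, n_hpcs):
    """Return the fraction of tiles assigned to each HPC (ids 0..n_hpcs-1).

    Assumption: all tiles cover equal tissue area, so the tile fraction
    approximates the percentage of tissue area per HPC.
    """
    if not tile_hpcs:
        raise ValueError("need at least one tile")
    counts = Counter(tile_hpcs)
    total = len(tile_hpcs)
    return [counts.get(k, 0) / total for k in range(n_hpcs)]


# Hypothetical slide: 8 tiles, each labeled with the HPC it was assigned to.
slide_tiles = [0, 0, 1, 3, 3, 3, 3, 1]
vector = compositional_vector(slide_tiles, n_hpcs=4)
print(vector)
```

A patient with several WSIs could be handled the same way by pooling the tile labels of all slides before counting; the entries always sum to 1, which is what makes the vector usable as input to a regression model.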
Fig. 2. HPCs from Lung adenocarcinoma show consistent enrichment in histomorphological phenotypes.
A Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction of lung adenocarcinoma tile vector representations labeled by HPC membership (each HPC was assigned a different color for easier visualization). B Percentage of patients from the TCGA cohorts associated with each HPC (100% corresponding to 452 patients). The shades of green are proportional to the percentages (y-axis). C Percentage of institutions associated with each HPC (100% corresponding to 33 institutions). The shades of green are proportional to the percentages (y-axis). D Consensus annotations of each HPC after visual inspection by a panel of 3 expert pathologists of 100 random tiles from each HPC. Stars for detailed consensus indicate the number of agreeing pathologists for the predominant tissue component (a given growth pattern/non-tumor element, see details in Methods - Cluster Histological Assessment), while the number of stars for patients and institutions quality control (QC) relates to panels B and C (3 stars for a percentage above 50%, 2 stars for above 25%, and 1 star for below 25%). Labels were then projected back to the UMAP in panels E-G. Visual representations of E the distribution of the different tissue categories, F the epithelium:stroma ratio, and G the extent of lymphocytic infiltration are displayed on the UMAP. Source data are provided as a Source Data file.
Fig. 3. Consensus description of HPCs enriched in nontumor phenotypes with their representative tiles.
A HPCs enriched with normal and reactive parenchyma. B HPCs enriched with stroma and other specialized tissues. We highlight tile vector representations of HPCs for each of the nontumor phenotypes in A and B. HPCs of interest are colored as in Fig. 2A, while other HPCs remain grey. Consensus was obtained after independent annotations of HPCs by 3 pathologists as described in the Methods section - Cluster Histological Assessment. More examples of tiles for each HPC can be seen in Supplementary Figs. 3-4. Source data are provided as a Source Data file.
Fig. 4. Consensus description of HPCs enriched in tumor phenotypes with their representative tiles.
A HPC enriched with classical adenocarcinoma appearances. B HPCs enriched in variant adenocarcinoma appearances. HPCs of interest are colored as in Fig. 2A, while other HPCs remain grey. Consensus was obtained after independent annotations of HPCs by 3 pathologists as described in the Methods section - Cluster Histological Assessment. More examples of tiles for each HPC can be seen in Supplementary Figs. 3-4. Source data are provided as a Source Data file.
Fig. 5. Whole slide images of lung adenocarcinoma with HPC overlays.
We display tumors from three representative TCGA patients. A corresponds to patient TCGA-80-5608, who was censored at a 7-year follow-up time, B corresponds to patient TCGA-38-4625, who was censored at an 8-year follow-up time, and C corresponds to patient TCGA-50-5931, who died 14 months after surgery. For each patient we show the original tile images (including tiles with at most 60% of background), the same tiles overlaid with a color code representing HPCs, and a legend with the percentage of tiles assigned to a given HPC; we display the 10 most prevalent HPCs per patient. Source data are provided as a Source Data file.
Fig. 6. Lung adenocarcinoma (LUAD) recurrence-free survival analysis by HPL.
A High and low risk groups showing statistical significance (p value = 7.26 × 10⁻⁶ < 0.05, log-rank test). For each fold in the 5-fold cross-validation we defined the high and low risk group threshold by taking the median risk value of the train set, and we divided the test set into high and low risk based on this value. Since the test sets are non-overlapping across the 5 folds, at the end of the cross-validation all samples had been labeled as high or low risk based on the test sets of each fold. Error bars on the survival plots represent the 95% CI. B Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction of patient vector representations for the NYU cohort; each representation is labeled according to the risk group for recurrence, low-risk (blue) or high-risk (orange). C SHAP (SHapley Additive exPlanations) plot. D Top relevant HPCs associated with high risk of recurrence. E Top relevant HPCs associated with lower risk of recurrence. F Example of a decision plot for a patient slide classified as high risk of recurrence. We focus on HPCs that contain at least 10% of the total patients, with the aim of finding tissue patterns that can generalize across the cohort. Source data are provided as a Source Data file.
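The risk-group assignment described in panel A can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: risk scores are taken as already computed by some fitted model, the data layout (a list of train scores paired with a dictionary of test scores per fold) is hypothetical, and a score exactly equal to the median is placed in the low-risk group, a tie rule the caption does not specify.

```python
from statistics import median


def assign_risk_groups(folds):
    """Label every test sample as 'high' or 'low' risk.

    folds: list of (train_risks, test_risks) pairs, where train_risks is a
    list of risk scores and test_risks maps a sample id to its risk score.
    The threshold of each fold is the median risk of that fold's train set.
    Assumption: scores equal to the threshold are labeled 'low'.
    """
    labels = {}
    for train_risks, test_risks in folds:
        threshold = median(train_risks)
        for sample_id, risk in test_risks.items():
            labels[sample_id] = "high" if risk > threshold else "low"
    return labels


# Hypothetical two-fold example with made-up risk scores.
folds = [
    ([0.1, 0.4, 0.9], {"p1": 0.5, "p2": 0.2}),    # train median 0.4
    ([0.3, 0.6, 0.8], {"p3": 0.55, "p4": 0.9}),   # train median 0.6
]
labels = assign_risk_groups(folds)
print(labels)
```

Because the test sets do not overlap, each sample receives exactly one label, and the pooled labels can then be compared with a log-rank test.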
Fig. 7. Lung adenocarcinoma (LUAD) survival analysis and Histomorphological Phenotype Cluster (HPC) correlations.
A Bi-hierarchical clustering of HPCs and immune signatures with correlations from red (positive correlations) to blue (negative). Cox coefficients for overall and recurrence-free survival are colored from purple (favoring death or recurrence) to green (favoring survival or no recurrence). HPCs are colored based on histological assessment of lymphocytic infiltration: dark red: enrichment in severe infiltration; light red: moderate infiltration; light blue: mild infiltration; dark blue: very sparse infiltration; grey: other HPCs. B Bi-hierarchical clustering of HPCs and cell type over-representations (red) and under-representations (blue). C Bi-hierarchical clustering of HPCs and LUAD histological subtype enrichment (red) or depletion (blue). In all panels, the column dendrogram corresponds to the bi-hierarchical clustering of HPCs and immune signatures, so that these analyses can be related more easily in the same context. HPCs associated with poor and good outcome and associated hazard ratios (top rows) come from the Cox regression analysis shown in Supplementary Fig. 6 and Fig. 6 (see Methods - Cluster characterizations). HPCs associated with better survival outcomes show positive correlations with severe to moderate lymphocytic infiltration and RNASeq signatures of tumor infiltrating leukocytes (TIL), lymphocyte infiltration signature score, T-cell receptors (TCR), and macrophage regulation; and show over-representations of inflammatory, dead, and neoplastic cells. HPCs associated with worse survival outcomes contain mostly mild lymphocytic infiltration content and show positive correlations with proliferation, mutation rate, homologous recombination defects, and wound healing signatures, under-representations for inflammatory and dead cells, and enrichment for solid histological patterns.
D Scatter plot between HPC 1 contribution and omic-based immune signatures of each patient (tumor infiltrating leukocytes (TIL) and leukocyte fraction), with representative HPC 1 tiles from TCGA and NYU1 cohorts. E Scatter plot between HPC 15 contribution and omic-based immune signatures of each patient (proliferation and Th2 cells), with representative HPC 15 tiles from TCGA and NYU1 cohorts. Two-sided Spearman correlation was used for panels D and E. F Uniform Manifold Approximation and Projection dimensionality reduction of the vector representations of the 224 × 224 tissue tiles; each tile label corresponds to the cluster cell type enrichment. Source data are provided as a Source Data file.
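The Spearman correlation used in panels D and E relates the per-patient contribution of an HPC to an omic-based signature. A minimal sketch of the coefficient itself is given here, computed as the Pearson correlation of average ranks; it does not compute the two-sided p value, and the helper names and the example numbers are hypothetical rather than data from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical values: HPC contribution per patient vs. an immune signature.
hpc_contribution = [0.05, 0.10, 0.20, 0.40]
immune_signature = [1.0, 1.5, 3.0, 2.9]
rho = spearman(hpc_contribution, immune_signature)
print(round(rho, 3))
```

Working on ranks makes the measure insensitive to the scale of either variable, which suits a comparison between tissue percentages and transcriptomic scores.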
Fig. 8. Multi-cancer HPL pipeline and main enrichments of the resulting HPCs.
A Multi-cancer pipeline: the 10 selected cancer types with sample sizes from 232 to 1011 patients (left) were fed to the HPL pipeline (middle), leading to 34 HPCs (right). B Example of 4 transcriptomic immune features which were highly correlated with specific HPCs as identified through Spearman correlation, and visualization on the UMAPs of tiles from HPCs highly enriched (in red) and depleted (in blue) in C TIL regional fraction, D proliferation, E TGF-beta response, F macrophage regulation, G wound healing and H stromal fraction. Source data are provided as a Source Data file.
Fig. 9. Correlation of multicancer HPCs with immune signatures and survival.
A Full Spearman correlation analysis between HPCs and immune features. The correlation values are shown in the graph if they were statistically significant (p value < 0.01, two-sided Spearman correlation) and are displayed in red for enrichment and blue for depletion. Two groups of HPCs showing good outcome and two showing poor outcome are highlighted. Representative tiles from those HPCs are shown in Supplementary Figs. 24, 27. B Mean C-index for survival analysis of each HPC and cancer type over a 5-fold cross-validation; values below 0.5 (blue) indicate that a higher percentage of the HPC favors longer survival (good outcome), while those above 0.5 (red) indicate that a higher percentage of the HPC favors shorter survival (poor outcome). Four hierarchical clusters are almost exclusively associated either with C-indexes below 0.5 or with C-indexes above 0.5, and are highlighted by the multi-cancer poor or good outcome black boxes. Only statistically significant values of the log-rank test of the high and low-risk groups are displayed (p value < 0.05). Supplementary Fig. 23 includes mean and 95% confidence intervals of the C-index values over the 5-fold cross-validation per cancer type. See Methods - Cluster Characterizations for computational details. Source data are provided as a Source Data file.
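The reading of the C-index in panel B (below 0.5 for good outcome, above 0.5 for poor outcome) can be made concrete with a small sketch of Harrell's concordance index. It assumes that the percentage of a single HPC is used directly as the risk score; the function name and the four-patient example are hypothetical, and the paper's own computation may differ (see its Methods).

```python
def concordance_index(times, events, scores):
    """Harrell's C-index for right-censored survival data.

    times:  follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    scores: risk score per patient (here: percentage of one HPC)
    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0   # earlier event had the higher score
                elif scores[i] == scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: a larger HPC percentage goes with longer survival.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
hpc_pct = [0.1, 0.2, 0.3, 0.4]
c_good = concordance_index(times, events, hpc_pct)        # below 0.5
c_poor = concordance_index(times, events, hpc_pct[::-1])  # above 0.5
print(c_good, c_poor)
```

In this toy cohort the HPC percentage rises with survival time, so the index falls to its minimum; reversing the percentages gives the maximum, matching the blue and red ends of the scale in the caption.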
