Nat Med. 2018 Oct;24(10):1559-1567.
doi: 10.1038/s41591-018-0177-5. Epub 2018 Sep 17.

Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning

Nicolas Coudray et al. Nat Med. 2018 Oct.

Abstract

Visual inspection of histopathology slides is one of the main methods used by pathologists to assess the stage, type and subtype of lung tumors. Adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) are the most prevalent subtypes of lung cancer, and their distinction requires visual inspection by an experienced pathologist. In this study, we trained a deep convolutional neural network (Inception v3) on whole-slide images obtained from The Cancer Genome Atlas to accurately and automatically classify them into LUAD, LUSC or normal lung tissue. The performance of our method is comparable to that of pathologists, with an average area under the curve (AUC) of 0.97. Our model was validated on independent datasets of frozen tissues, formalin-fixed paraffin-embedded tissues and biopsies. Furthermore, we trained the network to predict the ten most commonly mutated genes in LUAD. We found that six of them (STK11, EGFR, FAT1, SETBP1, KRAS and TP53) can be predicted from pathology images, with AUCs from 0.733 to 0.856 as measured on a held-out population. These findings suggest that deep-learning models can assist pathologists in the detection of cancer subtype or gene mutations. Our approach can be applied to any cancer type, and the code is available at https://github.com/ncoudray/DeepPATH.
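
For readers unfamiliar with the AUC figures quoted in the abstract, the following is a minimal sketch of the rank-based definition of the area under the ROC curve for a binary task: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the DeepPATH code; how the study averages AUCs across the three classes is not stated here and is not reproduced.

```python
def roc_auc(labels, scores):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative).

    Hypothetical helper for illustration: compares every positive score
    with every negative score and counts ties as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): one positive is ranked below one negative.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a few hundred slides; a sorted-rank implementation would be preferable for larger sets.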

Conflict of interest statement

Competing Financial Interests Statement

The authors declare no competing interests.

Figures

Figure 1. Data and strategy:
(a) Number of whole-slide images per class. (b) Strategy: (b1) images of lung cancer tissues were first downloaded from the Genomic Data Commons database; (b2) slides were then separated into a training set (70%), a validation set (15%) and a test set (15%); (b3) slides were tiled into non-overlapping 512×512-pixel windows, omitting those with over 50% background; (b4) the Inception v3 architecture was used and partially or fully re-trained using the training and validation tiles; (b5) classifications were performed on tiles from an independent test set, and the results were finally aggregated per slide to extract the heatmaps and the AUC statistics. (c) Size distribution of the image widths (gray) and heights (black). (d) Distribution of the number of tiles per slide.
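
Steps (b2) and (b3) of the strategy can be sketched as follows. This is a hedged illustration under stated assumptions, not the DeepPATH implementation: the image is a plain list of rows of grayscale values, a pixel counts as background when its value is at or above a hypothetical brightness `threshold` of 220, partial tiles at the right and bottom edges are dropped, and the split is a seeded random shuffle of slide identifiers. The function names are invented for this example.

```python
import random

TILE = 512  # tile edge in pixels, as stated in the caption


def tile_origins(width, height, tile=TILE):
    """Top-left corners of non-overlapping tiles that fit fully in the image."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]


def background_fraction(gray, x0, y0, tile=TILE, threshold=220):
    """Fraction of pixels in one tile at or above the assumed background threshold."""
    bright = sum(1
                 for row in gray[y0:y0 + tile]
                 for value in row[x0:x0 + tile]
                 if value >= threshold)
    return bright / (tile * tile)


def keep_tiles(gray, tile=TILE, max_background=0.5, threshold=220):
    """Origins of tiles whose background fraction does not exceed 50%."""
    height = len(gray)
    width = len(gray[0]) if gray else 0
    return [(x, y)
            for x, y in tile_origins(width, height, tile)
            if background_fraction(gray, x, y, tile, threshold) <= max_background]


def split_slides(slide_ids, seed=0):
    """Slide-level 70/15/15 split; the remainder after rounding goes to the test set."""
    ids = sorted(slide_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 70 // 100
    n_valid = len(ids) * 15 // 100
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]
```

Splitting by slide rather than by tile keeps all tiles of one slide in the same set, which matches the order of steps (b2) and (b3) in the caption.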
Figure 2. Classification of presence and type of tumor on alternative cohorts:
Receiver Operating Characteristic (ROC) curves (left) from tests on (a) frozen sections (n=98 biologically independent slides), (b) formalin-fixed paraffin-embedded (FFPE) sections (n=140 biologically independent slides) and (c) biopsies (n=102 biologically independent slides) from NYU Langone Medical Center. On the right of each plot, we show examples of raw images with an overlay in light grey of the mask generated by a pathologist and the corresponding heatmaps obtained with the three-way classifier. Scale bars are 1 mm.
Figure 3. Gene mutation prediction from histopathology slides gives promising results for at least 6 genes:
(a) Mutation probability distribution for slides where each mutation is present or absent (tile aggregation by averaging output probability). (b) ROC curves associated with the top four predictions from (a). (c) Allele frequency as a function of slides classified by the deep learning network as having a certain gene mutation (P≥0.5), or the wild-type (P<0.5). p-values estimated with a two-tailed Mann-Whitney U-test are shown as ns (p>0.05), * (p≤0.05), ** (p≤0.01) or *** (p≤0.001). For a, b and c, n=62 slides from 59 patients. For the two box plots, whiskers represent the minima and maxima. The middle line within the box represents the median.
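
The tile aggregation mentioned in (a) and the P≥0.5 cut-off used in (c) can be sketched as follows. This is a minimal illustration that assumes tile-level outputs arrive as `(slide_id, probability)` pairs for a single gene; the function names and the toy values are hypothetical and do not come from the study.

```python
from collections import defaultdict


def slide_probabilities(tile_predictions):
    """Average the per-tile mutation probabilities of each slide.

    tile_predictions: iterable of (slide_id, probability) pairs (assumed layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for slide_id, probability in tile_predictions:
        sums[slide_id] += probability
        counts[slide_id] += 1
    return {slide_id: sums[slide_id] / counts[slide_id] for slide_id in sums}


def call_mutations(slide_probs, threshold=0.5):
    """Label a slide as mutated when its averaged probability is >= threshold."""
    return {slide_id: p >= threshold for slide_id, p in slide_probs.items()}


# Toy example with made-up tile outputs for two slides.
tiles = [("A", 0.9), ("A", 0.7), ("B", 0.2), ("B", 0.4), ("B", 0.3)]
probs = slide_probabilities(tiles)
calls = call_mutations(probs)
```

Averaging gives every tile equal weight; other aggregation rules, such as the fraction of tiles above a threshold, would be a different design choice and are not shown here.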
Figure 4. Spatial heterogeneity of predicted mutations:
(a) Probability distribution on LUAD tiles for the 6 predictable mutations, with average values shown as dotted lines (n=327 non-overlapping tiles). The allele frequency is 0.33 for TP53, 0.25 for STK11 and 0 for the 4 other mutations. Heatmaps of (b) TP53 and (c) STK11 when only tiles classified as LUAD are selected, and in (d) and (e) when all the tiles are considered. Scale bars are 1 mm.
