Multicenter Study
Nat Commun. 2025 May 27;16(1):4886.
doi: 10.1038/s41467-025-60014-x.

Generating dermatopathology reports from gigapixel whole slide images with HistoGPT

Manuel Tran et al. Nat Commun.

Abstract

Histopathology is the reference standard for diagnosing the presence and nature of many diseases, including cancer. However, analyzing tissue samples under a microscope and summarizing the findings in a comprehensive pathology report is time-consuming, labor-intensive, and non-standardized. To address this problem, we present HistoGPT, a vision language model that generates pathology reports from a patient's multiple full-resolution histology images. It is trained on 15,129 whole slide images from 6705 dermatology patients with corresponding pathology reports. The generated reports match the quality of human-written reports for common and homogeneous malignancies, as confirmed by natural language processing metrics and domain expert analysis. We evaluate HistoGPT in an international, multi-center clinical study and show that it can accurately predict tumor subtypes, tumor thickness, and tumor margins in a zero-shot fashion. Our model demonstrates the potential of artificial intelligence to assist pathologists in evaluating, reporting, and understanding routine dermatopathology cases.

Conflict of interest statement

Competing interests: M.T. is employed by Roche Diagnostics GmbH but conducted his research independently of his work at Roche Diagnostics GmbH as a guest scientist at Helmholtz Munich (Helmholtz Zentrum München—Deutsches Forschungszentrum für Gesundheit und Umwelt GmbH). The remaining authors declare no competing interests.

Ethics: An interdisciplinary team of computer scientists, dermatologists, and pathologists from different institutions worked closely together. They shared their expertise and maintained the integrity of the scientific record throughout the study. Local researchers were involved in the research process to ensure that the study was locally relevant. Roles and responsibilities were agreed prior to the study and capacity-building plans were discussed. All research procedures were conducted in accordance with the Declaration of Helsinki. Ethics approval was granted by the Ethics Committee of the Technical University Munich (reference number 2024-98-S-CB) and the Ethics Committee of Westfalen-Lippe (reference number 2024-157-b-S).

Figures

Fig. 1. HistoGPT, a foundation vision language model for dermatopathology.
a Traditionally, pathologists analyze tissue samples from patients under a microscope and summarize their findings in a comprehensive pathology report. This manual process is time-consuming, labor-intensive, and non-standardized. b HistoGPT generates human-level written reports, provides disease classification, discriminates between tumor subtypes, predicts tumor depth, detects tumors at surgical margins, and returns text-to-image gradient-attention maps that provide model explainability. All of this serves as a second opinion for the pathologist, who can use the output of HistoGPT as a general overview and first draft for the final report. The generated reports can also be used to fill in standardized templates, as used by some institutions, by extracting the relevant keywords. c An example output for a basal cell carcinoma case from our external Münster cohort. More examples can be viewed interactively at this hyperlink. Source data are provided as a Source Data file.
Fig. 2. HistoGPT simultaneously learns from vision and language to generate histology reports from whole slide images.
a HistoGPT is available in three sizes (Small, Medium, and Large). It consists of a patch encoder (CTransPath for HistoGPT-S/HistoGPT-M and UNI for HistoGPT-L), a position encoder (used only in HistoGPT-L), a slide encoder (the Perceiver Resampler), a language model (BioGPT base for HistoGPT-S, BioGPT large for HistoGPT-M/HistoGPT-L), and tanh-gated cross-attention blocks (XATTN). Specifically, HistoGPT takes a series of whole slide images (WSIs) at 10×–20× as input and outputs a written report. Optionally, users can query the model for additional details using prompts such as “The tumor thickness is”, and the model will complete the sentence, e.g., “The tumor thickness is 1.2 mm”. b We train HistoGPT in two phases. In the first phase, the vision module of HistoGPT is pre-trained using multiple instance learning (MIL). In the second phase, we freeze the pre-trained layers and fine-tune the language module on the image-text pairs. To prevent the model from overfitting on the same sentences, we apply text augmentation with GPT-4 to paraphrase the original reports. c During deployment, we use an inference method called Ensemble Refinement (ER). Here, the model stochastically generates multiple possible reports using a combination of temperature, top-p, and top-k sampling to capture different aspects of the input image. An aggregation module (GPT-4) then combines the results to provide a more complete description of the underlying case. Source data are provided as a Source Data file.
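The stochastic generation step of Ensemble Refinement in panel c can be illustrated with a minimal sketch of combined temperature, top-k, and top-p sampling. The function names (`sample_token`, `generate_candidates`) and the fixed per-step logits are hypothetical stand-ins, not the authors' implementation: a real model would condition each step on the previously generated tokens, and the GPT-4 aggregation step is not shown.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token index from raw logits using temperature, top-k, and top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p (nucleus): keep the shortest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices normalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def generate_candidates(logits_per_step, n_candidates, seed=0, **sampling):
    """Draw several token sequences from fixed per-step logits (toy model stand-in)."""
    rng = random.Random(seed)
    return [
        [sample_token(step, rng=rng, **sampling) for step in logits_per_step]
        for _ in range(n_candidates)
    ]
```

Calling `generate_candidates(steps, n_candidates=5, temperature=0.8, top_k=50, top_p=0.95)` would yield five candidate sequences that an aggregation module could then merge; these parameter values are illustrative and are not taken from the paper.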
Fig. 3. HistoGPT generates human-level pathology reports of skin diseases.
a Our internal Munich dataset is a real-world medical cohort of 15,129 whole slide images from 6705 patients with 167 skin diseases from the Department of Dermatology at the Technical University of Munich. It includes malignant cases such as basal cell carcinoma (BCC, n = 870) and squamous cell carcinoma (SCC, n = 297); precursor lesions such as actinic keratosis (AK, n = 396); as well as benign cases such as benign melanocytic nevus (BMN, n = 770) and seborrheic keratosis (SK, n = 412). We divided the dataset into a training set and a test set using a stratified 75/25 split at the patient level. b Through years of experience, pathologists are often able to make a diagnosis at first glance. Instead of writing a pathology report themselves, they can use HistoGPT in “Expert Guidance” mode by giving the model the correct diagnosis to complete the report. c We evaluated the performance of the model using four semantic-based machine learning metrics: (i) we matched critical medical terms extracted from the original text with the generated text using a dermatology dictionary; (ii) we used the same technique but with ScispaCy, a scientific named entity recognition tool, as the keyword extractor; (iii) we compared the semantic meaning of the original and generated reports by measuring the cosine similarity of their text embeddings generated by the biomedical language model BioBERT; (iv) we used the same technique but with the general purpose large language model GPT-3-ADA for text embedding. d HistoGPT models (blue) surpassed BioGPT-1B (yellow) and GPT-4V (red) on the two text accuracy metrics, Dictionary and ScispaCy, as well as on the two text similarity metrics, BioBERT and GPT-3-ADA (see Methods for details). e Two independent external board-certified dermatopathologists (P1 and P2) evaluated 100 original vs. expert-guided generated reports along with the corresponding whole slide image in a randomized, blinded study. Source data are provided as a Source Data file.
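The two metric families in panel c can be sketched in a few lines. This is an illustrative approximation only: the keyword score is assumed here to be a Jaccard index over dictionary terms found in each report (the exact scoring is defined in the paper's Methods, which this page does not reproduce), and the embedding vectors would in practice come from a model such as BioBERT rather than being supplied by hand. The names `extract_terms`, `keyword_overlap`, and `cosine_similarity` are hypothetical.

```python
import math
import re


def extract_terms(text, vocabulary):
    """Return the vocabulary terms that occur as whole words or phrases in the text."""
    lowered = text.lower()
    return {
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    }


def keyword_overlap(reference, generated, vocabulary):
    """Jaccard index between the keyword sets of two reports (assumed scoring)."""
    ref_terms = extract_terms(reference, vocabulary)
    gen_terms = extract_terms(generated, vocabulary)
    union = ref_terms | gen_terms
    if not union:
        return 1.0  # neither report mentions any dictionary term
    return len(ref_terms & gen_terms) / len(union)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Keyword matching rewards reports that name the same findings, while cosine similarity of embeddings can also credit paraphrases that share no exact terms, which is presumably why both kinds of metric are reported.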
Fig. 4. HistoGPT accurately predicts diseases in-domain and out-of-domain without human guidance.
a In the absence of a human-in-the-loop, HistoGPT predicts the patient’s diagnosis on its own and generates the corresponding pathology report. b On the Munich test set, HistoGPT was on par with state-of-the-art classification models in predicting over 100 dermatological diseases, even though the model’s output is pure text. c HistoGPT discriminated malignant from benign conditions with high accuracy on the Munich dataset: basal cell carcinoma (BCC, n = 107) vs. other conditions (n = 621) with an accuracy of 0.98 and a weighted F1 score of 0.98; actinic keratosis (AK, n = 47) vs. squamous cell carcinoma (SCC, n = 33) with an accuracy of 0.88 and a weighted F1 score of 0.87; benign melanocytic nevus (BMN, n = 86) vs. melanoma (n = 21) with an accuracy of 0.89 and a weighted F1 score of 0.89. d We evaluated HistoGPT in 5 independent external cohorts (Münster-3H, TCGA-SKCM, CPTAC-CM, Queensland, Linköping) covering different countries, scanner types, staining techniques, and biopsy methods. e HistoGPT performed equal to or better than state-of-the-art MIL on external datasets, especially when using self-prompting (“Classifier Guidance”). The box plots show the quantiles as a black line and the mean as an inner circle obtained from 1000 bootstraps. The minimum and maximum values are shown as white circles at the top and bottom. f HistoGPT was able to produce highly accurate pathology reports, as indicated by the high keyword and cosine-based similarity scores for Münster-1K. As in Fig. 3c, the lower baseline compares two randomly selected reports. Source data are provided as a Source Data file.
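For readers unfamiliar with the two scores quoted in panel c, the following sketch shows how accuracy and a support-weighted F1 score are conventionally computed from true and predicted labels. It assumes the standard definitions (per-class F1 averaged with weights proportional to class frequency in the ground truth); the labels in the usage note are invented for illustration and are not study data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cases where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true frequency."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)
```

With `y_true = ["BCC", "BCC", "other", "other", "other"]` and `y_pred = ["BCC", "other", "other", "other", "other"]`, accuracy is 0.8 while the weighted F1 is slightly lower, because the missed BCC case reduces recall for the smaller class. On imbalanced comparisons such as BCC (n = 107) vs. other conditions (n = 621), reporting both numbers guards against a model that scores well simply by favoring the majority class.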
Fig. 5. HistoGPT predicts tumor thickness, subtypes, as well as margins in a zero-shot fashion and provides text-to-image visualization.
a HistoGPT achieved high zero-shot performance in predicting tumor thickness on the internal Munich test set. The scatter plot is color-coded according to the classes in Fig. 3a. b HistoGPT’s prediction was also highly correlated with the ground truth on the external Münster-3H test set, even though it was obtained using a different measurement protocol. c Since HistoGPT is an interpretable AI system, we can understand its outputs. Here we show the two examples marked with a red arrow in panel b. Attention scores range from 0 (low attention) to 1 (high attention) as indicated by the color bar. d Encoding the position of each patch for the large HistoGPT model greatly improved its spatial awareness. All scatter plots include the linear regression estimate along with the 95% confidence interval as a shaded area (orange). Statistical tests were performed using a two-tailed test. e On the BCC subset of the independent Münster-3H cohort, HistoGPT was the only slide-level model that correctly predicted infiltrative BCC in most cases. The two patch-level models CONCH and PLIP failed in this task, predicting almost all samples as superficial. f Given WSIs of superficial, solid, and infiltrating BCC, HistoGPT correctly identified their morphological structures as shown by the high attention regions for the respective text strings. g HistoGPT predicted whether the surgical margin contained tumor or healthy cells on the out-of-distribution Münster-1K cohort without fine-tuning. Source data are provided as a Source Data file.
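A zero-shot thickness evaluation of this kind needs two steps: extracting the number from a prompted completion such as “The tumor thickness is 1.2 mm”, and correlating the extracted values with the measured ones. The sketch below is one possible way to do this, not the authors' pipeline. The regular expression, the function names, and the choice of the Pearson coefficient are assumptions; the caption does not name the correlation statistic used.

```python
import math
import re

# Assumed completion format, based on the example prompt quoted in Fig. 2a.
THICKNESS_PATTERN = re.compile(
    r"tumor thickness is\s*(\d+(?:\.\d+)?)\s*mm", re.IGNORECASE
)


def parse_thickness_mm(completion):
    """Extract the thickness in millimetres from a completion, or None if absent."""
    match = THICKNESS_PATTERN.search(completion)
    return float(match.group(1)) if match else None


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Completions from which no number can be parsed return `None` and would need to be excluded or counted separately before computing the correlation.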
Fig. 6. HistoGPT produces clinically accurate and consistent pathology reports for common diseases, as confirmed in a real-world, multi-center clinical study.
a Skin biopsies were randomly collected from routine cases at the Mayo Clinic (USA), University Hospital Münster (Germany), and Radboud University Medical Center (The Netherlands). b Two board-certified (dermato-)pathologists at each site evaluated the generated reports according to the following criteria: (5) beyond expectation, (4) highly accurate, (3) generally accurate with minor variations without clinical impact, (2) partially accurate with variations that could have clinical impact, (1) minimally accurate, (0) completely inaccurate. A score greater than 2 indicates a diagnosis that is considered correct or within an acceptable range of subjectivity. c HistoGPT produced consistent and accurate reports for the most common neoplastic epithelial lesions (achieving an average score of 2 or higher for each class) and struggled with classes with little training data (<200 data points) or classes that cannot be predicted from imaging alone (re-excisions). While our training dataset (the Munich cohort) covers all diseases provided by the three institutions, it contains many heterogeneous categories with different numbers of samples and subtypes. For instance, neoplastic cases have 1554 samples across 64 diseases, with an average of 27 data points per class. In addition, reporting standards vary widely between institutions, resulting in large variability in scores. Therefore, we also examined interobserver variability by having dermatopathologists from Mayo (κ = 0.055) and Münster (κ = 0.295) review the reports generated for the Radboud cohort. Source data are provided as a Source Data file.
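The interobserver agreement values (κ) in panel c measure how often two raters agree beyond what chance alone would produce. As a rough illustration, the sketch below computes the unweighted Cohen's kappa for two raters scoring the same reports. Whether the study used the unweighted or a weighted variant is not stated in this caption, so this choice is an assumption, and the scores in the usage note are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters scored independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa([3, 3, 2, 4], [3, 2, 2, 4])` gives 7/11 (about 0.64): the raters agree on three of four reports, but part of that agreement is expected by chance. A kappa near 0, like the Mayo value of 0.055, indicates agreement barely above chance.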
Fig. 7. Pathology-informed analysis of failure mechanisms.
a Like all deep learning algorithms, HistoGPT learns the most distinctive features for each class to reliably discriminate between them. However, for diseases that were rarely seen during training (e.g., psoriasis), these features are not sufficient to be applied to unseen cases and may be confused with features from related diseases (e.g., eczema). b Even if the cases were seen often enough during training, the tissue sample may contain tissue composition, color dynamics, and other variations that were not encountered during training. For example, we found an image of erythrocytes similar to images of eosinophils that the model saw during training, leading to the activation of eosinophil-related concepts in the neural network. c Similarly, there was a case of Clark’s level II melanoma (top) that mimicked the Bowenoid growth pattern of squamous cell carcinoma (bottom) and was predicted as squamous cell carcinoma. d Another case was a grade 3 acute graft-versus-host disease (GVHD, top) that mimicked actinic keratosis (bottom)—HistoGPT diagnosed the latter. e When the whole slide image contains components of different diseases, HistoGPT tends to predict the most likely diagnosis (the class seen most often during training), not the most significant one. This happened in a case of a melanocytic nevus that also showed patterns of seborrheic keratosis. Source data are provided as a Source Data file.
