Nature. 2024 Oct;634(8033):466-473. doi: 10.1038/s41586-024-07618-3. Epub 2024 Jun 12.

A multimodal generative AI copilot for human pathology

Ming Y Lu et al. Nature. 2024 Oct.

Abstract

Computational pathology (refs. 1,2) has witnessed considerable progress in the development of both task-specific predictive models and task-agnostic self-supervised vision encoders (refs. 3,4). However, despite the explosive growth of generative artificial intelligence (AI), there have been few studies on building general-purpose multimodal AI assistants and copilots (ref. 5) tailored to pathology. Here we present PathChat, a vision-language generalist AI assistant for human pathology. We built PathChat by adapting a foundational vision encoder for pathology, combining it with a pretrained large language model and fine-tuning the whole system on over 456,000 diverse visual-language instructions consisting of 999,202 question-and-answer turns. We compare PathChat with several multimodal vision-language AI assistants and GPT-4V, which powers the commercially available multimodal general-purpose AI assistant ChatGPT-4 (ref. 6). PathChat achieved state-of-the-art performance on multiple-choice diagnostic questions from cases with diverse tissue origins and disease models. Furthermore, using open-ended questions and human expert evaluation, we found that overall PathChat produced more accurate and pathologist-preferable responses to diverse queries related to pathology. As an interactive vision-language AI copilot that can flexibly handle both visual and natural language inputs, PathChat may potentially find impactful applications in pathology education, research and human-in-the-loop clinical decision-making.


Conflict of interest statement

A patent corresponding to this work has been filed by Mass General Brigham (Application 63/608,671). The tools, processes and models associated with PathChat have been exclusively licensed to ModellaAI. L.P.L., M.Y.L., R.J.C., B.C., F.M., D.F.K.W. and J.J.W. hold equity interests in ModellaAI.

Figures

Fig. 1. Curation of instruction-following dataset and PathChat overview.
a, We curated what is presently the largest instruction fine-tuning dataset specialized for pathology. It consists of 456,916 instructions and corresponding responses covering various formats (for example, multi-turn conversations, multiple-choice questions and short answers; see Extended Data Fig. 1 for complete examples) from diverse sources. b, To build an MLLM-based vision-language AI assistant that can reason over visual and natural language inputs, we began with UNI, a state-of-the-art self-supervised, vision-only pretrained foundation encoder, and performed further vision-language pretraining analogous to CONCH. The resulting vision encoder was subsequently connected to a 13-billion-parameter pretrained Llama 2 LLM through a multimodal projector module (not shown) to form the complete MLLM architecture. The MLLM was fine-tuned on the curated instruction-following dataset to build PathChat, a vision-language AI assistant specialized for human pathology. More details about data curation and model training can be found in ‘Curation of the PathChat dataset’ and ‘Design and training of the PathChat model’ in Methods, respectively. Scale bars, 200 µm.
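As a rough illustration of the composition described in b (vision encoder → multimodal projector → LLM), the PyTorch sketch below wires the three pieces together. It is a minimal sketch, not the authors' implementation: the module names, embedding dimensions and the two-layer MLP projector are assumptions made purely for illustration.

    import torch
    import torch.nn as nn

    class MultimodalProjector(nn.Module):
        """Maps vision-encoder patch embeddings into the LLM token-embedding space.
        A two-layer MLP is a common choice in LLaVA-style assistants; the exact
        design used for PathChat is not specified here, so this is an assumption."""
        def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
            return self.net(patch_embeddings)

    class PathologyMLLM(nn.Module):
        """Vision-language assistant: image ROI -> vision encoder -> projector -> LLM.
        `vision_encoder` stands in for a pathology foundation encoder (for example,
        UNI after further vision-language pretraining) and `llm` for a pretrained
        Llama 2 13B model; both are placeholder modules in this sketch."""
        def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                     vision_dim: int = 1024, llm_dim: int = 5120):
            super().__init__()
            self.vision_encoder = vision_encoder
            self.projector = MultimodalProjector(vision_dim, llm_dim)
            self.llm = llm

        def forward(self, image: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
            # Encode the histology ROI into patch embeddings: (B, N_patches, vision_dim).
            visual_tokens = self.vision_encoder(image)
            # Project visual tokens into the LLM embedding space and prepend them
            # to the embedded text instruction so the LLM attends over both modalities.
            visual_tokens = self.projector(visual_tokens)
            inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
            return self.llm(inputs)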
Fig. 2. Multiple-choice evaluation of PathChat.
a, Illustrative example of a multiple-choice diagnostic question. The input always includes a salient ROI of a histology image selected by a board-certified anatomic pathologist and an instruction to select the most probable diagnosis from a set of possible choices. In the image + clinical context evaluation setting, which was designed to more closely mimic a real-world diagnostic workflow, relevant clinical context (designed by the pathologist, shown in blue) is provided together with the histology image and prepended to the original question. Scale bar, 200 µm. b, Accuracy of MLLMs on multiple-choice diagnostic questions. Combined (n = 105 questions), PathQABench-Public (n = 52) and PathQABench-Private (n = 53). Note that we compare against GPT-4V only for questions based on publicly available cases (PathQABench-Public). Error bars represent 95% confidence intervals, and the centres represent the computed accuracy.
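The accuracies in b are reported with 95% confidence intervals. As a minimal sketch of how such an interval can be obtained from per-question correctness, the snippet below uses a nonparametric bootstrap; the paper does not state which interval method was used (a Wilson or Clopper–Pearson interval would be equally reasonable), and the example counts are hypothetical.

    import numpy as np

    def accuracy_with_ci(correct: np.ndarray, n_boot: int = 10000, seed: int = 0):
        """Point accuracy plus a bootstrap 95% CI over per-question correctness (0/1)."""
        rng = np.random.default_rng(seed)
        acc = correct.mean()
        boot = rng.choice(correct, size=(n_boot, correct.size), replace=True).mean(axis=1)
        low, high = np.percentile(boot, [2.5, 97.5])
        return acc, (low, high)

    # Hypothetical example: 80 of 105 combined questions answered correctly.
    correct = np.array([1] * 80 + [0] * 25)
    acc, (low, high) = accuracy_with_ci(correct)
    print(f"accuracy = {acc:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")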
Fig. 3. Open-response evaluation of PathChat and reader study from a panel of seven pathologists.
a, Evaluation workflow for ranking model outputs for open-ended questions. A panel of seven pathologists was recruited to assess the model responses for the 260 open-ended questions. The ordering of responses by the four AI assistant models was randomly shuffled for each question, and each pathologist independently ranked them for all questions while being blinded to which model produced which response (see ‘MLLM evaluation’ in Methods for more details). Scale bar, 200 µm. b, Head-to-head records on open-ended questions for PathChat versus other MLLMs, evaluated by seven pathologists independently. Win, PathChat was ranked higher than the model. Tie, PathChat tied with the model in terms of ranking. Lose, the model was ranked higher than PathChat. Vertical bars represent the median win rate (dark green) across all seven pathologists and the median win + tie rate (light green). c, Accuracy of MLLMs on a subset (n = 235 questions) of open-ended questions for which two pathologists reached a consensus after discussing independent evaluations of model responses. d, Accuracy for different categories of questions on the consensus subset. Microscopy (n = 101), diagnosis (n = 79), clinical (n = 61) and ancillary testing (n = 76). Each question could belong to more than one category. In c,d, error bars represent 95% confidence intervals, and the centres represent the computed accuracy.
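The win/tie/lose records in b are derived from per-question rankings of the four model responses. The sketch below shows one plausible way to aggregate such rankings into head-to-head rates against PathChat and to take medians across readers; the data layout and the example rankings are hypothetical and are not the study data.

    from collections import Counter
    from statistics import median

    def head_to_head(rankings, target="PathChat", competitor="GPT-4V"):
        """rankings: list of dicts mapping model name -> rank (1 = best) for one reader.
        Returns win/tie/lose rates of `target` versus `competitor` across questions."""
        counts = Counter()
        for ranks in rankings:
            if ranks[target] < ranks[competitor]:
                counts["win"] += 1
            elif ranks[target] == ranks[competitor]:
                counts["tie"] += 1
            else:
                counts["lose"] += 1
        n = len(rankings)
        return {k: counts[k] / n for k in ("win", "tie", "lose")}

    # Hypothetical rankings from two readers over two questions each.
    reader_a = [
        {"PathChat": 1, "GPT-4V": 2, "LLaVA 1.5": 3, "LLaVA-Med": 4},
        {"PathChat": 2, "GPT-4V": 2, "LLaVA 1.5": 1, "LLaVA-Med": 4},
    ]
    reader_b = [
        {"PathChat": 1, "GPT-4V": 3, "LLaVA 1.5": 2, "LLaVA-Med": 4},
        {"PathChat": 3, "GPT-4V": 1, "LLaVA 1.5": 2, "LLaVA-Med": 4},
    ]
    per_reader = [head_to_head(r) for r in (reader_a, reader_b)]
    print("median win rate:", median(r["win"] for r in per_reader))
    print("median win + tie rate:", median(r["win"] + r["tie"] for r in per_reader))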
Fig. 4. Exploring use cases of PathChat.
a–e, Beyond evaluating PathChat on answering multiple-choice and single-turn open-ended questions, we explored other use cases. The panels contain examples that involve follow-up from users in the form of interactive, multi-turn conversations. These examples are illustrative in nature and intended to complement our quantitative evaluation on PathQABench. a, PathChat summarized key morphological features in a histology image. Based on the clinical context, it could reasonably infer the primary origin of the tumour. b, PathChat is familiar with different cell markers and can potentially help guide IHC interpretation. c, PathChat understands and can attempt to follow well-known guidelines on tumour grading, in this case the Gleason grading system for prostate adenocarcinoma. d, PathChat can describe tumour tissue and cell morphology, infer a diagnosis and correctly suggest potential IHC findings grounded in relevant background knowledge about the suspected malignancy. e, PathChat can potentially be consulted to perform human-in-the-loop differential diagnosis that may require several rounds of an IHC workup. Scale bars, 200 µm.
Extended Data Fig. 1. Examples of instructions for fine-tuning the MLLM.
An example of each of the six different types of instructions used to develop PathChat via instruction fine-tuning is illustrated. Bolded text represents instructions provided to the model, while italicized text represents the reference outputs the model is expected to produce during training. More details on dataset curation are provided in ‘Curation of the PathChat dataset’ in Methods. Scale bars, 200 µm.
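To make the instruction-following format concrete, the snippet below sketches what a single training record of this kind might look like: an image reference plus instruction text and a reference response, here in the multi-turn conversation format. The field names and contents are hypothetical illustrations, not the authors' actual data schema.

    # Hypothetical shape of one instruction-following record; all field names and
    # values are illustrative only.
    example_record = {
        "image": "roi_000123.png",            # salient histology ROI for this instruction
        "format": "multi-turn conversation",  # one of the six instruction types
        "turns": [
            {"role": "user",
             "content": "<image>\nDescribe the key morphological features in this image."},
            {"role": "assistant",
             "content": "The section shows ... (reference response used as the training target)"},
            {"role": "user",
             "content": "Given these features, what is the most likely diagnosis?"},
            {"role": "assistant",
             "content": "... (reference diagnosis used as the training target)"},
        ],
    }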
Extended Data Fig. 2. Utilization of visual input and clinical context in multiple-choice diagnostic questions.
On the multiple-choice diagnostic benchmarks (Combined, n = 105 questions; PathQABench-Private, n = 53; PathQABench-Public, n = 52), we investigated whether PathChat can effectively leverage both unstructured clinical context in the form of natural language and visual features in the image ROI, rather than deriving its answer solely from either input alone. In the context-only setting, the clinical context is provided to the model but the image is not (see Fig. 2a for an example multiple-choice question that contains the clinical context, the choices and the image). Conversely, in the image-only setting, the clinical context is not provided, and the model is asked to infer the correct diagnosis from the possible choices based solely on the image. We observed that PathChat achieves maximum performance when both the clinical context and the image are provided. Error bars represent 95% confidence intervals, and the centres represent the computed accuracy.
Extended Data Fig. 3. Comparing model outputs on open-ended question answering, example 1.
An example question from PathQABench-Public regarding uveal melanoma, for which the response by PathChat was ranked higher (considered more preferable by expert pathologists) than those of the other models because it clearly, correctly and fully addresses the query. The other models misidentify the anatomic location shown in the image, describe the image incorrectly or are so general as to be unhelpful. Scale bar, 200 µm.
Extended Data Fig. 4. Comparing model outputs on open-ended question answering, example 2.
An example question from PathQABench-Public regarding glioblastoma, for which the responses by all models were considered to be of roughly comparable quality by expert pathologists, as all produced reasonable and reasonably accurate responses to the query, though with some variation between them. Scale bar, 200 µm.
Extended Data Fig. 5. Comparing model outputs on open-ended question answering, example 3.
An example question from PathQABench-Public regarding lung adenocarcinoma, for which all four models performed poorly. None of the four models accurately described the image or produced the correct diagnosis. Scale bar, 200 µm.
Extended Data Fig. 6. Individual pathologist evaluation of open-response performance.
a, Accuracy of MLLMs on open-ended questions (n = 260) as evaluated by two pathologists. See Fig. 3c,d for accuracy on the subset of open-ended questions for which the two pathologists reached a consensus. See ‘MLLM evaluation’ in Methods for details. b, Accuracy on different categories of questions as rated by two pathologists. Microscopy (n = 109), diagnosis (n = 87), clinical (n = 68) and ancillary testing (n = 87). Each question may belong to more than one category. In a,b, error bars represent 95% confidence intervals, and the centres represent the computed accuracy.
Extended Data Fig. 7. Example questions from PathQABench-Public.
PathQABench contains 260 high-quality, expert-reviewed, open-ended questions created using cases from PathQABench-Public, aimed at assessing a wide range of skills relevant to the practice of pathology. Each question is assigned one or more broad categories and subcategories based on the topics and skills that it aims to assess. The broad categories are “Microscopy”, “Diagnosis”, “Clinical” and “Ancillary testing”. A detailed description of each category is included in Supplementary Data Table 15. Scale bars, 200 µm.
Extended Data Fig. 8. Performance on PathQABench open-ended questions stratified by broad categories.
We analyse the head-to-head performance of PathChat against other MLLMs in each broad category of questions, evaluated by seven pathologists independently. For each competing model (LLaVA 1.5, LLaVA-Med and GPT-4V), we compute the win/tie/lose rate of PathChat against said model. Win (dark green), PathChat is ranked higher than the model; Tie (light green), PathChat is tied with the model in ranking; Lose (red), PathChat is ranked lower than the model. Vertical bars represent the median win rate (dark green) across all seven pathologists and the median win + tie rate (light green).
Extended Data Fig. 9. Performance on PathQABench open-ended questions stratified by subcategories.
We further analyse the head-to-head performance of PathChat against other MLLMs in each subcategory of questions, evaluated by seven pathologists independently. For each competing model (LLaVA 1.5, LLaVA-Med and GPT-4V), we compute the win/tie/lose rate of PathChat against said model. Win (dark green), PathChat is ranked higher than the model; Tie (light green), PathChat is tied with the model in ranking; Lose (red), PathChat is ranked lower than the model. Vertical bars represent the median win rate (dark green) across all seven pathologists and the median win + tie rate (light green).
Extended Data Fig. 10. Example of human-in-the-loop differential diagnosis with PathChat in a case of cancer of unknown primary.
PathChat can potentially be used to help the user perform human-in-the-loop differential diagnosis that combines a representative histology image, relevant clinical context and follow-up IHC results. Note that in this example, PathChat erroneously implies that cervical cancers should be positive for both CK7 and CK20 IHC when, in fact, cervical cancers are usually positive for CK7 but negative for CK20. Scale bar, 200 µm.

References

    1. Song, A. H. et al. Artificial intelligence for digital and computational pathology. Nat. Rev. Bioeng. 1, 930–949 (2023).
    2. Shmatko, A. et al. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat. Cancer 3, 1026–1038 (2022).
    3. Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850–862 (2024).
    4. Ciga, O., Xu, T. & Martel, A. L. Self supervised contrastive learning for digital histopathology. Mach. Learn. Appl. 7, 100198 (2022).
    5. Liu, H. et al. Visual instruction tuning. In Proc. Advances in Neural Information Processing Systems (eds Oh, A. et al.) 34892–34916 (Curran Associates, 2023).