MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval
- PMID: 37930897
- PMCID: PMC10627406
- DOI: 10.1093/bioinformatics/btad651
Abstract
Motivation: Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.
Results: To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With these data, we used contrastive learning to train a closely integrated retriever and re-ranker pair. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models such as the GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks.
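As an illustration of the contrastive learning setup described above, the following is a minimal PyTorch sketch of in-batch-negative (InfoNCE-style) training for a query/article encoder pair, where each query's clicked article is the positive and the other articles in the batch act as negatives. The loss form, temperature value, and function name are illustrative assumptions for exposition, not the exact MedCPT training recipe.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, article_emb, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (illustrative, not the
    published MedCPT recipe). Each query's clicked article is the positive;
    all other articles in the batch serve as negatives.
    query_emb, article_emb: (batch, dim) tensors from the two encoders."""
    query_emb = F.normalize(query_emb, dim=-1)
    article_emb = F.normalize(article_emb, dim=-1)
    # (batch, batch) similarity matrix; diagonal entries are the positives
    logits = (query_emb @ article_emb.T) / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)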
Availability and implementation: The MedCPT code and model are available at https://github.com/ncbi/MedCPT.
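For reference, a minimal query-encoding sketch using the Hugging Face transformers library is shown below. The checkpoint name ncbi/MedCPT-Query-Encoder and the use of the [CLS] embedding are assumptions to be verified against the linked repository.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name; verify against https://github.com/ncbi/MedCPT
tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
model = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")

queries = ["diabetes treatment", "covid vaccine efficacy"]
with torch.no_grad():
    inputs = tokenizer(queries, truncation=True, padding=True,
                       return_tensors="pt")
    # Take the [CLS] token embedding as the query representation
    embeds = model(**inputs).last_hidden_state[:, 0, :]
print(embeds.shape)  # e.g. (2, 768) for a BERT-base encoder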
Published by Oxford University Press 2023.
Conflict of interest statement
None declared.
Update of
- MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval. arXiv:2307.00589v2 [Preprint], 2023 Oct 4. PMID: 41031073.
