[Preprint]. arXiv:2307.00589v2. 2023 Oct 4.

MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval


Qiao Jin et al. arXiv, 2023.


Abstract

Motivation: Information retrieval (IR) is essential for biomedical knowledge acquisition and clinical decision support. Recent progress has shown that language model encoders enable better semantic retrieval, but training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems conduct only lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine.

Results: To train MedCPT, we collected an unprecedented 255 million user click logs from PubMed. With these data, we used contrastive learning to train a closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming a variety of baselines including much larger models such as the GPT-3-sized cpt-text-XL. In addition, MedCPT generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to a wide range of real-world biomedical IR tasks.

Availability: The MedCPT code and API are available at https://github.com/ncbi/MedCPT.
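Below is a minimal sketch of dense retrieval with the released encoders. The Hugging Face checkpoint names (ncbi/MedCPT-Query-Encoder, ncbi/MedCPT-Article-Encoder), the example queries, and the example articles are assumptions for illustration; consult the repository above for the authoritative usage.

```python
# Hedged dense-retrieval sketch using the MedCPT encoders. The checkpoint
# names are assumed from the MedCPT repository, not stated in the abstract.
import torch
from transformers import AutoTokenizer, AutoModel

query_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
query_enc = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")
doc_tok = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")
doc_enc = AutoModel.from_pretrained("ncbi/MedCPT-Article-Encoder")

queries = ["diabetes treatment"]  # hypothetical example query
articles = [                       # hypothetical example article texts
    "Metformin as first-line therapy for type 2 diabetes ...",
    "Deep learning for protein structure prediction ...",
]

with torch.no_grad():
    q = query_tok(queries, truncation=True, padding=True, return_tensors="pt")
    q_emb = query_enc(**q).last_hidden_state[:, 0, :]  # [CLS] embeddings
    d = doc_tok(articles, truncation=True, padding=True, return_tensors="pt")
    d_emb = doc_enc(**d).last_hidden_state[:, 0, :]

# Score relevance by inner product (the MIPS setting shown in Figure 2),
# then rank articles per query from most to least relevant.
scores = q_emb @ d_emb.T
ranking = scores.argsort(dim=1, descending=True)
print(scores, ranking)
```

In a real deployment the article embeddings would be precomputed and served from a maximum inner product search index rather than scored on the fly.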


Figures

Figure 1.
A high-level overview of this work. MedCPT contains a query encoder (QEnc), a document encoder (DEnc), and a cross-encoder (CrossEnc). The query encoder and document encoder constitute the MedCPT retriever, which is contrastively trained on 255M query-article pairs with in-batch negatives from PubMed logs. The cross-encoder is the MedCPT re-ranker, contrastively trained on 18M non-keyword query-article pairs with local negatives retrieved by the MedCPT retriever. MedCPT achieves state-of-the-art performance on various biomedical information retrieval tasks under zero-shot settings, including query-article retrieval, sentence representation, and article representation.
Figure 2.
Overview of the MedCPT training process. (A) Training the MedCPT query encoder (QEnc) and document encoder (DEnc) with a contrastive loss over query-document pairs and in-batch negatives; (B) training the MedCPT cross-encoder (CrossEnc) with a contrastive loss over non-keyword query-article pairs and local negatives derived from the MedCPT retriever. Dashed and solid outlines denote untrained and pre-trained models, respectively. MIPS: maximum inner product search.
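To make the in-batch-negative objective of panel (A) concrete, the sketch below implements a standard InfoNCE-style contrastive loss over a batch of paired query and article embeddings. This illustrates the general technique only; the function name and temperature value are assumptions, not details taken from the paper.

```python
# Illustrative InfoNCE-style contrastive loss with in-batch negatives,
# as in Figure 2(A). The temperature hyperparameter is an assumption.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, temperature=0.05):
    """q_emb, d_emb: (batch, dim) tensors where q_emb[i] pairs with d_emb[i].

    Each query treats its clicked article as the positive and every other
    article in the batch as a negative, scored by inner product.
    """
    scores = q_emb @ d_emb.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(q_emb.size(0))     # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings:
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, d))
```

Panel (B) follows the same contrastive pattern, except that the negatives are "local": top-ranked but non-clicked articles returned by the trained retriever rather than other items in the batch.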


