2025 Feb 5:2025:baaf006.
doi: 10.1093/database/baaf006.

LitSumm: large language models for literature summarization of noncoding RNAs

Andrew Green et al. Database (Oxford).

Abstract

Curation of literature in life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have resources to scale to the whole relevant literature and all have to prioritize their efforts. In this work, we take a first step to alleviating the lack of curator time in RNA science by generating summaries of literature for noncoding RNAs using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, with the majority being rated extremely high quality. We apply our tool to a selection of >4600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided that careful prompting and automated checking are applied. Database URL: https://rnacentral.org/.
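The "chain of prompts and checks" described in the abstract can be pictured as a generate, check, and re-prompt loop. The sketch below is illustrative only and is not the LitSumm implementation: the bracketed PMC citation format, the function names `unknown_references` and `summarize_with_checks`, the retry limit, and the wording of the feedback message are all assumptions, and `llm` stands in for any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable, Iterable, Optional, Set

# Assumed citation format for illustration: PMC identifiers in square brackets.
CITATION = re.compile(r"\[(PMC\d+)\]")


def unknown_references(summary: str, allowed_ids: Iterable[str]) -> Set[str]:
    """Return cited identifiers that are not among the source articles."""
    return set(CITATION.findall(summary)) - set(allowed_ids)


def summarize_with_checks(
    llm: Callable[[str], str],
    prompt: str,
    allowed_ids: Iterable[str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the model for a summary, check its references, re-prompt on failure."""
    allowed = set(allowed_ids)
    for _ in range(max_attempts):
        summary = llm(prompt)
        bad = unknown_references(summary, allowed)
        if not bad:
            return summary
        # Feed the failed check back to the model (wording is hypothetical).
        prompt = (
            f"{prompt}\n\nThese references are not in the context: "
            f"{sorted(bad)}. Rewrite the summary without them."
        )
    return None  # no draft passed the reference check
```

Returning `None` when every attempt fails keeps unchecked text out of the output; a real pipeline might instead log the failure for manual review.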

Conflict of interest statement

A.B. is an editor at DATABASE but was not involved in the editorial process of this manuscript.

Figures

Figure 1.
(a) The initial prompt used to generate a first-pass summary from the generated context. Variables are enclosed in {} and are replaced with their values before sending the prompt to the LLM. (b) Prompts used for the self-consistency checking stage including inaccurate statement detection and revision. All prompts are reproduced as plain text in Appendix A in the Supplementary material.
Figure 2.
A flow diagram of the whole LitSumm tool. Information from the EuropePMC API flows from left to right, through a sentence selection step, before several rounds of self-checking and refinement. Finished summaries are written to disk before being uploaded to the RNAcentral database en masse.
Figure 3.
The distribution of RNA types selected for summarization.
Figure 4.
Example summary generated by the tool. This example is an lncRNA; examples for other RNA types can be found in Appendix C of the Supplementary material.
Figure 5.
Example output of the veracity checker. In this case, CTBP1-DT presents two sentences validated as TRUE and two FALSE sentences. The offending sentences have been removed by the model in the final summary.
Figure 6.
The average rating per summary across all raters. Note that Rater 3 gave scores only for a subset of 21 miRNAs.
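Two mechanisms mentioned in the captions for Figures 1 and 5 can be sketched briefly: filling `{}` variables into a prompt before it is sent, and discarding sentences that a veracity check labels FALSE. This is a minimal sketch under stated assumptions. The template text and the names `TEMPLATE`, `fill_prompt`, and `drop_false_sentences` are hypothetical (the real prompts are in Appendix A of the Supplementary material), and the filter here is a plain deterministic substitute for the model-driven revision the paper describes.

```python
# Hypothetical template; the actual prompt text is not reproduced here.
TEMPLATE = (
    "Summarize what is known about {rna_id} using only this context:\n{context}"
)


def fill_prompt(template: str, **variables: str) -> str:
    """Replace each {name} placeholder with its value."""
    return template.format(**variables)


def drop_false_sentences(labelled):
    """Keep sentences labelled TRUE.

    `labelled` is a list of (sentence, verdict) pairs, where the verdict is
    assumed to be the string "TRUE" or "FALSE" in any letter case.
    """
    return " ".join(
        sentence
        for sentence, verdict in labelled
        if verdict.strip().upper() == "TRUE"
    )
```

For example, `fill_prompt(TEMPLATE, rna_id="CTBP1-DT", context="...")` yields a prompt with both placeholders substituted, and `drop_false_sentences` applied to four labelled sentences, two TRUE and two FALSE, returns only the two validated ones.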
