LitSumm: large language models for literature summarization of noncoding RNAs

Andrew Green¹, Carlos Eduardo Ribas¹, Nancy Ontiveros-Palacios¹, Sam Griffiths-Jones², Anton I Petrov³, Alex Bateman¹, Blake Sweeney¹

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK.
² School of Biological Sciences, Faculty of Medicine, Biology and Health, Michael Smith Building, The University of Manchester, Manchester M13 9NT, UK.
³ Riboscope Ltd, 23 King St, Cambridge CB1 1AH, UK.

PMID: 39908113
PMCID: PMC11833236
DOI: 10.1093/database/baaf006

Andrew Green et al. Database (Oxford). 2025 Feb 5:2025:baaf006. doi: 10.1093/database/baaf006.

Abstract

Curation of literature in the life sciences is a growing challenge. The continued increase in the rate of publication, coupled with the relatively fixed number of curators worldwide, presents a major challenge to developers of biomedical knowledgebases. Very few knowledgebases have the resources to scale to the whole relevant literature, and all have to prioritize their efforts. In this work, we take a first step toward alleviating the lack of curator time in RNA science by generating summaries of literature for noncoding RNAs (ncRNAs) using large language models (LLMs). We demonstrate that high-quality, factually accurate summaries with accurate references can be automatically generated from the literature using a commercial LLM and a chain of prompts and checks. Manual assessment was carried out for a subset of summaries, and the majority were rated as extremely high quality. We apply our tool to a selection of >4600 ncRNAs and make the generated summaries available via the RNAcentral resource. We conclude that automated literature summarization is feasible with the current generation of LLMs, provided that careful prompting and automated checking are applied. Database URL: https://rnacentral.org/.

Conflict of interest statement

A.B. is an editor at DATABASE but was not involved in the editorial process of this manuscript.

Figures

**Figure 1.**
(a) The initial prompt used to generate a first-pass summary from the generated context. Variables are enclosed in {} and are replaced with their values before sending the prompt to the LLM. (b) Prompts used for the self-consistency checking stage including inaccurate statement detection and revision. All prompts are reproduced as plain text in Appendix A in the Supplementary material.

**Figure 2.**
A flow diagram of the whole LitSumm tool. Information from the EuropePMC API flows from left to right, through a sentence selection step and then several rounds of self-checking and refinement. Finished summaries are written to disk before being uploaded to the RNAcentral database en masse.

**Figure 3.**
The distribution of RNA types selected for summarization.

**Figure 4.**
Example summary generated by the tool. This example is an lncRNA; examples for other RNA types can be found in Appendix C of the Supplementary material.

**Figure 5.**
Example output of the veracity checker. In this case, CTBP1-DT presents two sentences validated as TRUE and two FALSE sentences. The offending sentences have been removed by the model in the final summary.

**Figure 6.**
The average rating per summary across all raters. Note that Rater 3 gave scores only for a subset of 21 miRNAs.
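
The variable substitution described for Figure 1 and the removal of sentences flagged FALSE described for Figure 5 can be illustrated with a minimal sketch. It is not the LitSumm implementation: the template text, the function names, and the `(sentence, is_true)` data shape are all assumptions made for illustration, and the actual prompts are those reproduced in Appendix A of the Supplementary material. In the tool itself the revision is carried out by the model, whereas this sketch simply filters sentences.

```python
from typing import Dict, List, Tuple

# Hypothetical template for illustration only; variables are enclosed in {}
# as in the Figure 1 caption. The real prompt text is in Appendix A.
SUMMARY_TEMPLATE = (
    "Summarize what is known about {rna_id} using only this context:\n{context}"
)


def fill_prompt(template: str, variables: Dict[str, str]) -> str:
    """Replace each {variable} in the template with its value."""
    return template.format(**variables)


def drop_false_sentences(checked: List[Tuple[str, bool]]) -> str:
    """Keep only sentences a checker marked TRUE (assumed data shape).

    Simplification: the tool asks the model to revise the summary,
    while this function just removes the flagged sentences.
    """
    return " ".join(sentence for sentence, is_true in checked if is_true)


if __name__ == "__main__":
    prompt = fill_prompt(
        SUMMARY_TEMPLATE,
        {"rna_id": "CTBP1-DT", "context": "Sentence one. Sentence two."},
    )
    print(prompt)
    print(drop_false_sentences([("A.", True), ("B.", False), ("C.", True)]))
```

Filling the template as a separate step keeps the prompt text reviewable on its own, which fits the caption's description of values being substituted before the prompt is sent to the LLM.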
