InterPro: the protein sequence classification resource in 2025

Matthias Blum¹, Antonina Andreeva¹, Laise Cavalcanti Florentino¹, Sara Rocio Chuguransky¹, Tiago Grego¹, Emma Hobbs¹, Beatriz Lazaro Pinto¹, Ailsa Orr¹, Typhaine Paysan-Lafosse¹, Irina Ponamareva¹, Gustavo A Salazar¹, Nicola Bordin², Peer Bork³, Alan Bridge⁴, Lucy Colwell⁵, Julian Gough⁶, Daniel H Haft⁷, Ivica Letunic⁸, Felipe Llinares-López⁹, Aron Marchler-Bauer⁷, Laetitia Meng-Papaxanthos¹⁰, Huaiyu Mi¹¹, Darren A Natale¹², Christine A Orengo², Arun P Pandurangan⁶, Damiano Piovesan¹³, Catherine Rivoire⁴, Christian J A Sigrist⁴, Narmada Thanki⁷, Françoise Thibaud-Nissen⁷, Paul D Thomas¹¹, Silvio C E Tosatto¹³,¹⁴, Cathy H Wu¹², Alex Bateman¹

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK.
² Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK.
³ European Molecular Biology Laboratory, Structural and Computational Biology Unit, Meyerhofstraße 1, 69117 Heidelberg, Germany.
⁴ Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva, Switzerland.
⁵ Google DeepMind, Cambridge, MA 02142, USA.
⁶ Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK.
⁷ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
⁸ Biobyte Solutions GmbH, Bothestr 142, 69126 Heidelberg, Germany.
⁹ Google DeepMind, 75009 Paris, France.
¹⁰ Google DeepMind, 8002 Zürich, Switzerland.
¹¹ Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA.
¹² Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA.
¹³ Department of Biomedical Sciences, University of Padova, Padova 35121, Italy.
¹⁴ Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari 70126, Italy.

PMID: 39565202
PMCID: PMC11701551
DOI: 10.1093/nar/gkae1082

Nucleic Acids Res. 2025 Jan 6;53(D1):D444-D456.

Abstract

InterPro (https://www.ebi.ac.uk/interpro) is a freely accessible resource for the classification of protein sequences into families. It integrates predictive models, known as signatures, from multiple member databases to classify sequences into families and predict the presence of domains and significant sites. The InterPro database provides annotations for over 200 million sequences, ensuring extensive coverage of UniProtKB, the standard repository of protein sequences, and includes mappings to several other major resources, such as Gene Ontology (GO), Protein Data Bank in Europe (PDBe) and the AlphaFold Protein Structure Database. In this publication, we report on the status of InterPro (version 101.0), detailing new developments in the database, associated web interface and software. Notable updates include the increased integration of structures predicted by AlphaFold and the enhanced description of protein families using artificial intelligence. Over the past two years, more than 5000 new InterPro entries have been created. The InterPro website now offers access to 85 000 protein families and domains from its member databases and serves as a long-term archive for retired databases. InterPro data, software and tools are freely available.

Figures

**Figure 1.**
InterPro coverage of UniProtKB. (A) InterPro coverage of UniProtKB sequences alongside the growth of UniProtKB over time (January 2014–July 2024). (B) InterPro coverage of amino acid residues in UniProtKB, categorised as follows: residues covered by signatures already integrated into InterPro, signatures from member databases that are awaiting integration, intrinsically disordered regions and regions predicted to be signal peptides, transmembrane domains, or coiled-coils. The remaining residues are classified as unannotated.
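
As an illustration of the residue-level coverage plotted in panel B, the sketch below computes the fraction of a sequence covered by at least one annotation. It is a minimal example under stated assumptions, not the procedure used to produce the figure: the function names `covered_residues` and `residue_coverage` are hypothetical, intervals are assumed to be 1-based and inclusive, and no distinction is made between the annotation categories listed in the caption.

```python
def covered_residues(intervals):
    """Count distinct residues covered by 1-based, inclusive (start, end) intervals.

    Overlapping or directly adjacent intervals are merged so that no residue
    is counted twice.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # A gap separates this interval from the running block: close the block.
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def residue_coverage(sequence_length, intervals):
    """Fraction of residues in a sequence covered by at least one interval."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return covered_residues(intervals) / sequence_length
```

Calling `residue_coverage` once per annotation category, on a sequence of known length, would give per-category fractions comparable in spirit to those in the panel.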

**Figure 2.**
Example of InterPro entry IPR053140, automatically generated using AI annotations for PANTHER family PTHR43784. The entry features an ‘AI’ label next to the entry name, short name and description to indicate its AI-generated origin. In this instance, a curator has reviewed and updated the description and added a reference to a relevant scientific publication as supporting evidence. Users are encouraged to use the ‘Provide feedback’ button to report any inaccuracies or suggest improvements.

**Figure 3.**
Percentages of pairs where one or another model response was preferred by the InterPro curators.

**Figure 4.**
InterPro annotations for the human PIK3CA protein (UniProtKB accession P42336). The ‘Representative Domains’ track displays selected domains from InterPro member databases, chosen to maximise coverage and minimise overlap, providing users with a comprehensive overview of the sequence's domain organisation. The ‘Domain’ section includes InterPro entries classified as domains, along with the underlying member database signatures that match the protein sequence. The ‘Unintegrated’ section lists annotations from member database signatures that have not yet been integrated into an InterPro entry.
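
The idea of choosing domains to maximise coverage while minimising overlap can be sketched with a simple greedy heuristic. The code below is an illustrative assumption, not InterPro's actual selection procedure, which this caption does not specify: the `Domain` class, the `select_representatives` function and the `max_overlap_fraction` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    source: str  # member database name, e.g. "Pfam" or "CDD"
    name: str    # signature name or accession
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap(a: Domain, b: Domain) -> int:
    """Number of residues shared by two domains."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def select_representatives(domains, max_overlap_fraction=0.3):
    """Greedily keep the longest domains that do not overlap too much.

    Candidates are visited from longest to shortest (favouring coverage).
    A candidate is kept only if its overlap with every already-kept domain
    is at most `max_overlap_fraction` of the shorter of the two domains.
    The result is returned in sequence order.
    """
    selected = []
    for candidate in sorted(domains, key=lambda d: d.length, reverse=True):
        if all(
            overlap(candidate, kept)
            <= max_overlap_fraction * min(candidate.length, kept.length)
            for kept in selected
        ):
            selected.append(candidate)
    return sorted(selected, key=lambda d: d.start)
```

A greedy pass of this kind is easy to reason about but is not guaranteed to find the optimal set; a production implementation might also weigh member database priority or match scores.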

**Figure 5.**
N345K variant in the context of the human PIK3CA protein (UniProtKB accession P42336). The top track illustrates the organisation of Pfam domains, while the bottom track highlights representative domains, with the CDD C2 domain selected over the Pfam C2 domain. Although the N345K mutation, known to reside within the C2 domain, appears outside the Pfam C2 domain, it is located within the boundaries of the CDD C2 domain.

**Figure 6.**
Wall clock time (minutes) consumed by individual member databases to annotate the human proteome using InterProScan with the match lookup disabled. Significant disparities in processing times are evident across the databases, highlighting the computational demands of specific analyses.

**Figure 7.**
Comparison of annotation coverage between Pfam and Pfam-N across key species and model organisms. For each species, Pfam-N consistently shows higher annotation coverage compared to Pfam, highlighting its enhanced ability to annotate a larger proportion of protein sequences, thereby providing more comprehensive functional information across these organisms.
