Nucleic Acids Res. 2021 Jan 8;49(D1):D412-D419.

doi: 10.1093/nar/gkaa913.

Pfam: The protein families database in 2021

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK.
² Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden.
³ Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy.

PMID: 33125078
PMCID: PMC7779014
DOI: 10.1093/nar/gkaa913

Jaina Mistry et al. Nucleic Acids Res. 2021.

Abstract

The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.

Figures

**Figure 1.**
Growth of UniProtKB and its coverage by Pfam over the last five Pfam releases. As UniProtKB grows in size, the Pfam sequence and residue coverage are maintained at ∼77% and ∼53%, respectively. The UniProtKB size in the figure corresponds to the version of UniProtKB we used for each Pfam release.

**Figure 2.**
Schematic representation of Pfam coverage of the SARS-CoV-2 proteome. The top row of boxes represents the individual virus proteins processed from the precursor polyproteins. These boxes are coloured when they contain more than one Pfam domain, and the individual Pfam entries are expanded below.

**Figure 3.**
The structure of NSP15 (PDB ID: 6VWW) from Kim *et al.* shows the three new Pfam domains. (i) CoV_NSP15_N (Pfam:PF19219) Coronavirus replicase NSP15, N-terminal oligomerization domain in red, (ii) CoV_NSP15_M (Pfam:PF19216) Coronavirus replicase NSP15, middle domain in blue and (iii) CoV_NSP15_C (Pfam:PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.

**Figure 4.**
Percent of residues in the seed alignment of Pfam entries that are low complexity or disordered as predicted by segmasker and MobiDB-lite, respectively.

**Figure 5.**
Pfam coverage of repeat regions in UniProtKB entries from RepeatsDB. Three examples are shown, represented by their PDB structures. On the top left, PDB ID: 4ffb, chain C, mapping to a region of the HEAT repeats in UniProtKB:P46675 (residues 1–272), with Pfam coverage 0%. In the centre, the Ankyrin region of UniProtKB:P46531, PDB ID: 6py8, chain K (residues 1759–2127), with Pfam coverage 89.4%. Three Pfam domains are detected: Pfam:PF12796 in yellow, Pfam:PF13637 in orange and Pfam:PF00023 in red. On the top right, PDB ID: 3ur4, chain A, mapping to the β-propeller UniProtKB:P61964 (residues 24–334), with Pfam coverage 93%. Only one type of Pfam domain is detected (Pfam:PF00400), shown in alternating shades of blue to facilitate the visualization of the Pfam model phase.
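
The coverage figures quoted in the captions for Figures 1 and 5 can be illustrated with a minimal sketch. It assumes that sequence coverage is the percentage of sequences with at least one Pfam match, that residue coverage is the percentage of residues falling inside at least one match, and that the coverage of a repeat region is the percentage of its residues inside a match. These definitions, the function names and the toy data are assumptions made for illustration; they are not taken from the Pfam pipeline.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive residue range


def covered_residues(matches: List[Interval], region: Interval) -> int:
    """Count residues of `region` that lie inside at least one match."""
    start, end = region
    # Keep only matches that overlap the region and clip them to its bounds.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in matches
        if s <= end and e >= start
    )
    total = 0
    cursor = start - 1  # last residue already counted
    for s, e in clipped:
        if e <= cursor:
            continue  # fully inside residues counted earlier
        total += e - max(s, cursor + 1) + 1
        cursor = e
    return total


def database_coverage(
    lengths: Dict[str, int], matches: Dict[str, List[Interval]]
) -> Tuple[float, float]:
    """Return (sequence coverage, residue coverage) as percentages."""
    sequences_hit = sum(1 for acc in lengths if matches.get(acc))
    residues_hit = sum(
        covered_residues(matches.get(acc, []), (1, length))
        for acc, length in lengths.items()
    )
    return (
        100 * sequences_hit / len(lengths),
        100 * residues_hit / sum(lengths.values()),
    )


def region_coverage(matches: List[Interval], region: Interval) -> float:
    """Percentage of a region (e.g. a repeat region) covered by matches."""
    start, end = region
    return 100 * covered_residues(matches, region) / (end - start + 1)


# Hypothetical toy data: three sequences, two of which have matches.
toy_lengths = {"A": 100, "B": 50, "C": 50}
toy_matches = {"A": [(1, 40), (30, 60)], "B": [(11, 20)]}
seq_cov, res_cov = database_coverage(toy_lengths, toy_matches)
```

Overlapping matches are merged before counting, so a residue covered by two entries is counted once. Under these assumptions, a region with no overlapping match gets 0% coverage, as reported for the HEAT repeat example in Figure 5.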

Grants and funding

108433/Z/15/Z/WT_/Wellcome Trust/United Kingdom
