Nucleic Acids Res. 2021 Jan 8;49(D1):D412-D419.
doi: 10.1093/nar/gkaa913.

Pfam: The protein families database in 2021

Jaina Mistry et al. Nucleic Acids Res.

Abstract

The Pfam database is a widely used resource for classifying protein sequences into families and domains. Since Pfam was last described in this journal, over 350 new families have been added in Pfam 33.1 and numerous improvements have been made to existing entries. To facilitate research on COVID-19, we have revised the Pfam entries that cover the SARS-CoV-2 proteome, and built new entries for regions that were not covered by Pfam. We have reintroduced Pfam-B which provides an automatically generated supplement to Pfam and contains 136 730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on a clustering by the MMseqs2 software. We have compared all of the regions in the RepeatsDB to those in Pfam and have started to use the results to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.

Figures

Figure 1.
Growth of UniProtKB, and its coverage by Pfam over the last five Pfam releases. As UniProtKB grows in size, the Pfam sequence and residue coverage is maintained at ∼77 and ∼53%, respectively. The UniProtKB size in the figure corresponds to the version of UniProtKB we used for each Pfam release.
Figure 2.
Schematic representation of Pfam coverage of the SARS-CoV-2 proteome. The top row of boxes represents the individual virus proteins processed from the precursor polyproteins. These boxes are coloured when they contain more than one Pfam domain, and the individual Pfam entries are expanded below.
Figure 3.
The structure of NSP15 (PDB ID: 6VWW) from Kim et al. shows the three new Pfam domains. (i) CoV_NSP15_N (Pfam:PF19219) Coronavirus replicase NSP15, N-terminal oligomerization domain in red, (ii) CoV_NSP15_M (Pfam:PF19216) Coronavirus replicase NSP15, middle domain in blue and (iii) CoV_NSP15_C (Pfam:PF19215) Coronavirus replicase NSP15, uridylate-specific endoribonuclease in green.
Figure 4.
Percent of residues in the seed alignment of Pfam entries that are low complexity or disordered as predicted by segmasker and MobiDB-lite, respectively.
Figure 5.
Pfam coverage of repeat regions in UniProtKB entries from RepeatsDB. Three examples are shown represented by their PDB structures. On the top left PDB ID: 4ffb, chain C, mapping to a region of the HEAT repeats in UniProtKB:P46675 (residues 1–272), with Pfam coverage 0%. In the centre, the Ankyrin region of UniProtKB:P46531, PDB ID: 6py8, chain K (residues 1759–2127), with Pfam coverage 89.4%. Three Pfam domains are detected: Pfam:PF12796 in yellow, Pfam:PF13637 in orange and Pfam:PF00023 in red. On the top right, PDB ID: 3ur4, chain A, mapping to the β-propeller UniProtKB:P61964 (residues 24–334), with Pfam coverage 93%. Only one type of Pfam domain is detected (Pfam:PF00400), shown in alternating shades of blue to facilitate the visualization of the Pfam model phase.
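
The per-region coverage figures quoted in the Figure 5 caption can be read as the share of residues in a region that fall inside at least one Pfam match. The following is a minimal sketch of that arithmetic, not the authors' pipeline: the function name `region_coverage`, the 1-based inclusive coordinate convention, and the example intervals are all assumptions made for illustration, and the intervals are not real Pfam matches.

```python
def region_coverage(region, domains):
    """Return the fraction of residues in `region` covered by any domain.

    `region` is a (start, end) pair and `domains` is a list of (start, end)
    pairs. All coordinates are assumed to be 1-based and inclusive.
    """
    start, end = region
    if end < start:
        raise ValueError("region end must not be smaller than region start")

    # Keep only domains that overlap the region, clipped to its bounds,
    # and sort them so overlapping intervals can be merged in one pass.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in domains
        if e >= start and s <= end
    )

    covered = 0
    last_counted = start - 1  # last residue already counted as covered
    for s, e in clipped:
        if e <= last_counted:
            continue  # fully inside an interval that was already counted
        covered += e - max(s, last_counted + 1) + 1
        last_counted = e

    return covered / (end - start + 1)


if __name__ == "__main__":
    # Hypothetical intervals, chosen only to exercise overlap and clipping.
    toy_region = (1, 100)
    toy_domains = [(1, 50), (40, 60), (90, 120)]
    print(f"{region_coverage(toy_region, toy_domains):.1%}")
```

Overlapping matches are counted once, so a residue hit by two domains does not inflate the total. Under the same convention, the Ankyrin region quoted above (residues 1759–2127) spans 369 residues.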

