Nucleic Acids Res. 2017 Jan 4;45(D1):D170-D176.
doi: 10.1093/nar/gkw1081. Epub 2016 Nov 28.

Uniclust databases of clustered and deeply annotated protein sequences and alignments

Milot Mirdita et al. Nucleic Acids Res. 2017.

Abstract

We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Owing to the high sensitivity of HHblits, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.
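
To illustrate what clustering at a pairwise sequence identity threshold means, the following is a minimal toy sketch. It is not the MMseqs2 pipeline described in the abstract. It assumes sequences of equal length that are already aligned column by column, and it uses a simple greedy assignment to the first matching representative; the function names `identity` and `greedy_cluster` are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences.

    Assumption: the sequences are already aligned and contain no gaps.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, min_identity):
    """Toy greedy clustering (an illustration, not the Uniclust method).

    Each sequence joins the first representative it matches at or above
    min_identity; otherwise it becomes a new representative.
    Returns a dict mapping representative -> list of members.
    """
    clusters = {}
    for s in seqs:
        for rep in clusters:
            if identity(rep, s) >= min_identity:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    toy = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAACCCCC", "CCCCCCCCCC"]
    for threshold in (0.9, 0.5):
        print(threshold, greedy_cluster(toy, threshold))
```

In this toy input, lowering the threshold merges more sequences into fewer clusters, which mirrors the relationship between the 90%, 50% and 30% levels.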

Figures

Figure 1.
Visualization of a cluster showing the multiple sequence alignment with domain annotations, the taxonomic tree of the species of the cluster's member sequences, a summary of sequence annotation keywords, and protein evidence values.
Figure 2.
(A) Sequence identities averaged over all clusters of Uniclust30, Uniclust50, Uniclust90, UniRef50 and UniRef90. We compute the mean and worst sequence identity between all possible pairs of sequences in a cluster. If a cluster contains more than ten sequences, we sample ten sequences to compute the sequence identities. (B) Annotation consistency scores averaged over all clusters of Uniclust30, Uniclust50, Uniclust90, UniRef50 and UniRef90. We compute the mean and worst annotation consistency between the representative sequence and all other cluster members for Gene Ontology annotations (top-left), protein names (top-right), keywords (bottom-left), and the average of the former three (bottom-right). (C) Total Pfam annotation count difference between Uniclust and UniProt. (D) Comparison of the fraction of proteins in ten model organisms with Pfam annotations in Uniclust and in UniProt.
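
The per-cluster statistic in panel (A) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the member sequences are assumed to be rows of one multiple sequence alignment with `-` as the gap character, and identity is assumed to be the fraction of identical residues over columns where both sequences are non-gap. The names `pairwise_identity` and `cluster_identity_stats` are hypothetical.

```python
import itertools
import random


def pairwise_identity(a, b):
    """Identity over aligned columns where neither sequence has a gap.

    Assumption: a and b are rows of the same alignment ('-' marks a gap).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def cluster_identity_stats(seqs, max_sample=10, rng=None):
    """Return (mean, worst) pairwise identity for one cluster.

    If the cluster has more than max_sample members, a random sample of
    max_sample members is used, as the caption describes for ten sequences.
    Returns None for clusters with fewer than two members.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    if len(seqs) > max_sample:
        seqs = rng.sample(seqs, max_sample)
    identities = [
        pairwise_identity(a, b) for a, b in itertools.combinations(seqs, 2)
    ]
    if not identities:
        return None
    return sum(identities) / len(identities), min(identities)


if __name__ == "__main__":
    mean_id, worst_id = cluster_identity_stats(["ACDE", "ACDF", "AC-E"])
    print(f"mean={mean_id:.3f} worst={worst_id:.3f}")
```

Averaging the returned mean and worst values over all clusters of a database would give numbers comparable in spirit to those plotted, subject to the identity definition assumed here.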
