Nucleic Acids Res. 2024 Jan 5;52(D1):D791-D797.
doi: 10.1093/nar/gkad1039.

The UNITE database for molecular identification and taxonomic communication of fungi and other eukaryotes: sequences, taxa and classifications reconsidered

Kessy Abarenkov et al. Nucleic Acids Res.

Abstract

UNITE (https://unite.ut.ee) is a web-based database and sequence management environment for molecular identification of eukaryotes. It targets the nuclear ribosomal internal transcribed spacer (ITS) region and offers nearly 10 million such sequences for reference. These are clustered into ∼2.4M species hypotheses (SHs), each assigned a unique digital object identifier (DOI) to promote unambiguous referencing across studies. UNITE users have contributed over 600 000 third-party sequence annotations, which are shared with a range of databases and other community resources. Recent improvements facilitate the detection of cross-kingdom biological associations and the integration of undescribed groups of organisms into everyday biological pursuits. Serving as a digital twin for eukaryotic biodiversity and communities worldwide, the latest release of UNITE offers improved avenues for biodiversity discovery, precise taxonomic communication and integration of biological knowledge across platforms.

Figures

Graphical Abstract
Figure 1.
Diagram of the UNITE SH 9.0 calculation steps. The sequences are dereplicated using VSEARCH, and sequences that do not represent the full ITS region according to ITSx are dismissed. Following quality filtering, a series of successive clustering steps is carried out, generating subsets of 500 000 (500k) and 30 000 (30k) sequences and selecting core representative sequences (cRepS). This yields what are termed ‘compound clusters’, which are sequence clusters roughly at the genus/subgenus level. These are further clustered into species hypotheses (SHs). All clustering steps in the SH calculation workflow are performed using the USEARCH tool. The similarity thresholds (97%−95%−90%−80%) for the nested pre-clustering (5c, 6) were chosen to yield clusters at approximately the genus/subgenus level. A dissimilarity threshold (0.5%) for the complete-linkage clustering (5d) was selected to trim the dataset of closely related sequences around the core representative sequences. The core representative sequences then undergo the final single-linkage clustering at dissimilarity thresholds from 0.5% to 3.0% in 0.5% steps. These dissimilarity thresholds were selected as the ones most commonly applied in species delimitation and sequence identification. For each SH, a representative sequence is selected, either automatically or based on prior manual curation. The species hypotheses are aligned to form the final SH datasets.
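
The final single-linkage step described in the Figure 1 caption can be illustrated with a small sketch. This is not the UNITE workflow itself, which uses USEARCH; it is a plain-Python union-find version run on a made-up dissimilarity table, intended only to show how the same sequences merge into fewer clusters as the threshold rises from 0.5% to 3.0%. The function name `single_linkage`, the sequence identifiers, the distance values, and the choice of "less than or equal to" as the linking rule are all assumptions made for illustration.

```python
from itertools import combinations


def single_linkage(ids, dist, threshold):
    """Group ids so that any pair with dist <= threshold shares a cluster.

    Single linkage is transitive: if a links to b and b links to c,
    all three end up together even when a and c are far apart.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        if dist[frozenset((a, b))] <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise dissimilarities, in percent.
ids = ["s1", "s2", "s3", "s4"]
dist = {
    frozenset(("s1", "s2")): 0.4,
    frozenset(("s1", "s3")): 1.2,
    frozenset(("s1", "s4")): 5.0,
    frozenset(("s2", "s3")): 0.9,
    frozenset(("s2", "s4")): 5.2,
    frozenset(("s3", "s4")): 2.8,
}

# Thresholds from the caption: 0.5% to 3.0% in 0.5% steps.
THRESHOLDS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

results = {t: single_linkage(ids, dist, t) for t in THRESHOLDS}

for t in THRESHOLDS:
    print(f"{t:.1f}%: {results[t]}")
```

In this toy data, s1 and s3 are 1.2% apart yet already share a cluster at the 1.0% threshold, because both link to s2. This chaining behaviour is characteristic of single-linkage clustering.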
Figure 2.
The number of species hypotheses at 1.0% and 1.5% between-species distance threshold through the four latest major versions of UNITE. Each SH is assigned a unique DOI every time the SHs are recomputed, and a versioning system keeps track of DOI names and contents over time, allowing users to follow how individual SHs are populated with sequences over time.
Figure 3.
(A) Treemap of the most abundant taxa (kingdom and phylum) based on the taxonomy of UNITE SHs at the 1.0% between-species distance threshold. (B) The number of UNITE SHs at the 1.0% distance threshold versus the number of species names per fungal phylum in the Catalogue of Life (CoL) checklist from 2023-06-29.
