Microbiol Spectr. 2024 Apr 2;12(4):e0358223.
doi: 10.1128/spectrum.03582-23. Epub 2024 Mar 15.

Annotation of 2,507 Saccharomyces cerevisiae genomes

Meng Wang et al. Microbiol Spectr.

Erratum in

Abstract

Saccharomyces cerevisiae (baker's yeast, budding yeast) is one of the most important model organisms for biological research and a crucial microorganism in industry. A huge number of Saccharomyces cerevisiae genome sequences are currently available in the public domain. However, these genomes are scattered across different websites, and a large number of them have been released without annotation information. To provide one complete annotated genome data resource, we collected 2,507 Saccharomyces cerevisiae genome assemblies and re-annotated 2,506 of them using a custom annotation pipeline, producing a total of 15,407,164 protein-coding gene models. All of these gene sequences were then clustered into families with a custom pipeline. A total of 1,506 single-copy genes were selected as marker genes and used to evaluate the genome completeness and base quality of all assemblies. Pangenomic analyses were performed on a selected subset of 847 medium-high-quality genomes. Statistical comparisons revealed a number of gene families showing copy number variations among different organism sources. To the authors' knowledge, this study represents the largest genome annotation project of S. cerevisiae so far, providing rich genomic resources for future studies of the model organism S. cerevisiae and its relatives.

IMPORTANCE: Saccharomyces cerevisiae (baker's yeast, budding yeast) is one of the most important model organisms for biological research and a crucial microorganism in industry. Although a huge number of Saccharomyces cerevisiae genome sequences are available in the public domain, these genomes are scattered across different websites and most have been released without annotation, which hinders the efficient reuse of these genome resources. Here, we collected 2,507 genomes of Saccharomyces cerevisiae, performed genome annotation, and evaluated genome quality. All of the data obtained have been deposited in public repositories and are freely accessible to the community. This study represents the largest genome annotation project of S. cerevisiae so far, providing one complete annotated genome data set for S. cerevisiae, an important workhorse for fundamental biology, biotechnology, and industry.
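
The abstract mentions selecting single-copy genes as markers but does not give the selection rule. The following is a minimal sketch under an assumed rule: a family qualifies if it has exactly one member in at least a chosen fraction of genomes. The function name `single_copy_families`, the `min_fraction` parameter, and the toy data are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def single_copy_families(memberships, genomes, min_fraction=1.0):
    """Return family IDs that are single-copy in enough genomes.

    memberships: iterable of (family_id, genome_id) pairs, one pair per gene.
    genomes: list of all genome IDs under consideration.
    min_fraction: assumed threshold; fraction of genomes in which the
        family must have exactly one member.
    """
    if not genomes:
        raise ValueError("genomes must not be empty")

    # Count how many genes each genome contributes to each family.
    counts = defaultdict(lambda: defaultdict(int))
    for family, genome in memberships:
        counts[family][genome] += 1

    selected = []
    for family, per_genome in counts.items():
        # Number of genomes holding exactly one copy of this family.
        single = sum(1 for g in genomes if per_genome.get(g, 0) == 1)
        if single / len(genomes) >= min_fraction:
            selected.append(family)
    return sorted(selected)


# Toy data: three genomes (A, B, C) and three gene families.
genes = [
    ("F1", "A"), ("F1", "B"), ("F1", "C"),   # one copy everywhere
    ("F2", "A"), ("F2", "A"), ("F2", "B"),   # duplicated in A, absent in C
    ("F3", "A"), ("F3", "B"),                # absent in C
]
genomes = ["A", "B", "C"]

markers = single_copy_families(genes, genomes)
relaxed = single_copy_families(genes, genomes, min_fraction=0.6)
print(markers, relaxed)
```

With the strict default only `F1` qualifies; lowering `min_fraction` tolerates families that are missing from a few incomplete assemblies, which may matter when many assemblies are fragmented.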

Keywords: Saccharomyces cerevisiae; annotation; genome.

Conflict of interest statement

Xiaoping Hou, Yang He, Jun-Hong Yu, Shumin Hu, and Hua Yin are employed by Tsingtao Brewery Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Fig 1
Overall flowchart of this study. Data shown in red have been deposited at figshare and Zenodo. Code for the annotation pipeline and cluster pipeline (blue) has been deposited at GitHub.
Fig 2
Benchmarking the genome annotation pipeline. The S288c genome has been annotated with the annotation pipeline in three modes: the complete mode (a,b), step 1 only mode (c,d), and step 2 only mode (e,f). The obtained models were compared with the reference models.
Fig 3
Genome sizes and gene models of the genome assemblies in this study. (a) The relationship between the gene number and the genome size. (b) The relationship between the maker-models and the blastn-models. (c) The relationship between the gene density and the mean contig length. (d) The relationship between the mean gene length and the mean contig length. (e) The distribution of the mean gene length. (f) The relationship between the mean gene length and the maker-model ratio.
Fig 4
Comparison with gene models obtained in other studies.
Fig 5
Gene families obtained based on different cluster cutoffs. Five cluster cutoffs, 50%, 60%, 70%, 80%, and 90%, were used. Four sets of genomes and the strain S288c genome were analyzed. All genomes, black; non-redundant, red; medium-high-quality, blue; high-quality, magenta; strain S288c, dark green. The number of genomes in each set is indicated in parentheses.
Fig 6
Distribution of gene families in different sets of genome assemblies. (a) All 2,507 assemblies. (b) The non-redundant set, including 2,116 strains. (c) The medium-high-quality set, including 847 strains. (d) The high-quality set, including 117 strains.
Fig 7
Evaluation of the genome completeness based on the marker gene set. (a) The distribution of completeness for all assemblies. (b) The relationship between the average homolog number in the marker gene families and the total gene number in the genome.
Fig 8
Evaluation of the base quality of the genomes. (a) Sequence identities of the marker genes between the high-quality set of genomes and the reference S288c. (b and c) The relationship between the marker gene sequence identities and the genome completeness for all assemblies except the high-quality set. For marker gene families containing multiple homologs from one assembly, the homolog with the highest identity was analyzed.
Fig 9
Pangenomic analyses based on the medium-high-quality set. (a–c) The total number of extended core gene families (a), character gene families (b), and accessory gene families (c) found in the medium-high-quality set. (d–f) The average number of gene families (blue) and sequences (orange) for extended core genes (d), character genes (e), and accessory genes (f) found in each medium-high-quality genome.
Fig 10
Gene families related to the environments/biotechnological potentials, geographic locations, and phylogenetic positions of strains. (a) The numbers of gene families recovered by different grouping schemes. (b) The numbers of gene families obtained by randomized grouping. (c) Venn diagram of gene families obtained by different schemes.
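
The marker-gene evaluation summarized in the captions of Figs 7 and 8 can be illustrated with a small sketch. It assumes that completeness is the fraction of marker families with at least one homolog in an assembly; the exact definition is not stated on this page. It also applies the rule from the Fig 8 caption of keeping only the highest-identity homolog per marker family. All function names and toy values below are hypothetical.

```python
def completeness(marker_families, families_in_assembly):
    """Fraction of marker families found at least once in an assembly.

    This definition is an assumption made for illustration only.
    """
    markers = set(marker_families)
    if not markers:
        raise ValueError("marker_families must not be empty")
    return len(markers & set(families_in_assembly)) / len(markers)


def best_identity_per_family(hits):
    """Keep the highest identity per marker family.

    hits: iterable of (family_id, identity_percent) pairs for one assembly,
        where a family may appear several times (multiple homologs).
    """
    best = {}
    for family, identity in hits:
        if family not in best or identity > best[family]:
            best[family] = identity
    return best


# Toy example: four marker families, one assembly.
markers = ["M1", "M2", "M3", "M4"]
assembly_families = ["M1", "M2", "M4", "X9"]  # M3 missing, X9 is not a marker
score = completeness(markers, assembly_families)

# M1 has two homologs in this assembly; only the better one is kept.
hits = [("M1", 99.8), ("M1", 97.2), ("M2", 100.0), ("M4", 98.5)]
best = best_identity_per_family(hits)
print(score, best)
```

Keeping the best hit per family prevents a divergent paralog from lowering the apparent base quality of an otherwise accurate assembly.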
