Nucleic Acids Res. 2017 Mar 17;45(5):2629-2643.
doi: 10.1093/nar/gkx006.

Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis

Yafeng Zhu et al. Nucleic Acids Res. 2017.

Abstract

Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotation. Through long-read DNA sequencing, we obtained a gap-free genome assembly for M. sympodialis (ATCC 42132), comprising eight nuclear chromosomes and one mitochondrial chromosome. We also sequenced and assembled four M. sympodialis clinical isolates, and showed their value for understanding Malassezia reproduction by confirming four alternative allele combinations at the two mating-type loci. Importantly, we demonstrated how proteomics data could be readily integrated with transcriptomics data in standard annotation tools. This increased the number of annotated protein-coding genes by 14% (from 3612 to 4113), compared to using transcriptomics evidence alone. Manual curation further increased the number of protein-coding genes by 9% (to 4493). All of these genes have RNA-seq evidence and 87% were confirmed by proteomics. The M. sympodialis genome assembly and annotation presented here is of a quality achieved so far for only a few eukaryotic organisms, and constitutes an important reference for future host-microbe interaction studies.
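
The two percentage increases quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not part of the authors' pipeline; the function and variable names are hypothetical, and the gene counts are the ones stated in the abstract. Each increase is taken relative to the preceding annotation set, which is the reading that reproduces the quoted 14% and 9%.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Gene counts as stated in the abstract (variable names are illustrative).
rna_seq_only = 3612        # transcriptomics evidence alone
rna_seq_plus_ms = 4113     # transcriptomics plus proteomics evidence
manually_curated = 4493    # after manual curation

# Increase from adding proteomics evidence, relative to RNA-seq alone.
print(round(percent_increase(rna_seq_only, rna_seq_plus_ms)))      # 14
# Further increase from manual curation, relative to the previous set.
print(round(percent_increase(rna_seq_plus_ms, manually_curated)))  # 9
```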

Figures

Figure 1.
Integrative genome annotation workflow. Data from four different sources (long-read DNA sequencing, RNA-seq, MS-based proteomics and Swiss-Prot reviewed proteins) were integrated using an evidence-based genome annotation framework (MAKER). Transcripts were assembled from RNA-seq reads using Trinity and PASA was used to identify likely protein-coding regions to provide gene models for initial gene predictions. Three ab initio gene predictors (GeneMark-ES, Augustus and SNAP) were included in MAKER. Augustus and SNAP were iteratively trained based on MAKER-generated gene models (see Materials and Methods and Supplementary Table S2). The computationally inferred gene structures were manually curated. Shapes are used according to workflow figure standards (rectangles show processes, data are in parallelograms, the trapezoid indicates a manual step and the rounded rectangle represents output).
Figure 2.
Gene annotation facilitated by RNA-seq and peptide evidence. Screenshot from the WebApollo genome annotation editor showing a locus where RNA-seq and peptide evidence improved gene annotation compared to the previous annotation described by Gioti et al. (5). The 5′-UTR and protein-coding segments were identified by the MAKER-based pipeline integrating RNA-seq and peptide data. Manual curation added a 3′-UTR (uppermost track). The colors of exons and peptides indicate reading frame, such that exons and peptides with the same color are in the same reading frame. UTRs are indicated in purple and introns in gray. RNA-seq coverage is shown for the genomic minus strand (i.e. the strand of the annotated gene) and indicates the number of read pairs at each base.
Figure 3.
Increases in coding sequence and intron detection through addition of RNA-seq and proteomics data. Percentages were calculated using the values from the manually curated annotation (total length of coding sequences and total number of introns) as denominators.
Figure 4.
Experimental support for the final set of protein-coding genes. (A) Number of unique peptides per gene. (B) Number of RNA-seq read pairs per gene. (C) Relation between peptide and RNA-seq support. The uppermost curve shows the cumulative distribution of the number of unique peptides per gene, for all protein-coding genes. Genes were additionally stratified by the number of supporting strand-specific RNA-seq read pairs, and the area under the curve colored accordingly (inset legend). To be conservative, read pairs were only counted if uniquely mapped within annotated coding sequences, i.e. reads containing UTRs or other non-coding sequences were excluded. Note the logarithmic scale on the y-axis (A and B) and the x-axis (C).
Figure 5.
Pfam domain content in different annotation sets compared to reference species. The percentage of proteins with Pfam domains in each M. sympodialis annotation was calculated using the total number of genes after manual curation as denominator. The numbers of M. sympodialis proteins with Pfam domains identified from the different annotation sets were 2595 in the Gioti annotation (MAKER with homology evidence) (5), 2647 in the MAKER annotation with homology and RNA-seq evidence, 2903 in the MAKER annotation with homology, RNA-seq and peptide evidence, and 3173 after manual annotation.
Figure 6.
Evidence of multiple mitochondrial genome configurations. The physical map of the mitochondrial DNA (mtDNA) is displayed in a linear form, beginning with the rnl gene. Rectangles indicate genes or exons of highly conserved protein-coding regions (black), ribosomal RNAs (blue) and intron-encoded homing endonuclease genes (grey). The unit-length, monomeric mtDNA contains a large inverted repeat (purple), separated by an intra-repeat region. The intra-repeat and flanking region is shown below, with the position of tRNAs met (M) and his (H) indicated in green. SMRT reads demonstrated that the intra-repeat region exists in two orientations relative to the inverted repeats.
Figure 7.
Phylogeny of the MAT loci and mating type designations of the M. sympodialis isolates. Phylogenetic relationships among the five sequenced M. sympodialis genomes (Table 3) at the HD and PR loci. The allele designation for each genome is shown in brackets. Scale bars indicate the number of substitutions per site. Bootstrap values are based on 1000 replications.
Figure 8.
Phylogeny of the three components of the HD locus. Shown on the left is a schematic diagram of the HD locus. The red dashed rectangle indicates the region where the HD2 allele of isolate KS004 is identical to the HD1 allele (see text). Shown on the right are the phylogenies of the five M. sympodialis genomes (see Table 3) for each of the three components of the HD locus. Scale bars indicate the number of substitutions per site. Bootstrap values are based on 1000 replications.
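
The Figure 5 caption describes a fixed-denominator calculation: every annotation set is divided by the gene total after manual curation, so the percentages are directly comparable across sets. The sketch below illustrates that calculation only. It assumes the denominator is the 4493 protein-coding genes reported in the abstract, which the caption does not state explicitly; the dictionary keys and function name are hypothetical labels, and the plotted values in the paper may differ if a different total was used.

```python
# Assumption: the denominator is the 4493 protein-coding genes that the
# abstract reports after manual curation. The caption does not give it.
CURATED_GENE_TOTAL = 4493

# Counts of proteins with Pfam domains, as listed in the Figure 5 caption.
# The keys are illustrative labels, not names used by the authors.
pfam_counts = {
    "Gioti annotation (homology)": 2595,
    "MAKER (homology + RNA-seq)": 2647,
    "MAKER (homology + RNA-seq + peptides)": 2903,
    "manual curation": 3173,
}


def pfam_percentages(counts: dict[str, int], total: int) -> dict[str, float]:
    """Express each count as a percentage of one shared denominator."""
    return {name: 100.0 * n / total for name, n in counts.items()}


for name, pct in pfam_percentages(pfam_counts, CURATED_GENE_TOTAL).items():
    print(f"{name}: {pct:.1f}%")
```

Using one shared denominator, rather than each set's own gene count, means an annotation set is penalized for genes it misses as well as for genes lacking Pfam domains.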

References

    1. Oh J., Byrd A.L., Deming C., Conlan S., Program N.C.S., Kong H.H., Segre J.A. Biogeography and individuality shape function in the human skin metagenome. Nature. 2014; 514:59–64.
    2. Findley K., Oh J., Yang J., Conlan S., Deming C., Meyer J.A., Schoenfeld D., Nomicos E., Park M., Kong H.H. et al. Topographic diversity of fungal and bacterial communities in human skin. Nature. 2013; 498:367–370.
    3. Gemmer C.M., DeAngelis Y.M., Theelen B., Boekhout T., Dawson T.L. Jr. Fast, noninvasive method for molecular detection and differentiation of Malassezia yeast species on human skin and application of the method to dandruff microbiology. J. Clin. Microbiol. 2002; 40:3350–3357.
    4. Saunders C.W., Scheynius A., Heitman J. Malassezia fungi are specialized to live on skin and associated with dandruff, eczema, and other skin diseases. PLoS Pathog. 2012; 8:e1002701.
    5. Gioti A., Nystedt B., Li W.J., Xu J., Andersson A., Averette A.F., Munch K., Wang X.Y., Kappauf C., Kingsbury J.M. et al. Genomic insights into the atopic eczema-associated skin commensal yeast Malassezia sympodialis. MBio. 2013; 4, doi:10.1128/mBio.00572-12.