BMC Genomics. 2024 Jan 2;25(1):6.
doi: 10.1186/s12864-023-09924-y.

Functional annotation of a divergent genome using sequence and structure-based similarity

Dennis Svedberg et al. BMC Genomics. 2024.

Abstract

Background: Microsporidia are a large taxon of intracellular pathogens characterized by extraordinarily streamlined genomes with unusually high sequence divergence and many species-specific adaptations. These unique factors pose challenges for traditional genome annotation methods based on sequence similarity. As a result, many of the microsporidian genomes sequenced to date contain numerous genes of unknown function. Recent innovations in rapid and accurate structure prediction and comparison, together with the growing amount of data in structural databases, provide new opportunities to assist in the functional annotation of newly sequenced genomes.

Results: In this study, we established a workflow that combines sequence and structure-based functional gene annotation approaches employing a ChimeraX plugin named ANNOTEX (Annotation Extension for ChimeraX), allowing for visual inspection and manual curation. We employed this workflow on a high-quality telomere-to-telomere sequenced tetraploid genome of Vairimorpha necatrix. First, the 3080 predicted protein-coding DNA sequences, of which 89% were confirmed with RNA sequencing data, were used as input. Next, ColabFold was used to create protein structure predictions, followed by a Foldseek search for structural matching to the PDB and AlphaFold databases. The subsequent manual curation, using sequence and structure-based hits, increased the accuracy and quality of the functional genome annotation compared to results using only traditional annotation tools. Our workflow resulted in a comprehensive description of the V. necatrix genome, along with a structural summary of the most prevalent protein groups, such as the ricin B lectin family. In addition, and to test our tool, we identified the functions of several previously uncharacterized Encephalitozoon cuniculi genes.
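As a rough illustration of the structural-matching step, the sketch below keeps the highest-scoring Foldseek hit for each query protein. It assumes Foldseek's default tab-separated output columns (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits). The function name and all protein IDs are hypothetical placeholders, and this is not the ANNOTEX implementation.

```python
import csv
import io


def best_hits(m8_text):
    """Return {query: (target, evalue, bits)}, keeping the highest-bit-score hit per query."""
    best = {}
    for row in csv.reader(io.StringIO(m8_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, target = row[0], row[1]
        evalue, bits = float(row[10]), float(row[11])
        if query not in best or bits > best[query][2]:
            best[query] = (target, evalue, bits)
    return best


# Made-up example rows in the assumed 12-column layout (IDs are placeholders).
example = (
    "VNE_0001\tAF-P12345-F1\t0.21\t300\t200\t5\t1\t300\t4\t310\t1e-20\t450\n"
    "VNE_0001\t1abc_A\t0.18\t280\t210\t6\t3\t290\t1\t285\t1e-12\t300\n"
    "VNE_0002\t2xyz_B\t0.30\t150\t90\t2\t1\t150\t1\t150\t1e-08\t120\n"
)
hits = best_hits(example)
```

A top hit selected this way is only a candidate for the manual curation step described above, where sequence and structure-based hits are compared before a function is assigned.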

Conclusion: We provide a new functional annotation tool for divergent organisms and employ it on a newly sequenced, high-quality microsporidian genome to shed light on this uncharacterized intracellular pathogen of Lepidoptera. The addition of a structure-based annotation approach can serve as a valuable template for studying other microsporidian or similarly divergent species.

Keywords: Functional annotation; Genome; Microsporidia; Polar tube proteins; Ricin B lectins; Structural similarity; Vairimorpha necatrix.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Functional annotation of V. necatrix genes using structure-prediction and sequence-based comparative analyses. a) A phylogenetic tree based on [25] with 24 microsporidian species, and 3 outgroup species plus S. cerevisiae (grey branches). The bar graphs show the respective genome sizes and the number of proteins used (colored) and folded for our structural comparison. b) Schematic pipeline of our structural similarity approach, from protein structure prediction with ColabFold (v1.5.2) to structural matching using Foldseek (v5-53465f0), followed by a manual curation step with ANNOTEX that includes a comparison of sequence and structure-based hits to achieve a high-quality functional annotation
Fig. 2
The annotated genome of V. necatrix. (a) Pie chart summarizing the functional annotation output using a combination of sequence and structure-based hits and experimental data. Compared to ProtNLM or eggNOG (yellow, marked by black dashed lines), our complementary approach improved the genome annotation by an additional 319 final curated gene functions, here shown in yellow. Further, 107 experimentally solved protein structures (black) from PDB are listed as structural matches. 220 genes that have homologs in other microsporidia, but are of unknown function, are presented in dark grey. Light grey represents 928 hypothetical V. necatrix genes that have no matches to the known genes of other microsporidia. (b) Approximate localization of the rDNA genes 16S/23S (blue) and 5S (green) on the 12 chromosomes of the two predominant pseudo-haplotypes 1 (black) and 2 (grey). The inset depicts one rDNA in shades of blue (light blue for the 16S, dark blue for the 23S) and one 5S gene in green. The internal transcribed spacer (ITS) is shown in yellow. (c) Structure-based network of highly abundant protein-fold families encoded by our V. necatrix genome. AlphaFold-predicted protein models were analyzed for structural relatedness in a Foldseek all-against-all search. The structural similarity is represented by the TM score, which is used as the edge measure for the protein network graph generated in Gephi (v0.9.2). Each node represents a protein colored according to its fold family. Proteins with inverted surrounding and filling color compared to the main cluster have an additional common domain besides the one unifying the main cluster, i.e., Clp R domain-containing proteins and actin(-like) proteins. Connecting lines indicate structural relation of proteins, and thicker lines indicate greater structural similarity.
PTP6, polar tube protein 6; RBL, ricin B lectin; MCM, minichromosome maintenance; Serpin-type protein, serine-protease inhibitor type protein; MULE domain, Mutator-like elements domain; Tr-type G domain, translation-type guanosine-binding domain; SP, signal peptide; Clp R domain, caseinolytic protease repeat domain; AAA+, ATPases associated with diverse cellular activities
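The network construction in panel (c) can be sketched as follows: pairwise TM scores from an all-against-all search become weighted, undirected edges written as a table with Source, Target and Weight columns, which Gephi's spreadsheet import recognizes. The 0.5 cutoff, the function names and the protein labels are assumptions for illustration; the caption does not state which threshold or export format the authors used.

```python
import csv
import io


def tm_edges(pairs, min_tm=0.5):
    """Turn (protein_a, protein_b, tm_score) triples into undirected weighted edges."""
    edges = {}
    for a, b, tm in pairs:
        if a == b or tm < min_tm:
            continue  # drop self-hits and weak structural matches (cutoff is an assumption)
        key = tuple(sorted((a, b)))
        # An all-against-all search reports A-vs-B and B-vs-A; keep the higher score.
        edges[key] = max(tm, edges.get(key, 0.0))
    return edges


def to_edge_table(edges):
    """Serialize edges as CSV with Source/Target/Weight columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), tm in sorted(edges.items()):
        writer.writerow([a, b, tm])
    return out.getvalue()


# Hypothetical scores; labels only mimic the family names in the caption.
pairs = [
    ("RBL1", "RBL2", 0.82),
    ("RBL2", "RBL1", 0.79),
    ("RBL1", "MCM2", 0.31),
    ("RBL1", "RBL1", 1.0),
]
edges = tm_edges(pairs)
table = to_edge_table(edges)
```

Using the TM score as the edge weight matches the caption's description that thicker lines indicate greater structural similarity.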
Fig. 3
Complementation of structure and sequence-based functional annotation enriches the total number of matches and improves the annotation of microsporidia-specific genes. (a) To assess the annotation efficiency of our combined structure and sequence-based similarity approach, we counted the number of identical (green), non-identical (dark grey), not identified (light grey) and experimentally determined (black) functional gene predictions between ANNOTEX and ProtNLM. Additionally, we display the relative number of potential misannotations (dark grey with black dashed line) predicted by ProtNLM and the percentage of ProtNLM gene function predictions with a model score above 0.2 (dark green dashed line) that we transferred to genes which our approach suggested to be uncharacterized or hypothetical. (b) Employing our approach, we functionally annotated 12% (dark green) and characterized the domain of 7% (green) of the 381 uncharacterized E. cuniculi proteins. RBLL-1, ricin B lectin-like 1
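The tally in panel (a) can be approximated with a small counting routine. This is a sketch under stated assumptions: "not identified" is taken to mean that no ProtNLM prediction exists for the gene, functions are compared by case-insensitive exact string match (much cruder than manual curation), and the gene IDs and function names are invented.

```python
from collections import Counter


def compare_annotations(curated, predicted, experimental=frozenset()):
    """Count agreement categories between curated and predicted gene functions."""
    counts = Counter()
    for gene, function in curated.items():
        if gene in experimental:
            counts["experimental"] += 1
        elif predicted.get(gene) is None:
            counts["not identified"] += 1  # assumption: no prediction available
        elif predicted[gene].strip().lower() == function.strip().lower():
            counts["identical"] += 1
        else:
            counts["non-identical"] += 1
    return counts


# Hypothetical example data.
curated = {
    "g1": "Actin",
    "g2": "Ricin B lectin",
    "g3": "Serpin-type protein",
    "g4": "Polar tube protein 6",
}
predicted = {"g1": "actin", "g2": "Kinase", "g3": None}
counts = compare_annotations(curated, predicted, experimental={"g4"})
```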
Fig. 4
Examples of high-confidence structure-based hits for BUSCO genes, cell-division cycle and endoplasmic reticulum resident proteins. (a) BUSCO scores of a selection of microsporidian genomes compared to the score of V. necatrix. The genus Encephalitozoon is colored light grey. The V. necatrix BUSCO score bar is colored yellow with an extension in green representing the four additional genes identified using Foldseek. (b) AlphaFold structures of E. cuniculi (magenta) and V. necatrix (gold) proteins corresponding to the four microsporidia BUSCO genes. These four genes were exclusively identified via structural matching due to their low protein sequence identity. (c) Unambiguous identification of cell-division control protein 45, endoplasmic reticulum resident protein 44 and coiled-coil domain-containing protein 47 through structural similarity searches. Sequence-based searches lead to moderate-to-low-confidence hits comprising uncharacterized proteins, annotated protein domains or proteins with incorrect functional annotation. Sequence identity was calculated with ClustalW (v2.1), and TM scores were generated using TM-align (https://zhanggroup.org/TM-align/). TM score was normalized according to the length of the reference protein. Gold: Identified microsporidian proteins; magenta: Homologs; AF, AlphaFold; PDB, Protein Data Bank
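To make the normalization concrete, the sketch below evaluates the standard TM-score formula for one fixed superposition, dividing by the length of the reference protein as the caption describes. It is a simplification: TM-align additionally searches for the superposition that maximizes this value and treats very short chains specially, neither of which is reproduced here. The function name and inputs are illustrative.

```python
def tm_score(distances, ref_length):
    """TM-score for aligned-residue distances (in angstroms), normalized by ref_length.

    `distances` holds one C-alpha distance per aligned residue pair after
    superposition; unaligned reference residues simply contribute nothing.
    """
    if ref_length <= 21:
        # Short chains need special handling of d0 that this sketch omits.
        raise ValueError("sketch assumes ref_length > 21")
    d0 = 1.24 * (ref_length - 15) ** (1.0 / 3.0) - 1.8  # length-dependent distance scale
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / ref_length


full = tm_score([0.0] * 100, 100)  # every reference residue aligned perfectly
half = tm_score([0.0] * 50, 100)   # only half of the reference is covered
```

Because the sum is divided by the reference length, a perfect match covering only part of the reference still scores below 1, so the same alignment yields different scores depending on which protein is chosen as the reference.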
Fig. 5
Structure-based identification and classification of the abundant RBL protein family. a) Cladogram of Nosematida RBLs named based on available experimental data (PTP4, PTP5, PTP6, RBLL-1) and otherwise termed RBL1 through RBL9. Branches marked with stars indicate a bootstrap value > 70. Protein IDs with asterisks indicate existing publications on the respective gene, hash marks indicate previously identified orthologs to NbPTP6 [86], and proteins in bold with a light grey background indicate the corresponding ten most highly expressed genes during germination. b) Structure-based network of RBL domain folds color-coded according to their clade in a). Each node represents one RBL domain, connecting lines indicate the degree of structural relatedness, and surrounding shapes in brighter shades mark structural clusters. Protein folds of all RBLs identified in a) were predicted with AlphaFold and RBL domains were clustered according to structural similarity based on their TM score using Gephi (v0.9.2) [91]. RBL8 was excluded as the AlphaFold prediction was of very low confidence. c) AlphaFold-predicted protein structures for the PTP4s and PTP5s comparing tertiary structures of the RBL domain between the two protein families and the microsporidian families. E.c., Encephalitozoon cuniculi; E.h., Encephalitozoon hellem; E.r., Encephalitozoon romaleae; N.b., Nosema bombycis; O.c., Ordospora colligata; V.n., Vairimorpha necatrix; V.c., Vairimorpha ceranae; RBL, ricin B lectin; RBLL, ricin B lectin-like; PTP, polar tube protein

References

    1. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389.
    2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10. doi: 10.1016/S0022-2836(05)80360-2.
    3. Yunes JM, Babbitt PC. Effusion: prediction of protein function from sequence similarity networks. Bioinformatics. 2019;35(3):442–51. doi: 10.1093/bioinformatics/bty672.
    4. Higdon R, Louie B, Kolker E. Modeling sequence and function similarity between proteins for protein functional annotation. Proc Int Symp High Perform Distrib Comput. 2010;2010:499–502.
    5. Corradi N, Pombert JF, Farinelli L, Didier ES, Keeling PJ. The complete sequence of the smallest known nuclear genome from the microsporidian Encephalitozoon intestinalis. Nat Commun. 2010;1:77. doi: 10.1038/ncomms1082.