Front Plant Sci. 2018 Mar 15:9:325.

doi: 10.3389/fpls.2018.00325. eCollection 2018.

An Updated Functional Annotation of Protein-Coding Genes in the Cucumber Genome

Hongtao Song¹, Kui Lin¹, Jinglu Hu², Erli Pang¹

Affiliations

¹ MOE Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China.
² Graduate School of Information, Production and Systems, Waseda University, Kitakyushu-shi, Japan.

PMID: 29599790
PMCID: PMC5863696
DOI: 10.3389/fpls.2018.00325

Hongtao Song et al. Front Plant Sci. 2018.

Abstract

Background: Although the cucumber reference genome and its annotation were published several years ago, the functional annotation of predicted genes, particularly protein-coding genes, still requires improvement. In general, accurately determining orthologous relationships between genes allows for better and more robust functional assignments of predicted genes. Collinearity information is one of the most reliable bases for inferring orthology among genes from multiple related genomes. Currently, the identification of collinear segments is based mainly on conservation of gene order and orientation. Over the course of plant genome evolution, however, various evolutionary events have disrupted or distorted the order of genes along chromosomes, making it difficult to use genes as genome-wide markers for plant genome comparisons.
Results: Using the localized LASTZ/MULTIZ analysis pipeline, we aligned 15 genomes, including cucumber and other related angiosperm plants, and identified a set of genomic segments that are short in length, stable in structure, uniform in distribution and highly conserved across all 15 plants. Compared with protein-coding genes, these conserved segments were more suitable as genomic markers for detecting collinear segments among distantly divergent plants. Guided by this set of identified collinear genomic segments, we inferred 94,486 orthologous protein-coding gene pairs (OPPs) between cucumber and 14 other angiosperm species, which were used as proxies for transferring functional terms to cucumber genes from the annotations of the other 14 genomes. In total, 10,885 protein-coding genes were assigned Gene Ontology (GO) terms, nearly 1,300 more than in the results collected in the UniProt proteome database. Our results showed that annotation accuracy would be improved compared with other existing approaches.
Conclusions: In this study, we provided an alternative resource for the functional annotation of predicted cucumber protein-coding genes, accessible from http://cmb.bnu.edu.cn/functional_annotation, which we expect will benefit biological studies of cucumber. Meanwhile, using the cucumber reference genome as a case study, we presented an efficient strategy for transferring gene functional information from previously well-characterized protein-coding genes in model species to newly sequenced or "non-model" plant species.
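
The transfer step described in the Results can be illustrated with a minimal sketch. It assumes the orthologous pairs are already available as simple tuples and that each source genome's annotation is a mapping from gene to GO term IDs; the function name, the data layout, and the gene identifiers are hypothetical and are not taken from the authors' pipeline, which may also weight or filter pairs before transfer.

```python
from collections import defaultdict


def transfer_go_terms(ortholog_pairs, source_annotations):
    """Transfer GO terms to cucumber genes from their orthologs.

    ortholog_pairs: iterable of (cucumber_gene, other_gene) tuples (assumed layout).
    source_annotations: dict mapping other_gene -> set of GO term IDs.
    Returns a dict mapping cucumber_gene -> set of transferred GO terms.
    """
    transferred = defaultdict(set)
    for cucumber_gene, other_gene in ortholog_pairs:
        # Take the union of all terms carried by every ortholog of the gene.
        transferred[cucumber_gene] |= source_annotations.get(other_gene, set())
    # Drop cucumber genes whose orthologs carried no annotation at all.
    return {gene: terms for gene, terms in transferred.items() if terms}


# Hypothetical identifiers, for illustration only.
pairs = [
    ("cuc_gene_1", "ath_gene_1"),
    ("cuc_gene_1", "melon_gene_1"),
    ("cuc_gene_2", "ath_gene_2"),  # ortholog without any GO annotation
]
annotations = {
    "ath_gene_1": {"GO:0003700"},
    "melon_gene_1": {"GO:0006355"},
}
result = transfer_go_terms(pairs, annotations)
print(result)
```

Taking the plain union of terms is the simplest possible rule; a stricter variant could require agreement between several source species before a term is transferred.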

Keywords: collinear segments; cucumber; gene functional annotation; orthology; protein-coding gene.

Figures

**Figure 1**
Phylogenetic tree of species included in the 15-way cucumber-based alignment, used to guide MULTIZ merging of pairwise alignments. The neutral tree is based on four-fold degenerate sites sampled from Chr1–7, with branch lengths proportional to the indicated scale.

**Figure 2**
Alignment of *Cucumis sativus* genome features with corresponding features in species at increasing phylogenetic distances. The solid red line indicates the genome alignability, while the blue dashed line represents previously described CDS regions (Hupalo and Kern, 2013). CDS, coding sequences; UTR, untranslated region; ncRNA, noncoding RNA; TE, transposable elements; TR, tandem repeats. MBE indicates the results from one previous study (Hupalo and Kern, 2013) published in Molecular Biology and Evolution. “Substitutions per Site” lists the divergence from cucumber based on the phylogeny in Figure 1.

**Figure 3**
Composition of multiple alignment anchors (MAAs) compared with the genome features of cucumber. **(Left)** Composition of multiple alignment anchors; **(Right)** Composition of *Cucumis sativus* V2. CDS, coding sequence; NC, noncoding RNA; UTR, untranslated region; gene deserts, intergenic regions longer than 30 Kbp between two adjacent genes.

**Figure 4**
Distribution of n-way **(A)** and 2way-d **(B)** collinear segments (MAAs-based) from 15 angiosperm genomes. The n-way (n ∈ {3, 4, …, 15}) collinear segments indicate the group of species consisting of cucumber and other related species, which were sequentially incorporated based on the topology of the tree in Figure 1, with cucumber as the origin. These segments represent the multiple-species level of collinear segments. Each 2way-d (d ∈ {2, 3, …, 15}, where d is the species index) collinear segment represents a pairwise alignment with one of the 14 non-cucumber species indexed by d, where d increases with the degree of divergence from cucumber according to the phylogenetic tree (Figure 1). Thus, d represents, in ascending order, *Cucumis melo*, *Citrullus lanatus*, *Malus domestica*, *Glycine max*, *Populus trichocarpa*, *Citrus sinensis*, *Brassica rapa*, *Arabidopsis thaliana*, *Arabidopsis lyrata*, *Vitis vinifera*, *Solanum tuberosum*, *Setaria italica*, *Brachypodium distachyon*, and *Oryza brachyantha*.

**Figure 5**
Distribution of OPPs (orthologous protein-coding gene pairs) inferred from different levels of collinear segments. Red bars indicate the results using MAAs as genomic markers; blue bars indicate the results using protein-coding genes as markers. The n-way (n ∈ {3, 4, …, 15}) collinear segments indicate the group of species consisting of cucumber and other related species, which were sequentially incorporated based on the topology of the tree in Figure 1, with cucumber as the origin. These segments represent the multiple-species level of collinear segments. Each 2way-d (d ∈ {2, 3, …, 15}, where d is the species index) collinear segment represents a pairwise alignment with one of the 14 non-cucumber species indexed by d, where d increases with the degree of divergence from cucumber according to the phylogenetic tree (Figure 1). Thus, d represents, in ascending order, *Cucumis melo*, *Citrullus lanatus*, *Malus domestica*, *Glycine max*, *Populus trichocarpa*, *Citrus sinensis*, *Brassica rapa*, *Arabidopsis thaliana*, *Arabidopsis lyrata*, *Vitis vinifera*, *Solanum tuberosum*, *Setaria italica*, *Brachypodium distachyon*, and *Oryza brachyantha*.

**Figure 6**
Bar plot of OPSS (orthologous protein-coding gene pair support score) distributions for OPPs (orthologous protein-coding gene pairs). The vertical axis indicates the number of OPPs, and the horizontal axis represents OPSS value intervals. The green curve is the fitted density.

**Figure 7**
Summary of annotations from eight pipelines. To compare our annotation results with traditional methods, we selected six widely used pipelines, run with default parameters, to assign GO terms to cucumber protein-coding genes: Blast2GO (abbreviated as b2go), InterProScan (abbreviated as ips2go), OrthoMCL (abbreviated as orthomcl2go), Trinotate-Blast (abbreviated as trib2go), Trinotate-Pfam (abbreviated as trip2go), and the UniProt resource (abbreviated as uniprot). Detailed settings for each pipeline are described in Materials and Methods. Our own pipeline was run using either MAAs [abbreviated as opp2go (MAAs)] or protein-coding genes [abbreviated as opp2go (protein-coding genes)] as genomic markers. **(A)** Coverage of annotated genes; **(B)** numbers of GO terms covered by annotations; **(C)** average number of GO terms associated with each gene; **(D)** average number of genes annotated by each GO term. “Subset” indicates that GO terms associated with > 300 genes or < 5 genes were filtered out.
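
The four panel statistics and the “Subset” filter can be computed along the following lines. This is only a sketch: the function name, the dictionary layout, and the toy data are assumptions, and the way the averages are taken (over annotated genes and over annotated terms) is a guess at what the panels report rather than a documented definition.

```python
def annotation_summary(gene_to_terms, total_genes, min_genes=5, max_genes=300):
    """Summarise one pipeline's annotation (assumed layout: gene -> set of GO terms)."""
    # Invert the mapping so every GO term lists the genes it annotates.
    term_to_genes = {}
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes.setdefault(term, set()).add(gene)

    annotated = [gene for gene, terms in gene_to_terms.items() if terms]
    n_pairs = sum(len(genes) for genes in term_to_genes.values())

    # "Subset": keep terms with at least min_genes and at most max_genes genes,
    # i.e. discard terms with > 300 or < 5 genes under the default thresholds.
    subset_terms = {
        term for term, genes in term_to_genes.items()
        if min_genes <= len(genes) <= max_genes
    }
    return {
        "coverage": len(annotated) / total_genes,
        "n_terms": len(term_to_genes),
        "terms_per_gene": n_pairs / len(annotated) if annotated else 0.0,
        "genes_per_term": n_pairs / len(term_to_genes) if term_to_genes else 0.0,
        "subset_terms": subset_terms,
    }


# Toy data with hypothetical gene and term labels; min_genes is lowered to 2
# so that the filter has a visible effect on such a small example.
toy = {"g1": {"A", "B"}, "g2": {"A"}, "g3": set()}
summary = annotation_summary(toy, total_genes=4, min_genes=2)
print(summary)
```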

**Figure 8**
Comparison of annotation results among eight different pipelines. opp2go (MAAs), our pipeline using MAAs as genomic markers; opp2go (proteins), our pipeline using protein-coding genes as genomic markers; b2go, Blast2GO; ips2go, InterProScan pipeline; orthomcl2go, OrthoMCL pipeline; trib2go, Trinotate-Blast; trip2go, Trinotate-Pfam; uniprot, UniProt resource. MF, molecular function; BP, biological process; CC, cellular component. Top-left indicates the Jaccard similarity for the MF subset (structure-free); top-right shows annotated gene set similarity (structure-free); bottom-left indicates the semantic similarity for the MF subset (structure-based); bottom-right indicates the semantic similarity for the BP subset (structure-based).
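
For the structure-free panels, the Jaccard similarity between two sets is the size of their intersection divided by the size of their union. The sketch below applies it to the gene sets annotated by each pipeline; the helper names and the toy gene sets are hypothetical, and the structure-based semantic similarity, which depends on the GO graph, is not covered here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(pipelines):
    """Compare every unordered pair of pipelines (assumed layout: name -> set of items)."""
    names = sorted(pipelines)
    return {
        (first, second): jaccard(pipelines[first], pipelines[second])
        for i, first in enumerate(names)
        for second in names[i + 1:]
    }


# Hypothetical annotated gene sets for two of the pipelines.
gene_sets = {
    "opp2go": {"g1", "g2", "g3"},
    "b2go": {"g2", "g3", "g4"},
}
similarities = pairwise_jaccard(gene_sets)
print(similarities)
```

The same function applies unchanged to sets of (gene, GO term) pairs if the comparison is meant at the level of individual annotations rather than annotated genes.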

**Figure 9**
Validation of annotations based on gene co-expression. opp_MAAs, our pipeline using MAAs as genomic markers; opp_protein, our pipeline using protein-coding genes as genomic markers; b2g, Blast2GO; ips, InterProScan pipeline; orthomcl, OrthoMCL pipeline; trib, Trinotate-Blast; trip, Trinotate-Pfam; uniprot, UniProt resource. Given a set of cucumber genes linked to a BP/MF/CC term by a specific pipeline, the average Pearson correlation coefficient for co-expression of the genes was compared with that of a random gene set. For each pipeline, the number of GO terms with p-value < 0.001 is indicated.
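
One way to carry out such a comparison is a resampling test: compute the mean pairwise Pearson correlation among the genes sharing a GO term, then count how often equally sized random gene sets reach at least that value. The caption does not state how the p-value was obtained, so the resampling scheme, the function names, and the toy expression profiles below are all assumptions.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def mean_pairwise_correlation(genes, expression):
    """Average Pearson correlation over all gene pairs; needs at least two genes."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def empirical_p_value(term_genes, expression, n_random=1000, seed=0):
    """Fraction of random gene sets at least as co-expressed as the term's genes."""
    rng = random.Random(seed)
    observed = mean_pairwise_correlation(term_genes, expression)
    all_genes = sorted(expression)
    hits = 0
    for _ in range(n_random):
        sample = rng.sample(all_genes, len(term_genes))
        if mean_pairwise_correlation(sample, expression) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_random + 1)


# Hypothetical expression profiles across three conditions.
expression = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [2.0, 4.0, 6.0],
    "g3": [3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0],
}
p_value = empirical_p_value(["g1", "g2"], expression, n_random=200)
print(p_value)
```

With 1,000 resamples the smallest attainable estimate is about 0.001, so testing against a threshold of 0.001 would require more resamples than the default used in this sketch.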

Cited by

Identification of clade-wide putative cis-regulatory elements from conserved non-coding sequences in Cucurbitaceae genomes.
Song H, Wang Q, Zhang Z, Lin K, Pang E. Hortic Res. 2023 Feb 28;10(4):uhad038. doi: 10.1093/hr/uhad038. eCollection 2023 Apr. PMID: 37799630. Free PMC article.
