Automated generation of gene summaries at the Alliance of Genome Resources

Affiliations

¹ WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
² ZFIN, The Institute of Neuroscience, 222 Huestis Hall, University of Oregon, Eugene, OR 97403-1254, USA.
³ Saccharomyces Genome Database, Department of Genetics, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304, USA.
⁴ FlyBase, Department of Physiology, Development and Neuroscience, 7 Downing Pl, University of Cambridge, Cambridge CB2 3DY, UK.
⁵ MGI, The Jackson Laboratory, Bar Harbor, ME 04609, USA.
⁶ Rat Genome Database, Department of Biomedical Engineering, Medical College of Wisconsin and Marquette University, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA.

PMID: 32559296
PMCID: PMC7304461
DOI: 10.1093/database/baaa037

Ranjana Kishore et al. Database (Oxford). 2020 Jan 1:2020:baaa037. doi: 10.1093/database/baaa037.

Abstract

Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, we developed a new algorithm that traverses ontology graphs in order to group terms by their common ancestors. The algorithm optimizes the coverage of the initial set of terms and limits the length of the final summary, using measures of information content of each ontology term as a criterion for inclusion in the summary. The automated gene summaries are generated with each Alliance release, ensuring that they reflect current data at the Alliance. Our method effectively leverages category-specific curation efforts of the Alliance member databases to create modular, structured and standardized gene summaries for seven member species of the Alliance. These automatically generated gene summaries make cross-species gene function comparisons tenable and increase discoverability of potential models of human disease. In addition to being displayed on Alliance gene pages, these summaries are also included on several MOK gene pages.
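The template step described in the abstract can be illustrated with a minimal sketch. The category names, template wording and function names below are hypothetical and chosen only for illustration; they are not taken from the Alliance software.

```python
# Hypothetical sentence templates, one per data category (illustrative wording only).
TEMPLATES = {
    "process": "Involved in {terms}.",
    "expression": "Is expressed in {terms}.",
    "disease": "Used to study {terms}.",
}


def join_terms(terms):
    """Join terms as 'a', 'a and b' or 'a, b and c'."""
    terms = list(terms)
    if not terms:
        return ""
    if len(terms) == 1:
        return terms[0]
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def build_sentence(category, terms):
    """Fill the template of one data category; empty if there are no terms."""
    if not terms:
        return ""
    return TEMPLATES[category].format(terms=join_terms(terms))


def build_summary(annotations):
    """Concatenate the category sentences in the order the categories are given."""
    sentences = (build_sentence(cat, terms) for cat, terms in annotations.items())
    return " ".join(s for s in sentences if s)
```

Keeping one template per category makes the summary modular: a category with no annotations simply contributes no sentence.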

Figures

**Figure 1**
Workflow diagram of the gene summary generation process at the Alliance of Genome Resources. Solid arrows represent sequential steps followed by the software to generate the summaries, whereas dashed arrows represent data flow from/to the algorithm to data stores. Ontologies and annotations are loaded and represented as graphs. Then, term filters and renaming are applied to the annotations and to the ontology graphs. For each gene, basic information such as gene ID, name and additional information such as orthology data are fetched. The list of terms associated with the gene is extracted from the Alliance database and sentences are generated according to the templates defined for each data category. If the list of terms exceeds the defined maximum number, the trimming algorithm reduces the length of the sentence by traversing the related ontology graph and by selecting the common ancestors that best group the initial set of terms. Ontology graphs are also used to resolve parent–child relationships to avoid including both parent and child terms in the final summaries. The final summaries are generated by concatenating the data category specific sentences and are written to the Alliance database and to the download files available on the Alliance website.
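The parent–child resolution mentioned in the Figure 1 caption can be sketched as follows. This is a simplified illustration that assumes the more specific (child) term is the one kept; the caption does not say which of the two is retained, and the toy ontology and function names are hypothetical.

```python
def ancestors(term, parents):
    """Return all ancestors of a term in a DAG given as child -> list of parents."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def drop_redundant_parents(terms, parents):
    """Remove every term that is an ancestor of another term in the same list.

    Assumption: the child term is kept and the parent is dropped.
    """
    terms = list(terms)
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents)
    return [t for t in terms if t not in covered]


# Hypothetical toy ontology fragment, for illustration only.
TOY_PARENTS = {
    "pharyngeal neuron": ["neuron", "pharynx"],
    "neuron": ["cell"],
    "pharynx": ["organ"],
}
```

With this toy graph, `drop_redundant_parents(["neuron", "pharyngeal neuron", "intestine"], TOY_PARENTS)` keeps "pharyngeal neuron" and "intestine" and drops "neuron".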

**Figure 2**
Example of the gene summary for the *C. elegans* gene *cdk-4* with the different data categories highlighted in different boxes.

**Figure 3**
A portion of the *C. elegans* anatomy ontology graph (generated with the WB SObA tool; 33). Terms circled in red (single solid circle) represent the initial set of annotated terms for the *C. elegans* gene *abf-1*. Dashed blue circles are the ancestors of the initial terms and green circles (double circle) are their respective LCAs in the ontology (excluding the root node). The terms pharynx and neuron (marked by yellow squares) are chosen by the trimming algorithm as they are the only LCAs at the predefined minimum distance from the root (depicted as a dashed horizontal line), which in this example is set to 3. Note that the distance of a term from the root is the length of the longest path between them, which is highlighted in the figure as an example for the term neuron.
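The two ideas in the Figure 3 caption, distance from the root as the longest path and selection of lowest common ancestors (LCAs) at a minimum distance, can be sketched on a toy graph. The graph, the pairwise LCA strategy and all names below are assumptions made for illustration; the actual trimming algorithm may differ in how it enumerates and ranks candidates.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical toy DAG (child -> parents); not the real C. elegans anatomy ontology.
PARENTS = {
    "root": [],
    "system": ["root"],
    "cell": ["root"],
    "nervous system": ["system"],
    "alimentary system": ["system"],
    "neuron": ["cell", "nervous system"],
    "pharynx": ["alimentary system"],
    "motor neuron": ["neuron"],
    "sensory neuron": ["neuron"],
    "pharyngeal muscle": ["pharynx"],
    "pharyngeal epithelium": ["pharynx"],
}


@lru_cache(maxsize=None)
def depth(term):
    """Distance from the root, defined as the length of the LONGEST path."""
    parents = PARENTS[term]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


@lru_cache(maxsize=None)
def ancestors_or_self(term):
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors_or_self(parent)
    return frozenset(result)


def lowest_common_ancestors(a, b):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors_or_self(a) & ancestors_or_self(b)
    return {c for c in common
            if not any(c != d and c in ancestors_or_self(d) for d in common)}


def trim_terms(terms, min_depth):
    """Replace terms by pairwise LCAs lying at least min_depth from the root."""
    chosen = set()
    for a, b in combinations(terms, 2):
        chosen |= {c for c in lowest_common_ancestors(a, b) if depth(c) >= min_depth}
    # Terms not covered by any chosen ancestor are kept unchanged.
    uncovered = [t for t in terms if not (ancestors_or_self(t) & chosen)]
    return sorted(chosen) + uncovered
```

In this toy graph "neuron" is reachable from the root in two steps through "cell" but in three through "nervous system", so its depth is 3. With a minimum depth of 3, the four leaf terms collapse to "neuron" and "pharynx", while the shallower common ancestor "system" is rejected.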

**Figure 4**
Untrimmed and trimmed summaries for the zebrafish gene *sox17*. (A) Untrimmed summary that shows all the 25 terms annotated to the gene. (B) The summary trimmed with the LCA-based algorithm. (C) The summary trimmed with the algorithm based on IC_Sanchez. Text highlighted in purple indicates the tissue expression data category, which has 17 terms in the untrimmed summary. Text in bold shows the difference between (B) and (C).
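As a rough illustration of an information content (IC) measure of the kind named in the Figure 4 caption, the sketch below assumes that IC_Sanchez refers to the intrinsic IC of Sánchez et al., commonly stated as IC(c) = -log((|leaves(c)| / |subsumers(c)| + 1) / (max_leaves + 1)), where subsumers(c) includes c itself. That reading, the toy graph and the function names are assumptions, not details given in this text.

```python
import math

# Hypothetical toy DAG (child -> parents), for illustration only.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
    "b1": ["b"],
}

# Invert the parent map to get children of each term.
CHILDREN = {term: [] for term in PARENTS}
for child, parents in PARENTS.items():
    for parent in parents:
        CHILDREN[parent].append(child)

# Total number of leaf terms in the ontology.
MAX_LEAVES = sum(1 for term in CHILDREN if not CHILDREN[term])


def subsumers(term):
    """The term itself plus all of its ancestors."""
    result = {term}
    for parent in PARENTS[term]:
        result |= subsumers(parent)
    return result


def leaves_below(term):
    """Leaf terms that are strict descendants of the given term."""
    found = set()
    stack = list(CHILDREN[term])
    while stack:
        node = stack.pop()
        if CHILDREN[node]:
            stack.extend(CHILDREN[node])
        else:
            found.add(node)
    return found


def ic_sanchez(term):
    """Intrinsic IC under the assumed formula: 0 at the root, maximal at leaves."""
    ratio = len(leaves_below(term)) / len(subsumers(term))
    return -math.log((ratio + 1) / (MAX_LEAVES + 1))
```

Under this formula more specific terms score higher, so an IC threshold can serve as an inclusion criterion in place of a fixed distance from the root.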
