2020 Jan 1:2020:baaa037.
doi: 10.1093/database/baaa037.

Automated generation of gene summaries at the Alliance of Genome Resources

Ranjana Kishore et al. Database (Oxford).

Abstract

Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, we developed a new algorithm that traverses ontology graphs in order to group terms by their common ancestors. The algorithm optimizes the coverage of the initial set of terms and limits the length of the final summary, using measures of information content of each ontology term as a criterion for inclusion in the summary. The automated gene summaries are generated with each Alliance release, ensuring that they reflect current data at the Alliance. Our method effectively leverages category-specific curation efforts of the Alliance member databases to create modular, structured and standardized gene summaries for seven member species of the Alliance. These automatically generated gene summaries make cross-species gene function comparisons tenable and increase discoverability of potential models of human disease. In addition to being displayed on Alliance gene pages, these summaries are also included on several MOK gene pages.
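The template step described in the abstract can be illustrated with a minimal sketch. The category names, template wording and helper functions below are hypothetical and chosen only for illustration; they are not the Alliance's actual templates or code.

```python
# Hypothetical sentence templates, one per data category (illustrative wording only).
TEMPLATES = {
    "go_function": "Exhibits {terms}.",
    "go_process": "Is involved in {terms}.",
    "expression": "Is expressed in {terms}.",
    "disease": "Is used to study {terms}.",
}


def join_terms(terms):
    """Join term labels into an English list: 'a', 'a and b', 'a, b and c'."""
    terms = list(terms)
    if not terms:
        return ""
    if len(terms) == 1:
        return terms[0]
    return ", ".join(terms[:-1]) + " and " + terms[-1]


def build_sentence(category, terms):
    """Fill the template for one data category; return '' if there are no terms."""
    if not terms:
        return ""
    return TEMPLATES[category].format(terms=join_terms(terms))


def build_summary(annotations):
    """Concatenate the per-category sentences in a fixed category order."""
    sentences = (build_sentence(c, annotations.get(c, [])) for c in TEMPLATES)
    return " ".join(s for s in sentences if s)
```

For example, `build_summary({"expression": ["pharynx", "neuron"]})` yields a single sentence for the expression category, and categories without annotations are skipped, which keeps the summary modular.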

Figures

Figure 1
Workflow diagram of the gene summary generation process at the Alliance of Genome Resources. Solid arrows represent sequential steps followed by the software to generate the summaries, whereas dashed arrows represent data flow from/to the algorithm to data stores. Ontologies and annotations are loaded and represented as graphs. Then, term filters and renaming are applied to the annotations and to the ontology graphs. For each gene, basic information such as gene ID, name and additional information such as orthology data are fetched. The list of terms associated with the gene is extracted from the Alliance database and sentences are generated according to the templates defined for each data category. If the list of terms exceeds the defined maximum number, the trimming algorithm reduces the length of the sentence by traversing the related ontology graph and by selecting the common ancestors that best group the initial set of terms. Ontology graphs are also used to resolve parent–child relationships to avoid including both parent and child terms in the final summaries. The final summaries are generated by concatenating the data category specific sentences and are written to the Alliance database and to the download files available on the Alliance website.
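The parent–child resolution mentioned in this caption can be sketched as follows. This is a simplified illustration, not the published implementation: the ontology is assumed to be a plain dictionary mapping each term to its direct parents, and the sketch assumes that the more specific (child) term is the one kept, which the caption does not state explicitly.

```python
def ancestors(term, parents):
    """Return all proper ancestors of `term` in a child -> parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def drop_redundant_parents(terms, parents):
    """Remove every term that is an ancestor of another term in the list.

    Assumption: the child term is preferred over its parent.
    The input order of the remaining terms is preserved.
    """
    covered = set()
    for term in terms:
        covered |= ancestors(term, parents)
    return [t for t in terms if t not in covered]
```

With a toy mapping in which "pharyngeal neuron" has parents "neuron" and "pharynx", a term list containing both "neuron" and "pharyngeal neuron" is reduced to the more specific term only.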
Figure 2
Example of the gene summary for the C. elegans gene cdk-4 with the different data categories highlighted in different boxes.
Figure 3
A portion of the C. elegans anatomy ontology graph (generated with the WB SObA tool; 33). Terms circled in red (single solid circle) represent the initial set of annotated terms for the C. elegans gene abf-1. Dashed blue circles are the ancestors of the initial terms and green circles (double circle) are their respective LCAs in the ontology (excluding the root node). The terms pharynx and neuron (marked by yellow squares) are chosen by the trimming algorithm as they are the only LCAs at the predefined minimum distance from the root (depicted as a dashed horizontal line), which in this example is set to 3. Note that the distance of a term from the root is the length of the longest path between them, which is highlighted in the figure as an example for the term neuron.
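The two notions used in this caption, distance from the root as the longest path and selection of LCAs at a minimum distance, can be sketched on a toy graph. The ontology below is invented for illustration (it is not the real C. elegans anatomy ontology), and the function names are hypothetical; the published algorithm also optimizes coverage and summary length, which this sketch omits.

```python
from functools import lru_cache
from itertools import combinations

# Toy child -> parents mapping (hypothetical, for illustration only).
TOY_PARENTS = {
    "anatomy": ["root"],
    "organ": ["anatomy"],
    "cell": ["anatomy"],
    "pharynx": ["organ"],
    "neuron": ["cell", "anatomy"],  # two paths to the root: lengths 3 and 2
    "pm1": ["pharynx"],
    "pm2": ["pharynx"],
    "ASH": ["neuron"],
    "AWA": ["neuron"],
}


def make_depth(parents):
    """Return a function giving the LONGEST path length from a term to the root."""
    @lru_cache(maxsize=None)
    def depth(term):
        direct = parents.get(term, ())
        return 0 if not direct else 1 + max(depth(p) for p in direct)
    return depth


def ancestors_or_self(term, parents):
    """Return `term` together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lowest_common_ancestors(a, b, parents):
    """Common ancestors of a and b that have no descendant among the common ancestors."""
    common = ancestors_or_self(a, parents) & ancestors_or_self(b, parents)
    return {c for c in common
            if not any(c != d and c in ancestors_or_self(d, parents) for d in common)}


def candidate_groupers(terms, parents, min_depth):
    """Collect pairwise LCAs that lie at least `min_depth` from the root."""
    depth = make_depth(parents)
    found = set()
    for a, b in combinations(terms, 2):
        for lca in lowest_common_ancestors(a, b, parents):
            if depth(lca) >= min_depth:
                found.add(lca)
    return found
```

In the toy graph, "neuron" is reachable from the root by paths of length 2 and 3, so its distance is taken as 3. With a minimum distance of 3, the annotated terms pm1, pm2, ASH and AWA are grouped under "pharynx" and "neuron", while the shallower common ancestor "anatomy" is rejected.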
Figure 4
Untrimmed and trimmed summaries for the zebrafish gene sox17. (A) Untrimmed summary that shows all the 25 terms annotated to the gene. (B) The summary trimmed with the LCA-based algorithm. (C) The summary trimmed with the algorithm based on IC_Sanchez. Text highlighted in purple indicates the tissue expression data category, which has 17 terms in the untrimmed summary. Text in bold shows the difference between (B) and (C).
