Mamm Genome. 2015 Aug;26(7-8):295-304.
doi: 10.1007/s00335-015-9571-1. Epub 2015 Jun 18.

A unified gene catalog for the laboratory mouse reference genome

Y Zhu et al. Mamm Genome. 2015 Aug.

Abstract

We report here a semi-automated process by which mouse genome feature predictions and curated annotations (e.g., genes, pseudogenes, and functional RNAs) from Ensembl, NCBI, and the Vertebrate Genome Annotation database (Vega) are reconciled with the genome features in the Mouse Genome Informatics (MGI) database (http://www.informatics.jax.org) into a comprehensive and non-redundant catalog. Our gene unification method employs an algorithm, fjoin (feature join), for efficient detection of genome coordinate overlaps among features represented in two annotation data sets. Following the analysis with fjoin, genome features are binned into six possible categories (1:1, 1:0, 0:1, 1:n, n:1, n:m) based on coordinate overlaps. These categories are subsequently prioritized for assessment of annotation equivalencies and differences. The version of the unified catalog reported here contains more than 59,000 entries, including 22,599 protein-coding genes, 12,455 pseudogenes, and 24,007 features of other types (e.g., microRNAs and lincRNAs). More than 23,000 of the entries in the MGI gene catalog have equivalent gene models in the annotation files obtained from NCBI, Vega, and Ensembl. Of the remaining features, 12,719 are unique to NCBI relative to Ensembl/Vega, 11,957 are unique to Ensembl/Vega relative to NCBI, and 3,095 are unique to MGI. More than 4,000 genome features fall into categories that require manual inspection to resolve structural differences in the gene models from different annotation sources. Using the MGI unified gene catalog, researchers can generate a comprehensive report of mouse genome features from a single source and compare the details of gene and transcript structure using MGI's mouse genome browser.
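
The binning step described in the abstract can be illustrated with a short sketch. This is not the fjoin implementation: the overlap search below is a naive all-pairs comparison, and the category rule (counting the features from each data set in every connected group of overlapping features) is an assumption about how the six bins could be derived. The names Feature, overlaps, bin_features, and MANUAL_REVIEW are hypothetical, as is the choice to require matching chromosome and strand.

```python
from collections import defaultdict
from typing import NamedTuple


class Feature(NamedTuple):
    fid: str     # feature identifier (hypothetical field names)
    chrom: str
    start: int   # inclusive
    end: int     # inclusive
    strand: str


def overlaps(a: Feature, b: Feature) -> bool:
    """True if two features share at least one base.

    Assumption: features must be on the same chromosome and strand.
    """
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def category(n_a: int, n_b: int) -> str:
    """Name the bin for a group holding n_a features from A and n_b from B."""
    if (n_a, n_b) in {(1, 1), (1, 0), (0, 1)}:
        return f"{n_a}:{n_b}"
    if n_a == 1:
        return "1:n"
    if n_b == 1:
        return "n:1"
    return "n:m"


# Categories that the unification process sends to manual assessment.
MANUAL_REVIEW = {"1:n", "n:1", "n:m"}


def bin_features(set_a, set_b):
    """Group overlapping features and bin each group by category.

    Naive O(len(set_a) * len(set_b)) overlap search; fjoin itself is
    described as more efficient and is not reproduced here.
    """
    # Build a bipartite overlap graph between the two data sets.
    adj = defaultdict(set)
    for a in set_a:
        for b in set_b:
            if overlaps(a, b):
                adj[("A", a.fid)].add(("B", b.fid))
                adj[("B", b.fid)].add(("A", a.fid))

    nodes = [("A", f.fid) for f in set_a] + [("B", f.fid) for f in set_b]
    seen = set()
    bins = defaultdict(list)
    for node in nodes:
        if node in seen:
            continue
        # Depth-first walk collects one connected group of features.
        seen.add(node)
        stack, group = [node], []
        while stack:
            cur = stack.pop()
            group.append(cur)
            for nxt in adj.get(cur, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        a_ids = sorted(i for src, i in group if src == "A")
        b_ids = sorted(i for src, i in group if src == "B")
        bins[category(len(a_ids), len(b_ids))].append((a_ids, b_ids))
    return dict(bins)
```

Calling bin_features with two lists of Feature records returns a dictionary keyed by category; groups whose key is in MANUAL_REVIEW would be the ones routed to a curator, while the 1:1, 1:0, and 0:1 groups correspond to the cases that need minimal manual assessment.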

Figures

Fig. 1
An overview of the gene unification process. Following the comparison of gene predictions and curated annotations using fjoin, the coordinate-based overlap results are binned into six categories. Three of the categories (1:1, 0:1, 1:0) can be loaded into MGI with minimal manual assessment. The other three categories (1:n, n:1, n:m) require manual assessment followed by resolution of annotation discrepancies through communication with the annotation provider(s) or by changes in MGI
Fig. 2
Example of genome features in the 1:1, 1:0, and 0:1 categories generated by fjoin. a (i) The Arl8b and Edem1 genes have equivalent (1:1) predictions in NCBI and Ensembl, but these genes are not currently represented in the Vega database (1:0). (ii) The NCBI non-protein-coding RNA gene (GeneID:102638990) is unique to the predictions from NCBI (0:1). (iii) The MGI gene, 9430088B20Rik (MGI:2445127), is unique to MGI (0:1). b (i) The Olfr794 gene (MGI:3030628) has equivalent (1:1) predictions in NCBI and Ensembl, but not in Vega (1:0). (ii) The pseudogene, Olfr795-ps1 (MGI:3030629), is only annotated by NCBI. (iii) The miRNA gene Gm23252 (MGI:5453029) is predicted only by Ensembl
Fig. 3
Example of genome features in the 1:n and n:m categories generated by fjoin. a The lincRNA gene, Gm13853 (MGI:3649279), has a 1:n relationship with two NCBI genes (GeneID:102634942 and GeneID:102634837) based on the coordinate overlap shown in the boxed regions. b The Ensembl gene models Gbp8 (ENSMUSG00000034438) and Gbp9 (ENSMUSG00000029298) both have extended first exons that overlap the upstream gene, Gbp4 (ENSMUSG00000079363) (shown in the boxed regions), resulting in an n:m relationship with the NCBI gene Gbp4 (GeneID:17472)
Fig. 4
Differences in gene definitions among genome annotation groups lead to ambiguity in determining equivalency of genome features. The cases illustrated in this figure reflect differences in how genes are defined rather than annotation errors and are excluded from further manual review. a NCBI and MGI represent Ugt2a1 (MGI:2149905) and Ugt2a2 (MGI:3576095) as two different genes, while Ensembl and HAVANA represent the data as a single gene with multiple alternative transcripts. b NCBI's mouse genome annotation contains separate entries for (i) Esp5 (MGI:5522708) and (ii) Esp6 (MGI:3643294) as well as the (iii) Esp6Esp5 (MGI:5529083) read-through product. Ensembl lacks a specific genome annotation for Esp5, but does represent (ii*) Esp6 and (iii*) Esp6Esp5
Fig. 5
Example of an annotation improvement resulting from collaboration among curators from MGI, NCBI, and Vega. a Vega annotation version 35 for the reference mouse genome (GRCm37) included two separate genes (OTTMUSG0000009560 and OTTMUSG0000009562) that overlapped a single gene in the MGI catalog (Gm853; MGI:2685699). This case was identified by the review of features in the 1:n category following a previous fjoin analysis. b Upon review of all of the evidence, the HAVANA curation team merged gene OTTMUSG0000009560 with OTTMUSG0000009562. The transcript that was previously used as evidence of a different gene is now represented as an alternative processed transcript of OTTMUSG0000009562
Fig. 6
a The MGI biotype conflict note is shown for the pseudogene, Amy2b (MGI:104547), which is annotated as a pseudogene by both Vega and Ensembl but as a protein-coding gene by NCBI. b There is also a Strain-Specific Marker notification displayed for this locus because Amy2b has been shown to be a functional gene in the YBR strain but a null allele in the A/J mouse strain
Fig. 7
Example of a genome feature in the 1:1 category following fjoin analysis. The Zfp951 (MGI:2441896) gene has equivalent representations in the annotation output from Ensembl, Vega, and NCBI. However, the structural details of the predictions differ because of how evidence from different transcripts was incorporated into the gene model. The model displayed in the MGI Genome Features track is an aggregate of the gene model components from all three prediction/annotation resources. The arrows highlight features that are present in gene predictions from Ensembl and HAVANA/Vega but not from NCBI
