Genome Res. 2003 Jun;13(6B):1505-19.
doi: 10.1101/gr.991003.

Connecting sequence and biology in the laboratory mouse

Richard M Baldarelli et al. Genome Res. 2003 Jun.

Abstract

The Mouse Genome Sequencing Consortium and the RIKEN Genome Exploration Research group have generated large sets of sequence data representing the mouse genome and transcriptome, respectively. These data provide a valuable foundation for genomic research. The challenges for the informatics community are how to integrate these data with the ever-expanding knowledge about the roles of genes and gene products in biological processes, and how to provide useful views to the scientific community. Public resources, such as the National Center for Biotechnology Information (NCBI; http://www.ncbi.nih.gov), and model organism databases, such as the Mouse Genome Informatics database (MGI; http://www.informatics.jax.org), maintain the primary data and provide connections between sequence and biology. In this paper, we describe how the partnership of MGI and NCBI LocusLink contributes to the integration of sequence and biology, especially in the context of the large-scale genome and transcriptome data now available for the laboratory mouse. In particular, we describe the methods and results of integration of 60,770 FANTOM2 mouse cDNAs with gene records in the databases of MGI and LocusLink.

Figures

Figure 1
The flow from FANTOM2 mouse cDNA clones to genes, to their integration with NCBI LocusLink and MGI. Mouse cDNA clones isolated (closed circles in the first panel) and sequenced (horizontal lines in the second panel) by the RIKEN group are clustered computationally (top clusters in the second panel). Computed clusters are then resolved into gene-specific groups by human inspection (bottom clusters in the second panel). Dotted lines represent transcript variation. Computed clusters can group sequences from different genes, such as paralogs and read-through transcripts (third and fourth computed clusters from left, respectively), and other distinct gene sequences that share some region of overlap requiring manual resolution. CDS regions for protein coding genes are indicated (horizontal arrows over clusters). Equivalence of FANTOM2 sequences with known mouse genes in NCBI LocusLink and MGI is detected by incorporation of known sequences in the FANTOM2 clusters or by BLAST (data not shown). LocusLink and MGI contain overlapping but distinct sequence data sets. Some characterized mouse sequences not present in LocusLink or MGI can have sequence identity to FANTOM2 sequences (far right cluster). Remaining FANTOM2 genes are considered novel. The curation of sequences for novel and known mouse genes is coordinated between LocusLink and MGI, and LocusLink establishes RefSeqs (third panel). Genome centers feed predicted gene models to NCBI, but rely on transcript-based evidence in the form of RefSeqs to improve genome annotations. Gene models with enriched annotations link back to gene records in LocusLink and MGI on the basis of integrated sequence accessions. Through data coordination, LocusLink and MGI establish a catalog of mouse genes with accurate sequence associations and integrated biological information.
Figure 2
Schematic showing status and number code assignments for clusters and their use in cluster curation. Hypothetical outputs of the three cluster builds for six FANTOM2 clones are shown. Status and number codes for each clone, as well as cluster union IDs, appear in the table to the right. FANTOM2 clones and non-RIKEN public sequences are shown as solid and open boxes, respectively. Sequences associated with MGI genes are distinguished by block arrows. In the top set of clusters, RIKEN clones 1 and 2 were grouped with the same EST sequences in NCBI UniGene and TIGR clusters, and were assigned the same cluster union ID (100). The RIKEN status code (-NT) for these clones indicates that NCBI UniGene and TIGR clusters are the same for these clones, but that RIKEN clusters A and B are different. The RIKEN number code (1,2,2) indicates that one, two, and two total RIKEN clones were clustered with those clones (including themselves) in the three respective cluster builds, irrespective of clone identities. The MGI status code (-NT) indicates that only the UniGene and TIGR clusters grouped sequences associated with the same number and identity of MGI genes (only MGI gene Shh is represented via EST 1234). The MGI number code (0,1,1) indicates that a single MGI gene is represented in the UniGene and TIGR clusters (irrespective of gene identities) and none in the RIKEN clusters containing these clones. The bottom set shows the clusters containing four additional RIKEN clones (3 through 6). Clones 3, 4, and 5 are grouped in UniGene cluster Mm.12, whereas clones 3, 4, and 6 are grouped in TIGR cluster TC:22; thus, all four clones are assigned to the same cluster union (200). The variability in RIKEN status codes indicates that each clone was grouped differently from the others by the three builds. The MGI number code (0,2,1) for three of the clones (3, 4, and 5) indicates that the UniGene cluster containing them (Mm.12) has grouped sequences associated with two MGI genes (Fgf4 and Fgf5), whereas only one MGI gene is represented in each of the TIGR clusters that contain them (TC:22 and TC:23). The MGI status code (---) for each clone indicates that no two clusters containing them represent exactly the same set of MGI genes. For this example, curators would determine whether the sequence associations in UniGene cluster Mm.12 are biologically appropriate; if so, then the MGI gene records involved may need to be merged into a single record.
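The status codes, number codes, and cluster unions described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the dictionary layout, and the rule that a build keeps its letter (R, N, or T) whenever at least one other build holds an identical set are assumptions inferred from the "-NT" and "---" examples in the caption.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# Build order used for both codes: RIKEN, NCBI UniGene, TIGR.
BUILDS = ("R", "N", "T")


def number_code(members: Dict[str, FrozenSet[str]]) -> Tuple[int, ...]:
    """Count the items (clones or MGI genes) seen in each build's cluster."""
    return tuple(len(members[b]) for b in BUILDS)


def status_code(members: Dict[str, FrozenSet[str]]) -> str:
    """Keep a build's letter only if another build holds the identical set.

    Assumption: letters follow the R/N/T pattern implied by "-NT" and "---".
    """
    return "".join(
        b if any(members[b] == members[o] for o in BUILDS if o != b) else "-"
        for b in BUILDS
    )


def cluster_unions(clusters: Iterable[Iterable[str]]) -> List[Set[str]]:
    """Merge clusters from any build that share at least one clone."""
    unions: List[Set[str]] = []
    for cluster in clusters:
        merged = set(cluster)
        disjoint = []
        for union in unions:
            if union & merged:
                merged |= union  # overlapping union is absorbed
            else:
                disjoint.append(union)
        unions = disjoint + [merged]
    return unions


# Clone 1 from the top set of clusters in the caption.
clone1_riken = {
    "R": frozenset({"clone1"}),
    "N": frozenset({"clone1", "clone2"}),
    "T": frozenset({"clone1", "clone2"}),
}
clone1_mgi = {
    "R": frozenset(),
    "N": frozenset({"Shh"}),
    "T": frozenset({"Shh"}),
}

print(status_code(clone1_riken), number_code(clone1_riken))
print(status_code(clone1_mgi), number_code(clone1_mgi))

# Bottom set: Mm.12 and TC:22 overlap, so clones 3 through 6 share one union.
print(cluster_unions([{"clone3", "clone4", "clone5"},
                      {"clone3", "clone4", "clone6"}]))
```

For clone 1, the sketch yields the RIKEN codes -NT and (1, 2, 2) and the MGI codes -NT and (0, 1, 1), matching the values given in the caption.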
Figure 3
Combining cluster visualization tools with the MGI FANTOM2 data table for accurate integration in MGI. (A) Alignment view display, a cluster visualization tool available at the FANTOM2 Web interface. FANTOM2 sequences grouped in RIKEN cluster (locus ID) 22339 are shown as colored bars. RIKEN clone IDs are shown to the left of each sequence, as are the corresponding row numbers for the sequences in the MGI FANTOM2 table in B. Sequence alignments are with respect to the top sequence (black), as are various features, including sequence similarity (color-coded as shown) and gaps. The green arrows above the sequences represent predicted CDS regions (shown). The gaps in sequences 5 and 6 (intron) reveal the presence of an unspliced intron in sequences 3 and 8. Note truncation of the CDS at this position in sequences 3 and 8. Sequences 5 and 6 are properly spliced. Sequences 3, 4, and 7 are partial transcripts. Non-RIKEN sequences are not shown in this view. (B) MGI FANTOM2 data table display of the FANTOM2 sequences in A and two non-RIKEN sequences (blue) included in this cluster union (R Cluster 3268). Rows and columns correspond to sequences and sequence features, respectively. Rows are color-coded to reflect sequence origin or other status (as shown). Sequences 3 and 8 are marked as problem sequences because they contain an unprocessed intron (Seq Qual: Problem-in). Sequence 6 was selected as the representative clone (Seq Note: Representative). Sequences 1 and 3 were associated with MGI gene Dnajc5 before the FANTOM2 load, sequence 4 with MGI gene 2610314I24Rik (RA symbol). All sequences are associated with MGI gene Dnajc5 after the FANTOM2 load (Final symbol 1). (C) Integration in MGI. 
The FANTOM1 clone 2610314I24 (sequence 4 in A, B) does not overlap the coding region of Dnajc5 and was represented as a unique MGI gene during the FANTOM1 load (Symbol: 2610314I24Rik), whereas FANTOM1 clone 1810057D19 (sequence 3 in A, B), which does overlap the CDS, was associated with the Dnajc5 gene. FANTOM2-new sequences reveal that sequence 4 is actually derived from the 3′-UTR region of Dnajc5 and that sequence 3 contains an intron that truncates the CDS. This information triggered a merge in MGI, in which the 2610314I24Rik gene was withdrawn to equal Dnajc5. The MGI accession ID for the previous gene (MGI:1919766) becomes a secondary accession ID for the Dnajc5 gene (shown), and all information previously associated with 2610314I24Rik was migrated to Dnajc5. The nomenclature history for the Dnajc5 gene details this event. The molecular segment record for clone D030049H18 (sequence 8 in A, B), an intron-containing transcript (problem sequence) is shown. A note is attached to molecular segment records of problem sequences to inform users that the sequence has been judged by curators to have some type of problem. Key to MGI FANTOM2 table columns (see Methods for descriptions): SeqID indicates RIKEN Seqid; clone ID, RIKEN cloneid; GenBank ID, DDBJ/EMBL/GenBank seqid; RA MGI ID, MGI ID to which the sequence was associated before the FANTOM2 load; RA symbol, gene symbol corresponding to the RA MGI ID; Seq length, sequence length (bp); locus ID, RIKEN cluster ID; UniGene ID, NCBI UniGene cluster ID; TIGR TC, TIGR cluster ID; R cluster, cluster union ID; locus stat, RIKEN status code; RIKEN #, RIKEN number code; MGI status, MGI status code; MGI #, MGI number code; BLAST group ID; Seq qual, sequence quality; Seq note, sequence note (to designate Representative clone); final MGI ID, MGI ID to which the sequence is associated after the FANTOM2 load; and final symbol 1, gene symbol corresponding to the Final MGI ID.
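The gene merge described in panel C can be sketched as a small data-structure operation. This is an illustration only, not MGI's actual schema or code. The `GeneRecord` class, the `merge_genes` function, and the wording of the history entry are hypothetical, and the primary accession ID used for Dnajc5 is a placeholder because the caption does not state it.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class GeneRecord:
    symbol: str
    primary_id: str
    secondary_ids: List[str] = field(default_factory=list)
    sequences: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)
    withdrawn: bool = False


def merge_genes(withdrawn: GeneRecord, kept: GeneRecord) -> GeneRecord:
    """Withdraw one gene record and fold its data into the surviving record."""
    # The withdrawn gene's accession IDs stay resolvable as secondary IDs.
    kept.secondary_ids.append(withdrawn.primary_id)
    kept.secondary_ids.extend(withdrawn.secondary_ids)
    # Sequence associations migrate to the surviving gene.
    kept.sequences |= withdrawn.sequences
    withdrawn.sequences = set()
    withdrawn.withdrawn = True
    # Record the event in the surviving gene's nomenclature history.
    kept.history.append(f"{withdrawn.symbol} withdrawn, = {kept.symbol}")
    return kept


rik = GeneRecord("2610314I24Rik", "MGI:1919766", sequences={"2610314I24"})
# "MGI:PLACEHOLDER" is a hypothetical stand-in; the caption does not give this ID.
dnajc5 = GeneRecord("Dnajc5", "MGI:PLACEHOLDER", sequences={"1810057D19"})

merge_genes(rik, dnajc5)
print(dnajc5.secondary_ids, sorted(dnajc5.sequences))
```

Keeping the withdrawn accession ID as a secondary ID means that existing links to MGI:1919766 can still resolve to the surviving Dnajc5 record.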
Figure 4
Gene Representation in MGI and LocusLink. (A) Emerging representation of the flexed tail gene (sideroflexin 1, Sfxn1). A gene record for the flexed tail (f) mouse mutation, described by Hunt et al. (1933), is created in MGI. Over time, MGI captures published information about the flexed tail locus; however, no sequence information is available. Clone 2810002O05, a novel mouse cDNA sequence, is released with the FANTOM1 data, and a gene record is created in MGI and LocusLink for the sequence. Sequence-based annotations (GO terms, protein domains, UniGene) are associated with gene 2810002O05Rik, and the MGI/LocusLink coordinated data exchange begins. LocusLink creates a RefSeq for the gene. After release of FANTOM1 data, Fleming et al. (2001) report the cloning of flexed tail and its sequence. Sequence analysis reveals that the flexed tail sequence is identical to the FANTOM1 cDNA. Gene 2810002O05Rik is merged with the flexed tail gene, and based on Fleming et al. (2001), the gene is renamed sideroflexin 1 (Sfxn1), for the siderocytic anemia and flexed tail phenotypes observed in mutant mice (see Fig. 4B). (B) Current representation of the Sfxn1 gene record in MGI and LocusLink, demonstrating the types of information integrated with sequences at the two resources. Wide arrows indicate data types shared between MGI and LocusLink, and the direction of transfer. MGI and LocusLink also exchange gene name synonyms and corresponding gene record identifiers. 
Hypertext links to various annotations and data are provided at both resources: official mouse gene nomenclature (MGI provides to LocusLink; A), mapping information (reconciled between MGI and LocusLink; B), allele and phenotype information (MGI; C), polymorphisms (LocusLink provides links to dbSNP, data not shown; D), gene ontology (MGI provides to LocusLink; E), homology information (MGI provides curated mammalian orthology data; F), expression (MGI; G), UniGene (H), LocusLink/MGI reciprocal links (I), mouse genome annotations (J), protein domains (also at LocusLink, data not shown; K), Database of Transcribed Sequences (DoTS, MGI; L), TIGR Mouse Gene Index (MGI; M), mRNA-genome alignments (LocusLink; N), references (O), RefSeqs (LocusLink provides to MGI; P), and sequences (exchanged between MGI and LocusLink; Q).
Figure 5
Representation of a novel FANTOM2 gene in MGI. Detail pages for the gene and molecular segment objects and for the sequence summary report are shown. Novel FANTOM2 gene nomenclature incorporates the RIKEN clone IDs of representative sequences from clusters. Clusters of sequences for the same gene are represented by associating the sequence identifiers and molecular segment records of all cluster members to the gene record (of the 37 molecular segments of type cDNA for this gene shown, five are FANTOM2 clones; the rest are IMAGE cDNAs associated with gene 0910001B06Rik via UniGene cluster 28470). Molecular segment records for FANTOM2 clones contain clone library source information, and they link to the FANTOM2 annotation pages for the corresponding sequences.

References

    1. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology: The Gene Ontology Consortium. Nat Genet. 25: 25-29.
    1. Blake, J.A., Richardson, J.E., Bult, C.J., Kadin, J.A., Eppig, J.T., and the Mouse Genome Database Group. 2002. The Mouse Genome Database (MGD): The model organism database for the laboratory mouse. Nucleic Acids Res. 30: 113-115.
    1. Bult, C.J., Richardson, J.E., Blake, J.A., Kadin, J.A., Ringwald, M., Eppig, J.T., and the Mouse Genome Database Group. 2000. Mouse genome informatics in a new age of biological inquiry. Proceedings of the IEEE International Symposium on Bio-Informatics and Biomedical Engineering pp. 29-32. IEEE Computer Society, Los Alamitos, California.
    1. Carninci, P., Waki, K., Shiraki, T., Konno, H., Shibata, K., Itoh, M., Aizawa, K., Arakawa, T., Ishii, Y., Sasaki, D., et al. 2003. Targeting a complex transcriptome: The construction of the mouse full-length cDNA encyclopedia. Genome Res. (this issue).
    1. Fleming, M.D., Campagna, D.R., Haslett, J.N., Trenor III, C.C., and Andrews, N.C. 2001. A mutation in a mitochondrial transmembrane protein is responsible for the pleiotropic hematological and skeletal phenotype of flexed-tail (f/f) mice. Genes & Dev. 15: 652-657.

WEB SITE REFERENCES

    1. http://www.ncbi.nih.gov/; National Center for Biotechnology Information.
    1. ftp://ftp.informatics.jax.org/pub/reports/MGI_ProblemSequence.rpt; Mouse Genome Informatics FTP site.
    1. ftp://ftp.informatics.jax.org/pub/informatics/reports; Mouse Genome Informatics FTP site.
    1. ftp://ftp.ncbi.nih.gov/refseq/LocusLink/; LocusLink FTP site.
    1. http://www.ncbi.nih.gov/mapview/; National Center for Biotechnology Information Map Viewer.
