Nucleic Acids Res. 2005 Jan 1;33(Database issue):D641-6.
doi: 10.1093/nar/gki115.

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts

Sébastien Aubourg et al. Nucleic Acids Res. 2005.

Abstract

Genomic projects depend heavily on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach, since many features obtained for one gene can be extrapolated to some or all of the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, a collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot.

Figures

Figure 1
Distribution of the gene families in the GeneFarm database according to the number of annotated paralogs in the Arabidopsis thaliana genome.
Figure 2
Distribution of the genes annotated in the GeneFarm database according to their scores at the structural and functional levels. The structural score depends on the origin of the annotated intron–exon structure: s1, prediction software only; s2, prediction software and similarities with homologous genes; s3, the gene structure is partially covered by a transcript (EST, RT–PCR product, etc.); s4, the whole CDS is covered by a transcript; and s5, a cognate full-length cDNA is available (TSS and UTR are known). The functional score: f1, unknown function (no information); f2, some predicted clues (motif, signal, etc.); f3, similarities with a known gene; f4, biochemical function proved; and f5, biological function experimentally shown.
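The two scoring scales in this caption can be read as ordered levels of evidence. The following is a minimal sketch of that reading, not the GeneFarm schema: the class names, the `score_distribution` helper and the gene records are hypothetical and serve only to show how a distribution like the one in Figure 2 could be tallied.

```python
from collections import Counter
from enum import IntEnum


class StructuralScore(IntEnum):
    """Origin of the annotated intron-exon structure (s1 to s5)."""
    S1 = 1  # prediction software only
    S2 = 2  # prediction software and similarities with homologous genes
    S3 = 3  # gene structure partially covered by a transcript
    S4 = 4  # whole CDS covered by a transcript
    S5 = 5  # cognate full-length cDNA available (TSS and UTR known)


class FunctionalScore(IntEnum):
    """Level of functional evidence (f1 to f5)."""
    F1 = 1  # unknown function (no information)
    F2 = 2  # some predicted clues (motif, signal, etc.)
    F3 = 3  # similarities with a known gene
    F4 = 4  # biochemical function proved
    F5 = 5  # biological function experimentally shown


def score_distribution(genes):
    """Count genes for each (structural, functional) score pair."""
    return Counter((structural, functional) for _, structural, functional in genes)


# Hypothetical records for illustration only; these are not GeneFarm data.
genes = [
    ("gene_a", StructuralScore.S5, FunctionalScore.F4),
    ("gene_b", StructuralScore.S2, FunctionalScore.F3),
    ("gene_c", StructuralScore.S5, FunctionalScore.F4),
]
distribution = score_distribution(genes)
```

Using ordered integer levels lets records be filtered by a minimum evidence threshold, for example keeping only genes with a structural score of at least `StructuralScore.S4`.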
Figure 3
Examples of corrections to TIGR annotations proposed by GeneFarm. (A) Fusion of two PPR genes revealed by a detailed definition of the repeat motifs (4 different matrices have been defined by GeneFarm annotators to exhaustively tag all the repeat motifs of the PPR family), presence of C-terminal DYW motifs and cognate transcripts. (B) The consequence of this fusion of a PPR gene with a downstream gene is the attribution of a function on the basis of the presence of Pfam motifs PF03765 and PF00650. GeneFarm suggests two genes instead of one based on the presence of a C-terminal DYW motif in the first gene. The second gene has not been re-annotated in the framework of GeneFarm. (C) Gene fusion and erroneous exon boundaries. The GeneFarm corrections are supported by the fact that the gene model is shared by other members of the CYP sub-group, a cognate EST and better scores with the Pfam motif PF00067. Blue arrows and lines: CDS exons and introns, respectively. Brown arrows: Pfam motifs mapped to exons. Pink arrows: transcript sequences. Other arrows: different types of PPR repeats.
