GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts

PMID: 15608279
PMCID: PMC540069
DOI: 10.1093/nar/gki115

Sébastien Aubourg et al. Nucleic Acids Res. 2005.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D641-6.

Affiliation

¹ Unité de Recherche en Génomique Végétale (INRA/CNRS/UEVE), 2 Rue Gaston Crémieux, CP 5708, 91057 Evry Cedex, France. aubourg@evry.inra.fr

Abstract

Genomic projects depend heavily on genome annotations and are limited by current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Family-wide annotation makes the task easier and more efficient than a gene-by-gene approach, since many features obtained for one gene can be extrapolated to some or all of the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface for querying the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, a collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot.

Figures

**Figure 1**
Distribution of the gene families in the GeneFarm database according to the number of annotated paralogs in the *Arabidopsis thaliana* genome.

**Figure 2**
Distribution of the genes annotated in the GeneFarm database according to their scores at the structural and functional levels. The structural score depends on the origin of the annotated intron–exon structure: s1, prediction software only; s2, prediction software and similarities with homologous genes; s3, the gene structure is partially covered by a transcript (EST, RT–PCR product, etc.); s4, the whole CDS is covered by a transcript; and s5, a cognate full-length cDNA is available (TSS and UTR are known). The functional score: f1, unknown function (no information); f2, some predicted clues (motif, signal, etc.); f3, similarities with a known gene; f4, biochemical function proved; and f5, biological function experimentally shown.
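
The two scoring scales in this caption can be read as ordered evidence levels. The sketch below is only an illustration of that reading, not the GeneFarm implementation: the `GeneEvidence` class, its field names, and the rule that the strongest available evidence determines the score are all assumptions made for this example, and the sample genes are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical evidence flags for one annotated gene (illustrative names)."""
    # Structural evidence (intron-exon structure)
    homolog_similarity: bool = False    # supports s2
    partial_transcript: bool = False    # supports s3 (EST, RT-PCR product, etc.)
    full_cds_transcript: bool = False   # supports s4
    full_length_cdna: bool = False      # supports s5 (TSS and UTR known)
    # Functional evidence
    predicted_clues: bool = False       # supports f2 (motif, signal, etc.)
    similar_to_known_gene: bool = False  # supports f3
    biochemical_function: bool = False  # supports f4
    biological_function: bool = False   # supports f5


def structural_score(ev: GeneEvidence) -> int:
    """Return 1..5 for s1..s5; assumes the strongest evidence wins."""
    if ev.full_length_cdna:
        return 5
    if ev.full_cds_transcript:
        return 4
    if ev.partial_transcript:
        return 3
    if ev.homolog_similarity:
        return 2
    return 1  # s1: prediction software only


def functional_score(ev: GeneEvidence) -> int:
    """Return 1..5 for f1..f5; assumes the strongest evidence wins."""
    if ev.biological_function:
        return 5
    if ev.biochemical_function:
        return 4
    if ev.similar_to_known_gene:
        return 3
    if ev.predicted_clues:
        return 2
    return 1  # f1: unknown function


def score_distribution(genes):
    """Count genes per (structural, functional) score pair."""
    return Counter((structural_score(g), functional_score(g)) for g in genes)


# Invented sample genes, for illustration only.
sample = [
    GeneEvidence(),
    GeneEvidence(partial_transcript=True, similar_to_known_gene=True),
    GeneEvidence(full_length_cdna=True, biological_function=True),
    GeneEvidence(partial_transcript=True, similar_to_known_gene=True),
]
distribution = score_distribution(sample)
```

Tallying genes by score pair in this way gives the kind of distribution the figure describes; the real database may assign scores through curator judgment rather than a fixed rule.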

**Figure 3**
Examples of corrections to TIGR annotations proposed by GeneFarm. (A) Fusion of two PPR genes, revealed by a detailed definition of the repeat motifs (four different matrices have been defined by GeneFarm annotators to exhaustively tag all the repeat motifs of the PPR family), the presence of C-terminal DYW motifs and cognate transcripts. (B) Fusion of a PPR gene with a downstream gene, which led to a function being attributed on the basis of the Pfam motifs PF03765 and PF00650. GeneFarm suggests two genes instead of one, based on the presence of a C-terminal DYW motif in the first gene. The second gene has not been re-annotated in the framework of GeneFarm. (C) Gene fusion and erroneous exon boundaries. The GeneFarm corrections are supported by the fact that the gene model is shared by other members of the CYP sub-group, by a cognate EST and by better scores with the Pfam motif PF00067. Blue arrows and lines: CDS exons and introns, respectively. Brown arrows: Pfam motifs mapped to exons. Pink arrows: transcript sequences. Other arrows: different types of PPR repeats.
