Nucleic Acids Res. 2009 Jun;37(11):e78.

doi: 10.1093/nar/gkp295. Epub 2009 May 8.

TARGeT: a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences

Yujun Han¹, James M Burnette 3rd, Susan R Wessler

Affiliation

¹ Department of Plant Biology, University of Georgia, Athens, GA 30602, USA.

PMID: 19429695
PMCID: PMC2699529
DOI: 10.1093/nar/gkp295

Abstract

Gene families compose a large proportion of eukaryotic genomes. The rapidly expanding genomic sequence database provides a good opportunity to study gene family evolution and function. However, most gene family identification programs are restricted to searching protein databases where data are often lagging behind the genomic sequence data. Here, we report a user-friendly web-based pipeline, named TARGeT (Tree Analysis of Related Genes and Transposons), which uses either a DNA or amino acid 'seed' query to: (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high speed, TARGeT is also able to characterize very large gene families, including transposable elements (TEs). We evaluated TARGeT using well-annotated datasets, including the ascorbate peroxidase gene family of rice, maize and sorghum and several TE families in rice. In all cases, TARGeT rapidly recapitulated the known homologs and predicted new ones. We also demonstrated that TARGeT outperforms similar pipelines and has functionality that is not offered elsewhere.

Figures

**Figure 1.**
Map of the five main steps of the TARGeT pipeline. Users are able to inspect the results of each step before going on to the next step. (A) Preparation of the query when more than one sequence is being used. This is an optional step and its output is shown in Figure 2. (B) BLAST search. Results are shown in Figure 3. (C) Homolog identification by PHI. The algorithm is explained in Figure 4 and the result of this step is shown in Figure 5. (D) Multiple alignment. (E) Tree building.

**Figure 2.**
Multiple alignment of *Arabidopsis* APx protein sequences. Sequences in the boxed region were extracted to form the query sequences. APx7 was not included because it aligns poorly.

**Figure 3.**
TARGeT output provides a rough visualization of the BLAST result. The X-axis is the length of the query; the Y-axis is the number of BLAST HSPs. The gray gradient shows the similarity, which is calculated by dividing the sum of identities and similarities by the number of aligned amino acids along the HSP. Darker shading represents higher similarity at that position.
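As a rough illustration of the similarity measure described in the Figure 3 legend, the sketch below computes it from a BLAST-style midline string. It assumes the midline convention of protein BLAST output (a letter for an identity, `+` for a similar residue, a space otherwise) and that the denominator is the full alignment length, gap columns included. The function name `hsp_similarity` is hypothetical and is not part of TARGeT.

```python
def hsp_similarity(midline: str) -> float:
    """Return (identities + similarities) / aligned length for one HSP.

    Assumption: `midline` follows the protein BLAST convention, where a
    letter marks an identical residue, '+' marks a similar residue and a
    space marks a mismatch or gap column.
    """
    if not midline:
        raise ValueError("midline must not be empty")
    identities = sum(1 for ch in midline if ch.isalpha())
    similarities = midline.count("+")
    return (identities + similarities) / len(midline)


# Three identities, one similar residue, one mismatch over five columns.
example = hsp_similarity("MK+ L")
```

Whether the shading in the TARGeT output is computed per HSP or per window along the query is not stated in the legend, so this sketch only covers the whole-HSP case.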

**Figure 4.**
The sorting and refinement stages of the PHI program. See the text for details. (A) In the grouping stage, alignments are sorted and grouped. Dark bars are queries and colored bars are homologs. Each group corresponds to one putative homolog. The green group is shown in detail to illustrate potential problems. (B) Two overlapping HSPs together with six possible alternative positions are shown. The separation that produces the highest score in the overlapping region is noted with a red check. (C) An HSP that includes an intron. The intron is detected and cut out by PHI, resulting in two separated HSPs. Red asterisks represent premature stop codons. (D) Graphical presentation of the result after the refinement stage. There is no overlap between HSPs 1 and 2. HSP 3 in (C) is separated by the small intron into new HSPs 3′ and 4′. An additional exon (5) was found and is shown in pink.
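The overlap resolution in panel (B) can be pictured as choosing one cut point inside the overlapping region so that the combined score is as high as possible. The following is a minimal sketch of that idea, not PHI's actual implementation: it assumes both HSPs have already been reduced to per-column scores over the same overlapping query positions, and the names `best_split`, `scores_a` and `scores_b` are hypothetical.

```python
from itertools import accumulate


def best_split(scores_a, scores_b):
    """Pick the cut point that maximizes the score in an overlap.

    scores_a[i] and scores_b[i] are the (assumed) per-column scores of
    the upstream and downstream HSP at overlap column i. Columns before
    the cut are assigned to HSP A, columns from the cut onward to HSP B.
    Returns (cut_index, combined_score); ties keep the earliest cut.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both HSPs must cover the same overlap columns")
    prefix_a = [0, *accumulate(scores_a)]  # prefix_a[k] = sum(scores_a[:k])
    prefix_b = [0, *accumulate(scores_b)]  # prefix_b[k] = sum(scores_b[:k])
    total_b = prefix_b[-1]
    best_cut, best_score = 0, None
    for cut in range(len(scores_a) + 1):
        score = prefix_a[cut] + (total_b - prefix_b[cut])
        if best_score is None or score > best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# HSP A scores well at the start of the overlap, HSP B at the end.
cut, score = best_split([5, 4, -1, -2], [-3, 1, 4, 6])
```

With an overlap of n columns there are n + 1 candidate cuts, and the prefix sums let each one be scored in constant time.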

**Figure 5.**
TARGeT output of the gene structure of rice APx family members. (A) Exon-intron structure of 11 reliable rice APx homologs detected by TARGeT. All 46 putative homologs are in Supplementary Figure 1. (B) A larger figure of TOAPx_9 from (A). Query and subject names are shown on the left. ‘+’ or ‘−’ indicates the strand of the hit. Unmatched query regions at the ends of each homolog are in blue. Black or gray gradient bars represent the exons. Darker represents higher similarity. Numbers flanking each gene structure are positions of the subject, while numbers above and below the exons are the positions of the query. Red numbers indicate discontinuous predicted exons. Putative new APx homologs are indicated by ‘*’.

**Figure 6.**
An unrooted phylogenetic tree of all rice APx family members predicted by TARGeT. Previously characterized APx gene names are in brackets. The shaded region contains the true rice APx homologs. Bootstrap values greater than 70 are shown.

**Figure 7.**
An unrooted phylogenetic tree of the APx homologs of rice, maize, sorghum and *Arabidopsis*. This tree was generated with MEGA version 4 using the neighbor-joining method with pairwise deletion and p-distance. Five main clades are labeled from A to E. A main clade is defined as a minimal group of homologs that can be found in all species. The remaining homologs are classified into orphan clades O1–O3. Bootstrap values higher than 70 are shown.
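For readers unfamiliar with the distance settings named in the Figure 7 legend, the sketch below shows the standard definition of p-distance under pairwise deletion: for each pair of aligned sequences, columns where either sequence has a gap or missing character are skipped, and the distance is the fraction of remaining columns that differ. This is an illustrative re-implementation, not MEGA's code; the function name and the choice of `-` and `?` as missing symbols are assumptions.

```python
def p_distance(seq1: str, seq2: str, missing: str = "-?") -> float:
    """Proportion of differing sites between two aligned sequences.

    Pairwise deletion: a column is ignored for this pair only if either
    sequence has a gap or missing-data symbol there.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in missing or b in missing:
            continue  # skip this column for this pair only
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differences / compared


# Five comparable columns (the gapped column is skipped), two differ.
d = p_distance("MKT-LV", "MRTALI")
```

A matrix of such pairwise distances is the input that a neighbor-joining procedure would then cluster into a tree.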

**Figure 8.**
A rooted phylogenetic tree of predicted rice Tc1/*mariner* transposases. Three clades (A, B and C) are defined using the phylogenetic tree generated by Feschotte and Wessler (56). Elements denoted by an asterisk are new transposases predicted by TARGeT. Soymar1 was used as an outgroup and the tree was rooted manually using TreeView. Bootstrap values greater than 70 are shown.

References

1. Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002;296:79–92.
2. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002;296:92–100.
3. Li WH, Gu Z, Wang H, Nekrutenko A. Evolutionary analyses of the human genome. Nature. 2001;409:847–849.
4. Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, et al. Comparative genomics of the eukaryotes. Science. 2000;287:2204–2215.
5. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.

Grants and funding

52005731/Howard Hughes Medical Institute/United States
