Comparative Study
BMC Genomics. 2006 Sep 18:7:236.
doi: 10.1186/1471-2164-7-236.

Comparison of methods for genomic localization of gene trap sequences


Courtney A Harper et al. BMC Genomics. 2006.

Abstract

Background: Gene knockouts in a model organism such as mouse provide a valuable resource for the study of basic biology and human disease. Determining which gene has been inactivated by an untargeted gene trapping event poses a challenging annotation problem because gene trap sequence tags, which represent sequence near the vector insertion site of a trapped gene, are typically short and often contain unresolved residues. To better understand the localization of these sequences on the mouse genome, we compared stand-alone versions of the alignment programs BLAT, SSAHA, and MegaBLAST. A set of 3,369 sequence tags was aligned to build 34 of the mouse genome using default parameters for each algorithm. Known genome coordinates for the cognate set of full-length genes (1,659 sequences) were used to evaluate localization results.

Results: In general, all three programs performed well in terms of localizing sequences to a general region of the genome, with only relatively subtle errors identified for a small proportion of the sequence tags. However, large differences in performance were noted with regard to correctly identifying exon boundaries. BLAT correctly identified the vast majority of exon boundaries, while SSAHA and MegaBLAST missed most of them. SSAHA consistently reported the fewest false positives and was the fastest algorithm. MegaBLAST was comparable to BLAT in speed, but was the most susceptible to localizing sequence tags incorrectly to pseudogenes.

Conclusion: The differences in performance for sequence tags and full-length reference sequences were surprisingly small. Each program showed characteristic variations in its localization results, particularly in how sequence was localized at exon boundaries.

Figures

Figure 1
Recall and precision for each localization algorithm. Results for SSAHA are shown in red, MegaBLAST in blue, and BLAT in green. The first column represents the recall obtained with full-length gene query sequences. The second column shows the recall obtained with sequence tag queries. The third and fourth columns display the precision of each algorithm when used to localize full-length genes and sequence tags, respectively. (A) Recall and precision at the level of the gene, as measured by overlap of at least one nucleotide between a set of localizations by an algorithm and the region of the genome containing the gene. Cyan lines indicate the recall and precision achieved when only the top hit is considered. (B) Exon recall and precision, as measured by an overlap of at least one nucleotide between the known localization of an exon and a match. Sequence tags are shorter than full-length genes and therefore typically contain sufficient sequence information to match only a few exons of any gene, leading to low recall at the exon and nucleotide levels. This does not indicate failure by the localization programs. (C) Nucleotide recall and precision, as measured by a match between a nucleotide in the known localization of a gene and a nucleotide from a query sequence localization.
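The overlap criteria in this caption can be illustrated with a small sketch. It is not the authors' evaluation code: the half-open `(start, end)` interval representation, the function names, and the reading of precision as "the fraction of reported matches that overlap a known exon" are all assumptions made for illustration, and all intervals are taken to lie on one chromosome.

```python
from typing import List, Tuple

# Assumption: half-open (start, end) coordinates on a single chromosome.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one nucleotide."""
    return a[0] < b[1] and b[0] < a[1]


def exon_recall_precision(known: List[Interval],
                          matches: List[Interval]) -> Tuple[float, float]:
    """Exon-level scores using the one-nucleotide overlap criterion.

    Recall: fraction of known exons overlapped by any match.
    Precision (assumed definition): fraction of matches overlapping any known exon.
    """
    found = sum(1 for k in known if any(overlaps(k, m) for m in matches))
    correct = sum(1 for m in matches if any(overlaps(m, k) for k in known))
    recall = found / len(known) if known else 0.0
    precision = correct / len(matches) if matches else 0.0
    return recall, precision


def nucleotide_recall_precision(known: List[Interval],
                                matches: List[Interval]) -> Tuple[float, float]:
    """Nucleotide-level scores: compare the sets of covered positions."""
    known_pos = {p for start, end in known for p in range(start, end)}
    match_pos = {p for start, end in matches for p in range(start, end)}
    shared = len(known_pos & match_pos)
    recall = shared / len(known_pos) if known_pos else 0.0
    precision = shared / len(match_pos) if match_pos else 0.0
    return recall, precision


# Hypothetical coordinates, chosen only to exercise the functions.
known_exons = [(100, 200), (300, 400), (500, 600)]
reported = [(150, 210), (390, 395), (900, 950)]

exon_scores = exon_recall_precision(known_exons, reported)
nt_scores = nucleotide_recall_precision(known_exons, reported)
print(exon_scores, nt_scores)
```

In this toy input a five-nucleotide match counts as a full exon hit at the exon level but contributes very little at the nucleotide level, which is why the two levels can diverge sharply for short sequence tags.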
Figure 2
An example of localization to a pseudogene. Localization results for the full-length gene encoding mitotic arrest deficient 1-like 1 (Mad1l1), GenBank accession NM_010752. All representations of alignments between query sequences and build 34 of the mouse genome were made using the UCSC Genome Browser Custom Tracks feature. Slight alterations have been made to the representations, including the removal of graphical elements to improve the clarity of the figure, but no changes were made to the alignments. (A) The coordinates of the known gene on the genome are listed at the top, and positions of exons are represented by colored blocks. A region of chromosome 5 is shown containing the known localization of NM_010752 (the Known Genes track at bottom) and the alignments of exons for NM_010752 to the genome by SSAHA, MegaBLAST, and BLAT. (B) A region of chromosome 9 containing a pseudogene related to NM_010752 is shown on the same scale as (A). Below this, the segment of chromosome 9 containing the pseudogene is enlarged. The highest-scoring MegaBLAST match, circled in cyan, localizes to this pseudogene rather than the real gene. The highest scoring matches returned by SSAHA and BLAT are located on chromosome 5 and overlap with the correct localization.
Figure 3
A representative genome alignment of a full-length gene and a sequence tag. The full-length gene encoding chromatin assembly factor 1, subunit A (Chaf1a), NCBI accession NM_013733, and the sequence tag BG-RRR265 align to a region of chromosome 17. (A) Overview showing the full region of the genome spanned by Chaf1a. Segments enlarged in parts B and C are marked above the genome position. (B) Regions of the genome that have been removed from the search space by repeat masking are shown in yellow, superimposed on the known gene track. The removal of these regions prevents correct localization of the full-length gene and sequence tag for these exons. (C) Magnification of the exon from region C illustrates differences between the alignment programs in aligning sequence to the edges of exons.
Figure 4
A summary of the alignments by each program to the edges of exons. A representation of an exon is shown at top, with a representation of the three possible match outcomes below, i.e., an exact match to the exon boundary, a match that ends before the exon boundary, and a match that extends beyond the exon boundary. The percentages of all matches by each program that fall into those categories are depicted as bar graphs. Left: Percentage of matches correctly aligned to either exon boundary. Middle and right: Percentage of matches incorrectly aligned to an exon boundary, with the match ending before or extending beyond a boundary, respectively.
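The three boundary outcomes described in this caption can be sketched as a simple classification of each match edge against the corresponding exon edge. This is an illustrative sketch only: the labels `exact`, `short`, and `beyond`, the function names, and the `(start, end)` coordinate convention are assumptions rather than anything taken from the paper, and each match is assumed to be already paired with the exon it overlaps.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed (start, end) on the same chromosome


def _label(delta: int) -> str:
    """delta > 0: match stops inside the exon; delta < 0: match runs past it."""
    if delta == 0:
        return "exact"
    return "short" if delta > 0 else "beyond"


def classify_boundaries(exon: Interval, match: Interval) -> Tuple[str, str]:
    """Return the outcome at the left and right exon edges for one match."""
    exon_start, exon_end = exon
    match_start, match_end = match
    left = _label(match_start - exon_start)
    right = _label(exon_end - match_end)
    return left, right


def boundary_percentages(pairs: List[Tuple[Interval, Interval]]) -> Dict[str, float]:
    """Percentage of all match edges falling into each outcome category."""
    counts: Counter = Counter()
    for exon, match in pairs:
        counts.update(classify_boundaries(exon, match))
    total = sum(counts.values())
    return {k: 100.0 * counts[k] / total if total else 0.0
            for k in ("exact", "short", "beyond")}


# Hypothetical exon/match pairs, for illustration only.
pairs = [
    ((100, 200), (100, 200)),  # both edges exact
    ((100, 200), (105, 200)),  # left edge ends before the boundary
    ((100, 200), (100, 210)),  # right edge extends beyond the boundary
]
summary = boundary_percentages(pairs)
print(summary)
```

Counting edges rather than whole matches mirrors the caption's "either exon boundary" wording, though whether the original analysis tallied edges or matches is not stated here.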
