Genome Res. 2003 Jan;13(1):46-54.
doi: 10.1101/gr.830003.

Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map

Paul Flicek et al. Genome Res. 2003 Jan.

Abstract

The availability of draft sequences for both the mouse and human genomes makes it possible, for the first time, to annotate whole mammalian genomes using comparative methods. TWINSCAN is a gene-prediction system that combines the methods of single-genome predictors like GENSCAN with information derived from genome comparison, thereby improving accuracy. Because TWINSCAN uses genomic sequence only, it is less biased toward highly and/or ubiquitously expressed genes than GENEWISE, GENOMESCAN, and other methods based on evidence derived from transcripts. We show that TWINSCAN improves gene prediction in human using intermediate products from various stages of the sequencing and analysis of the mouse genome, from low-redundancy, whole-genome shotgun reads to the draft assembly and the synteny map. TWINSCAN improves on the prior state of the art even when alignments from only 1X coverage of the mouse genome are available. Gene prediction accuracy improves steadily from 1X through 3X, more slowly from 3X to 4X, and relatively little thereafter. The assembly and the synteny map greatly speed the computations, however. Our human annotation using the mouse assembly is conservative, predicting only 25,622 genes, and appears to be one of the best de novo annotations of the human genome to date.

Figures

Figure 1.
Exact gene accuracy with respect to aligned RefSeq transcripts, as a function of mouse genome sequence aligned.
Figure 2.
Characteristics of alignments of various mouse sequences to the human genome. Bars indicate the percentage of the human genome covered using our alignment procedure. Diamonds indicate the conditional uncertainty of the annotation given the alignments. Lower conditional uncertainty corresponds to more informative alignments.
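The "conditional uncertainty" in Figure 2 can be read as a conditional entropy: the uncertainty left about the annotation once the alignment is known. The caption does not give the exact definition or the label alphabets used, so the following is only a minimal sketch of the textbook quantity H(annotation | alignment), estimated from paired per-position labels. The label strings are hypothetical toy data, not values from the paper.

```python
from collections import Counter
from math import log2


def conditional_entropy(annotation, alignment):
    """Estimate H(annotation | alignment) in bits from paired per-position labels."""
    if len(annotation) != len(alignment):
        raise ValueError("label sequences must have equal length")
    n = len(annotation)
    joint = Counter(zip(alignment, annotation))  # counts of (x, y) pairs
    marginal = Counter(alignment)                # counts of x alone
    h = 0.0
    for (x, _y), count in joint.items():
        p_xy = count / n
        p_y_given_x = count / marginal[x]
        h -= p_xy * log2(p_y_given_x)
    return h


# Hypothetical labels: C = coding, N = non-coding; A = aligned, U = unaligned.
annotation = "CCCCNNNN"
informative = "AAAAUUUU"    # alignment tracks the annotation exactly
uninformative = "AUAUAUAU"  # alignment is independent of the annotation

print(conditional_entropy(annotation, informative))    # 0 bits: nothing left to learn
print(conditional_entropy(annotation, uninformative))  # 1 bit: alignment tells us nothing
```

In this toy case the informative alignment leaves no uncertainty about the annotation, while the uninformative one leaves the full 1 bit, which matches the caption's reading that lower values mean more informative alignments.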
Figure 3.
Accuracy of GENSCAN and TWINSCAN by the exact gene, exact exon, and coding nucleotide measures. TWINSCAN predictions use alignments from the draft mouse assembly.
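As a rough illustration of the exact exon measure, the sketch below counts a predicted exon as correct only if both of its boundaries match an annotated exon. It assumes the convention common in gene-prediction evaluation, where sensitivity is the fraction of annotated exons predicted exactly and specificity is the fraction of predicted exons that are exactly right. The caption does not confirm that the paper uses these definitions, and the coordinates are hypothetical.

```python
def exact_match_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for exact-boundary matches.

    Assumed convention: sensitivity = exact / annotated,
    specificity = exact / predicted.
    """
    predicted, annotated = set(predicted), set(annotated)
    exact = predicted & annotated
    sensitivity = len(exact) / len(annotated) if annotated else 0.0
    specificity = len(exact) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical (start, end) exon coordinates on one strand of one sequence.
annotated_exons = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted_exons = [(100, 200), (300, 410), (500, 650)]  # second exon misses a boundary

sensitivity, specificity = exact_match_accuracy(predicted_exons, annotated_exons)
print(sensitivity, specificity)
```

The exact gene measure follows the same pattern, with each gene represented by its full ordered tuple of exons, so that a single wrong boundary makes the whole gene a miss.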
Figure 4.
A detailed view of a TWINSCAN prediction (red), a GENSCAN prediction (green), and an aligned RefSeq transcript (blue). Masked repetitive and low-complexity regions (yellow) and mouse alignments (black) are indicated. (A) Complete gene prediction at the KIAA1630 gene (NM_018706) from Homo sapiens 10p14. Note that the presence of conservation is neither a necessary (e.g., the first exon) nor a sufficient (e.g., the first alignment block) condition for TWINSCAN to predict an exon. (B) A magnified region around the second exon predicted by GENSCAN. TWINSCAN correctly omits this exon because the conserved region ends within it. (C) A magnified region around the 11th and 12th RefSeq exons. TWINSCAN correctly predicts both splice sites because they are within the aligned regions. These images were produced with AceDB (http://www.acedb.org/).
Figure 5.
Relationships among the genes and exons annotated by TWINSCAN, GENSCAN, and aligned RefSeq transcripts. (A) Number of genes annotated by RefSeq, TWINSCAN, and GENSCAN, and number of exact matches among them. RefSeq and TWINSCAN contain 1,791 identical genes, RefSeq and GENSCAN contain 1,115, TWINSCAN and GENSCAN contain 2,809, and the intersection of all three sets contains 670. (B) Number of unique coding exons annotated by RefSeq, TWINSCAN, and GENSCAN, and number of exact matches among them. RefSeq and TWINSCAN contain 80,530 identical exons, RefSeq and GENSCAN contain 77,442, TWINSCAN and GENSCAN contain 134,507, and the intersection of all three sets contains 67,320.
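The pairwise counts in the Figure 5 caption each include the three-way intersection, so the size of each "shared by exactly two" region follows by subtraction. The short sketch below does that arithmetic using only the numbers quoted in the caption. The function and key names are illustrative. The total size of each set is not stated in the caption, so the regions unique to a single annotation cannot be derived here.

```python
def venn_overlap_regions(ab, ac, bc, abc):
    """Split pairwise intersection sizes into disjoint regions of a 3-set Venn diagram."""
    return {
        "RefSeq & TWINSCAN only": ab - abc,
        "RefSeq & GENSCAN only": ac - abc,
        "TWINSCAN & GENSCAN only": bc - abc,
        "all three": abc,
    }


# Counts quoted in the caption (A = RefSeq, B = TWINSCAN, C = GENSCAN).
genes = venn_overlap_regions(ab=1791, ac=1115, bc=2809, abc=670)
exons = venn_overlap_regions(ab=80530, ac=77442, bc=134507, abc=67320)

print(genes)  # 1121, 445, 2139, 670
print(exons)  # 13210, 10122, 67187, 67320
```

By these counts, 1,121 RefSeq genes are matched exactly by TWINSCAN but not by GENSCAN, against 445 matched by GENSCAN but not by TWINSCAN.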
Figure 6.
Comparison of the distribution of coding exons per transcript in the TWINSCAN predictions and RefSeq annotations. The last data point includes all transcripts containing >20 coding exons.
Figure 7.
Fraction of TWINSCAN exons (genes) in each GC bin divided by the fraction of RefSeq exons (genes) in the same bin. Bars above 1.0 represent over-prediction and those below 1.0 represent under-prediction. TWINSCAN tends to predict genes with fewer exons in areas of lower GC content.
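The quantity plotted in Figure 7 is a ratio of two fractions per GC bin. A minimal sketch of that calculation follows. The bin labels and counts are hypothetical placeholders, since the caption gives neither the bin boundaries nor the underlying counts.

```python
def gc_bin_ratios(predicted_counts, reference_counts):
    """Per-bin ratio: (fraction of predictions in bin) / (fraction of reference in bin)."""
    pred_total = sum(predicted_counts.values())
    ref_total = sum(reference_counts.values())
    ratios = {}
    for gc_bin, ref_count in reference_counts.items():
        if ref_count == 0:
            continue  # ratio is undefined when the reference has nothing in this bin
        pred_fraction = predicted_counts.get(gc_bin, 0) / pred_total
        ref_fraction = ref_count / ref_total
        ratios[gc_bin] = pred_fraction / ref_fraction
    return ratios


# Hypothetical exon counts per GC bin (not taken from the paper).
twinscan_exons = {"low GC": 30, "mid GC": 50, "high GC": 20}
refseq_exons = {"low GC": 40, "mid GC": 40, "high GC": 20}

ratios = gc_bin_ratios(twinscan_exons, refseq_exons)
for gc_bin, ratio in ratios.items():
    label = "over-predicted" if ratio > 1.0 else "under-predicted" if ratio < 1.0 else "matched"
    print(f"{gc_bin}: {ratio:.2f} ({label})")
```

Because both sides are normalized to fractions, the ratio compares how predictions are distributed across bins rather than how many are made in total.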
Figure 8.
Effect of GC bin on exact gene prediction. Bars indicate the number of exons per gene in the TWINSCAN predictions and the RefSeq annotations in each GC bin. Points indicate TWINSCAN's sensitivity and specificity for exact gene prediction in each GC bin. Exact gene accuracy is higher when TWINSCAN's predictions have 9–10 exons per gene.
