PLoS Comput Biol. 2014 Dec 4;10(12):e1003998.
doi: 10.1371/journal.pcbi.1003998. eCollection 2014 Dec.

Extensive error in the number of genes inferred from draft genome assemblies

James F Denton et al. PLoS Comput Biol.

Abstract

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Examples of misassembly leading to misannotation.
Each row shows the true state of the genome on the left (“Expected assembly”) and a common misassembly error on the right (“Observed misassembly”). A) A single gene may be assembled as two apparently paralogous loci, increasing the predicted gene count. B) A single gene may be fragmented into multiple pieces, each on a different contig or scaffold. This cleavage can increase the number of predicted genes. C) Two paralogous genes may be collapsed into a single gene, decreasing the predicted gene count. D) A gene may be partially or entirely missing from the assembly, decreasing the number of predicted genes.
Figure 2. Differences in gene family size when comparing annotated draft genomes (see Table 1 for individual descriptions) to the chicken reference assembly (v4.0).
For each gene family, the size (in total number of genes predicted) was compared to the chicken reference; positive numbers indicate an excess number of genes in the draft genome annotations, while negative numbers indicate a deficit of genes. The small number of gene families with more than +/−3 differences from the reference are not shown. Gene models were predicted using GENSCAN.
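The comparison described in this caption can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the family names and gene counts are invented, and `family_size_differences` is a hypothetical helper. It subtracts the reference family size from the draft family size, so positive values mean excess genes in the draft, and it leaves out families that differ by more than 3 when building the histogram, as the figure does.

```python
from collections import Counter


def family_size_differences(draft_sizes, reference_sizes, max_abs_diff=3):
    """Return per-family (draft - reference) gene counts and a histogram.

    Families absent from one annotation are treated as having size 0 there.
    """
    families = set(draft_sizes) | set(reference_sizes)
    diffs = {
        fam: draft_sizes.get(fam, 0) - reference_sizes.get(fam, 0)
        for fam in families
    }
    # Histogram of differences, omitting families beyond +/- max_abs_diff.
    histogram = Counter(d for d in diffs.values() if abs(d) <= max_abs_diff)
    return diffs, histogram


# Hypothetical gene family sizes (number of predicted genes per family).
reference = {"famA": 4, "famB": 2, "famC": 1, "famD": 10}
draft = {"famA": 5, "famB": 2, "famC": 0, "famD": 3}

diffs, histogram = family_size_differences(draft, reference)
for diff in sorted(histogram):
    print(f"{diff:+d}: {histogram[diff]} families")
```

Here `famD` differs by 7 genes, so it appears in `diffs` but not in the plotted histogram.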
Figure 3. Differences in gene family size when comparing the initial chimpanzee assembly (Pan_troglodytes-1.0) to an updated version (Pan_troglodytes-2.1).
Positive numbers indicate an excess number of genes in v1.0, while negative numbers indicate a deficit. The small number of gene families with more than +/−3 differences from the reference are not shown.
Figure 4. Number of predicted genes increases with increasing genome fragmentation.
Starting with the D. melanogaster reference genome (release 5.41), the sequence was cut into increasing numbers of “contigs.” GENSCAN gene predictions for each assembly are shown.
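The fragmentation experiment can be illustrated with a toy Python sketch. Several things here are assumptions rather than details from the source: breakpoints are drawn uniformly at random, the sequence and gene coordinates are invented, and gene prediction is replaced by a simple count in which every breakpoint falling inside a gene splits it into one more piece. The function names are hypothetical, and GENSCAN itself is not run.

```python
import random


def random_breakpoints(length, n_contigs, seed=0):
    """Choose n_contigs - 1 distinct interior cut positions (assumed uniform)."""
    if not 1 <= n_contigs <= length:
        raise ValueError("n_contigs must be between 1 and the sequence length")
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, length), n_contigs - 1))


def fragment(sequence, cuts):
    """Split a sequence into contigs at the given sorted cut positions."""
    bounds = [0] + list(cuts) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


def count_gene_pieces(genes, cuts):
    """Count gene pieces: each cut strictly inside a gene adds one piece.

    genes is a list of (start, end) half-open intervals on the uncut sequence.
    """
    return sum(1 + sum(1 for c in cuts if start < c < end) for start, end in genes)


# Toy 1,000 bp "genome" with two hypothetical genes.
sequence = "ACGT" * 250
genes = [(100, 300), (500, 800)]

for n_contigs in (1, 10, 100):
    cuts = random_breakpoints(len(sequence), n_contigs)
    contigs = fragment(sequence, cuts)
    print(n_contigs, len(contigs), count_gene_pieces(genes, cuts))
```

With a single contig the count equals the true number of genes; as the number of contigs grows, more breakpoints land inside genes and the count of apparent gene pieces rises.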
Figure 5. Number of predicted exons per gene decreases with increased genome fragmentation.
A comparison of the number of predicted exons per gene in the uncut D. melanogaster reference genome to the fragmented version of this genome that contains 17,941 contigs (the right-most point in Fig. 4). Gene models were predicted using GENSCAN.
