PLoS Comput Biol. 2014 Dec 4;10(12):e1003998.

doi: 10.1371/journal.pcbi.1003998. eCollection 2014 Dec.

Extensive error in the number of genes inferred from draft genome assemblies

James F Denton¹, Jose Lugo-Martinez¹, Abraham E Tucker², Daniel R Schrider³, Wesley C Warren⁴, Matthew W Hahn³

Affiliations

¹ School of Informatics and Computing, Indiana University, Bloomington, Indiana.
² Department of Biology, Indiana University, Bloomington, Indiana.
³ School of Informatics and Computing, Indiana University, Bloomington, Indiana; Department of Biology, Indiana University, Bloomington, Indiana.
⁴ The Genome Institute at Washington University, Washington University School of Medicine, Saint Louis, Missouri.

PMID: 25474019
PMCID: PMC4256071
DOI: 10.1371/journal.pcbi.1003998

James F Denton et al. PLoS Comput Biol. 2014.

Abstract

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Examples of misassembly leading to misannotation.**
Each row shows the true state of the genome on the left (“Expected assembly”) and a common misassembly error on the right (“Observed misassembly”). A) A single gene may be assembled as two apparently paralogous loci, increasing the predicted gene count. B) A single gene may be fragmented into multiple pieces, each on a different contig or scaffold. This cleavage can increase the number of predicted genes. C) Two paralogous genes may be collapsed into a single gene, decreasing the predicted gene count. D) A gene may be partially or entirely missing from the assembly, decreasing the number of predicted genes.

**Figure 2. Differences in gene family size when comparing annotated draft genomes (see Table 1 for individual descriptions) to the chicken reference assembly (v4.0).**
For each gene family, the size (in total number of genes predicted) was compared to the chicken reference; positive numbers indicate an excess number of genes in the draft genome annotations, while negative numbers indicate a deficit of genes. The small number of gene families with more than +/−3 differences from the reference are not shown. Gene models were predicted using GENSCAN.
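The per-family comparison described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the family names and sizes are made up, and the function names `family_size_differences` and `summarize` are hypothetical, not taken from the paper.

```python
from collections import Counter


def family_size_differences(draft_sizes, reference_sizes):
    """Return {family: draft size - reference size} over all families.

    A family absent from one annotation is treated as having size 0 there
    (an assumption made for this sketch).
    """
    families = set(draft_sizes) | set(reference_sizes)
    return {
        fam: draft_sizes.get(fam, 0) - reference_sizes.get(fam, 0)
        for fam in families
    }


def summarize(differences):
    """Return a histogram of differences and the fraction of families that differ."""
    histogram = Counter(differences.values())
    n_wrong = sum(1 for d in differences.values() if d != 0)
    fraction_wrong = n_wrong / len(differences) if differences else 0.0
    return histogram, fraction_wrong


# Toy, made-up family sizes (not data from the paper).
reference = {"famA": 3, "famB": 1, "famC": 2, "famD": 4, "famE": 1}
draft = {"famA": 4, "famB": 1, "famC": 1, "famD": 4}

diffs = family_size_differences(draft, reference)
histogram, fraction_wrong = summarize(diffs)
print(sorted(histogram.items()), fraction_wrong)
```

Positive values correspond to an excess of genes in the draft annotation and negative values to a deficit, matching the sign convention of the figure.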

**Figure 3. Differences in gene family size when comparing the initial chimpanzee assembly (Pan_troglodytes-1.0) to an updated version (Pan_troglodytes-2.1).**
Positive numbers indicate an excess number of genes in v1.0, while negative numbers indicate a deficit. The small number of gene families with more than +/−3 differences from the reference are not shown.

**Figure 4. Number of predicted genes increases with increasing genome fragmentation.**
Starting with the *D. melanogaster* reference genome (release 5.41), the sequence was cut into increasing numbers of “contigs.” GENSCAN gene predictions for each assembly are shown.
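The fragmentation step can be illustrated with a minimal Python sketch. The caption does not say how the cut points were chosen, so uniformly random breakpoints are assumed here. The function name `fragment_sequence` and the toy sequence are hypothetical, and the gene prediction step (GENSCAN) is not run.

```python
import random


def fragment_sequence(sequence, n_contigs, seed=0):
    """Cut a sequence into n_contigs non-empty, non-overlapping pieces.

    Breakpoints are drawn uniformly at random without replacement; this
    placement is an assumption of the sketch, not the paper's procedure.
    """
    if not 1 <= n_contigs <= len(sequence):
        raise ValueError("n_contigs must be between 1 and len(sequence)")
    rng = random.Random(seed)
    # Choose n_contigs - 1 distinct interior cut positions.
    cuts = sorted(rng.sample(range(1, len(sequence)), n_contigs - 1))
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[start:end] for start, end in zip(bounds, bounds[1:])]


# Toy sequence standing in for a chromosome.
genome = "ACGT" * 25
contigs = fragment_sequence(genome, n_contigs=5)
print([len(c) for c in contigs])
```

Raising `n_contigs` simulates an increasingly fragmented assembly. Each resulting set of contigs would then be passed to a gene predictor to count predicted genes.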

**Figure 5. Number of predicted exons per gene decreases with increased genome fragmentation.**
A comparison of the number of predicted exons per gene in the uncut *D. melanogaster* reference genome to the fragmented version of this genome that contains 17,941 contigs (the right-most point in Fig. 4). Gene models were predicted using GENSCAN.
