Genome Biol. 2010;11(3):R28.
doi: 10.1186/gb-2010-11-3-r28. Epub 2010 Mar 10.

Detection and correction of false segmental duplications caused by genome mis-assembly

David R Kelley et al. Genome Biol. 2010.

Abstract

Diploid genomes with divergent chromosomes present special problems for assembly software as two copies of especially polymorphic regions may be mistakenly constructed, creating the appearance of a recent segmental duplication. We developed a method for identifying such false duplications and applied it to four vertebrate genomes. For each genome, we corrected mis-assemblies, improved estimates of the amount of duplicated sequence, and recovered polymorphisms between the sequenced chromosomes.

Figures

Figure 1
Mis-assembled DCC and DOC. Assemblers may mistakenly form two contigs from the two haplotypes, as shown in (a), where contig A contains heterozygous sequence and contig B contains homozygous sequence (light) on both sides of a matching heterozygous region (dark), with sequencing reads drawn as lines above the contigs. We refer to A as a duplicated contained contig (DCC). We can identify this situation by finding an alignment between contigs A and B that completely covers contig A, and then comparing contig A's mate pair links in its original location to those same links when contig A is overlaid on contig B at the location of its alignment, as shown in (b). Dashed curves in (a) indicate distances that are significantly shorter (left side of figure) or longer (right) than expected; solid curves indicate distances that are consistent with specifications. In the situation shown here, we would designate contig A as an erroneous duplication likely to have been caused by haplotype differences. Alternatively, heterozygous sequence may be separated into two contigs that each include some homozygous sequence on opposite ends, as in contigs C and D in (c), which we refer to as duplicated overlapping contigs (DOCs). If a significant alignment exists between the ends of these contigs, and the distances between mate pairs pointing right from contig C and left from contig D better match their expected fragment sizes when the contigs are joined, we designate the region as an erroneous duplication and join the contigs as in (d).
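The mate pair comparison described for a DCC can be illustrated with a minimal Python sketch. It is not the authors' implementation: the caption does not specify the test statistic, so the sketch assumes a simple count of links whose implied distance lies within `max_z` standard deviations of the library mean. The names `MateLink`, `consistent_links`, `looks_like_false_duplication`, and the threshold of 3.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MateLink:
    """One mate pair with a read inside contig A and its mate elsewhere (hypothetical type)."""
    offset_in_contig: int  # position of the read relative to the start of contig A
    mate_position: int     # scaffold coordinate of the mate outside contig A
    lib_mean: float        # expected fragment size for the library
    lib_sd: float          # standard deviation of the library's fragment size


def consistent_links(links, contig_start, max_z=3.0):
    """Count links whose implied distance is within max_z SDs of the library mean.

    The threshold is an assumption made for illustration only.
    """
    count = 0
    for link in links:
        read_pos = contig_start + link.offset_in_contig
        distance = abs(link.mate_position - read_pos)
        if abs(distance - link.lib_mean) <= max_z * link.lib_sd:
            count += 1
    return count


def looks_like_false_duplication(links, original_start, overlaid_start, max_z=3.0):
    """Return True if more links are consistent when contig A is overlaid on contig B."""
    before = consistent_links(links, original_start, max_z)
    after = consistent_links(links, overlaid_start, max_z)
    return after > before


# Hypothetical data: one mate to the right (stretched) and one to the left (compressed).
links = [
    MateLink(offset_in_contig=100, mate_position=16100, lib_mean=4000, lib_sd=400),
    MateLink(offset_in_contig=500, mate_position=8500, lib_mean=4000, lib_sd=400),
]
print(looks_like_false_duplication(links, original_start=10000, overlaid_start=12000))
```

In this toy case both links are inconsistent at the original placement and consistent at the overlaid placement, so the contig would be flagged. A real analysis would also require the covering alignment between A and B and a proper statistical test.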
Figure 2
Erroneous duplication lengths. Contigs from chimpanzee, chicken, cow, and dog that are classified by our procedure as mis-assembled erroneous duplications were binned by length at 250 bp resolution. The distribution was similar for each individual species.
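Binning at 250 bp resolution can be sketched as follows. This is an illustrative helper, not code from the paper; the function name `bin_lengths` is hypothetical, and the example mixes the two contig lengths mentioned in the captions (1,537 bp and 7,162 bp) with made-up values.

```python
from collections import Counter


def bin_lengths(lengths, bin_width=250):
    """Map each contig length to the lower edge of its bin and count per bin."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


# 1537 and 7162 appear in the captions; 100, 249, and 250 are made-up lengths.
print(bin_lengths([100, 249, 250, 1537, 7162]))
# → {0: 2, 250: 1, 1500: 1, 7000: 1}
```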
Figure 3
Chimpanzee Contig412.192. (a) Contig412.192 is placed in the chimpanzee assembly on chromosome 1 such that mated reads pointing to the right have compressed mate pair distances and mated reads pointing to the left have stretched mate pair distances. (b) When the 1,537 bp contig is moved to a nearby location where it aligns in its entirety at 98.9% identity, the distances between mated reads become far more consistent with their library insert lengths. Thus, Contig412.192 is classified as a spurious duplication.
Figure 4
SCPEP1 consistent mRNA alignments. Screenshots taken from the NCBI Sequence Viewer displaying the gene model for serine carboxypeptidase 1 (SCPEP1) where green bars represent contigs and mRNA alignments are shown with red bars as alignments to exons. (a) Contig31.166 contains three putative exons. However, it overlaps neighboring Contig31.165 for all of its length (7,162 bp) at 98.6% identity, and mate pairs indicate that the two contigs came from the same position. Every mRNA alignment takes a path through the exons such that only one copy of each duplicated exon is included. (b) When the contig is moved, the extra copies of these three apparently duplicated exons are removed, but all of the alignments remain consistent.
Figure 5
Re-estimated fragment size distribution. The distribution of fragment sizes for chimpanzee library G591P4 is plotted under three models. The normal distribution with mean and standard deviation given by the NCBI Trace Archive is plotted as 'Normal TA'. A normal distribution re-estimated from the placement of mated reads from the library is plotted as 'Normal re-estimate'. To lessen the effect of outliers, we first estimated the parameters, filtered out any mate pair distances more than four standard deviations from the mean, and then estimated the parameters again. 'Nonparametric' plots the actual density of mate pair distances after smoothing with a cubic smoothing spline. The actual fragment distribution has a mean of 4,500 rather than the 5,000 listed in the Trace Archive and is far tighter around the mean than suggested by the other models. In particular, the 'Normal TA' model would have given us a very inaccurate view of this library, which is one of the largest for chimpanzee with over 2.3 million reads.
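The two-pass 'Normal re-estimate' procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `reestimate_library` is hypothetical, the sample standard deviation is assumed, and the input distances are made up. The cubic smoothing spline used for the 'Nonparametric' curve is not covered here.

```python
import statistics


def reestimate_library(distances, max_sd=4.0):
    """Estimate mean and SD, drop distances beyond max_sd SDs, then estimate again.

    Returns (mean, sd, number_of_dropped_distances).
    """
    mean0 = statistics.fmean(distances)
    sd0 = statistics.stdev(distances)
    kept = [d for d in distances if abs(d - mean0) <= max_sd * sd0]
    return statistics.fmean(kept), statistics.stdev(kept), len(distances) - len(kept)


# Made-up mate pair distances: 99 values near 4,500 plus one gross outlier.
distances = [4400, 4500, 4600] * 33 + [40000]
mean, sd, dropped = reestimate_library(distances)
print(mean, round(sd, 1), dropped)
```

Note that a single pass of filtering only works when the sample is large enough for an outlier to exceed four inflated standard deviations, which is why the toy data set uses 100 values.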
