Comparative Study
BMC Genomics. 2010 Jan 11;11:21.
doi: 10.1186/1471-2164-11-21.

Effort required to finish shotgun-generated genome sequences differs significantly among vertebrates

Robert W Blakesley et al. BMC Genomics.

Abstract

Background: The approaches for shotgun-based sequencing of vertebrate genomes are now well-established, and have resulted in the generation of numerous draft whole-genome sequence assemblies. In contrast, the process of refining those assemblies to improve contiguity and increase accuracy (known as 'sequence finishing') remains tedious, labor-intensive, and expensive. As a result, the vast majority of vertebrate genome sequences generated to date remain at a draft stage.

Results: To date, our genome sequencing efforts have focused on comparative studies of targeted genomic regions, requiring sequence finishing of large blocks of orthologous sequence (average size 0.5-2 Mb) from various subsets of 75 vertebrates. This experience has provided a unique opportunity to compare the relative effort required to finish shotgun-generated genome sequence assemblies from different species, which we report here. Importantly, we found that the sequence assemblies generated for the same orthologous regions from various vertebrates show substantial variation with respect to misassemblies and, in particular, the frequency and characteristics of sequence gaps. As a consequence, the work required to finish different species' sequences varied greatly. Application of the same standardized methods for finishing provided a novel opportunity to "assay" characteristics of genome sequences among many vertebrate species. It is important to note that many of the problems we have encountered during sequence finishing reflect unique architectural features of a particular vertebrate's genome, which in some cases may have important functional and/or evolutionary implications. Finally, based on our analyses, we have been able to improve our procedures to overcome some of these problems and to increase the overall efficiency of the sequence-finishing process, although significant challenges still remain.

Conclusion: Our findings have important implications for the eventual finishing of the draft whole-genome sequences that have now been generated for a large number of vertebrates.

Figures

Figure 1
Gaps in comparative-grade finished BAC sequences from ENm001. The human- and comparative-grade finished sequences of the 541 BACs summarized in Table 1 were compared, and the gaps detected in the comparative-grade finished sequence were analyzed. Indicated for each species are the total bases within gaps per Mb of human-grade finished sequence (A), the number of gaps per Mb of human-grade finished sequence (B), and the median size of the gaps in base pairs (C).
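The three quantities plotted in Figure 1 are simple normalizations, and a minimal sketch may make them concrete. It assumes the gaps have already been identified by aligning the comparative-grade sequence to the human-grade sequence; the function name `gap_metrics` and its inputs are hypothetical and are not taken from the paper.

```python
from statistics import median


def gap_metrics(gap_sizes_bp, finished_length_bp):
    """Summarize gaps relative to the length of human-grade finished sequence.

    gap_sizes_bp: sizes (in bp) of the gaps found in the comparative-grade sequence.
    finished_length_bp: total length (in bp) of the human-grade finished sequence.
    Both inputs are hypothetical; the paper does not describe a data format.
    """
    if finished_length_bp <= 0:
        raise ValueError("finished_length_bp must be positive")
    length_mb = finished_length_bp / 1_000_000
    return {
        # Panel A: total bases within gaps per Mb of finished sequence
        "gap_bases_per_mb": sum(gap_sizes_bp) / length_mb,
        # Panel B: number of gaps per Mb of finished sequence
        "gaps_per_mb": len(gap_sizes_bp) / length_mb,
        # Panel C: median gap size in bp (None when there are no gaps)
        "median_gap_bp": median(gap_sizes_bp) if gap_sizes_bp else None,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only: three gaps in 2 Mb of sequence.
    print(gap_metrics([100, 300, 200], 2_000_000))
```

Normalizing by the human-grade length, rather than by the comparative-grade length, keeps the denominator independent of how many bases the gaps themselves remove.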
Figure 2
Characteristics of gaps in comparative-grade finished BAC sequences from ENm001. Sequences within the gaps (blue bars) summarized in Figure 1 were analyzed with respect to GC (A) and simple repeat (B) content; similar analyses were performed for the entire human-grade finished BAC sequences (orange bars). Results for the 19 species with the greatest differences (i.e., gap sequences vs. total BAC sequences) for each analysis are shown. Each error bar represents the 95% confidence interval.
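A comparison like the one in Figure 2A could be sketched as follows. The caption does not say how the 95% confidence intervals were computed, so the normal approximation used here (mean ± 1.96 standard errors) is an assumption, and the helper names `gc_fraction` and `mean_with_ci95` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases; None if there are none."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return None
    return sum(b in "GC" for b in bases) / len(bases)


def mean_with_ci95(values):
    """Mean and an approximate 95% CI (normal approximation, an assumption).

    Returns (mean, lower, upper). With fewer than two values the interval
    collapses to the mean because no standard error can be estimated.
    """
    m = mean(values)
    if len(values) < 2:
        return m, m, m
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


if __name__ == "__main__":
    # Made-up sequences for illustration only.
    gap_seqs = ["GGCCGGCC", "GCGCGCAT", "GGGCCCAA"]
    gc_values = [gc_fraction(s) for s in gap_seqs]
    print(mean_with_ci95(gc_values))
```

The same two functions can be applied to the full human-grade BAC sequences to obtain the comparison values; for small numbers of sequences a t-based interval would be more appropriate than the fixed 1.96 multiplier.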
Figure 3
Gaps in comparative-grade finished BAC sequences from multiple genomic regions. The comparative-grade finished sequences of the 2,031 BACs summarized in Tables 2 and 3 were analyzed for the presence of uncaptured (orange bar) and captured (blue bar) gaps (see text for details). The numbers of captured and uncaptured gaps per Mb averaged across all BACs from each indicated genomic region (A) or species (B) are indicated. Each error bar represents the 95% confidence interval. In A, the data for three additional ENCODE pilot project regions (ENr231, ENr232, and ENr333) are shown for comparison because of the notably high frequency of gaps in their sequences; however, there were not sufficient numbers of sequenced BACs from these regions to qualify for inclusion in the second data set (see text for details).
Figure 4
Variation in the redundancy of sequence reads generated using shotgun libraries prepared with standard and copy-control E. coli strains. A shotgun-subclone library was prepared from each of three BACs [GenBank:AC153092, AC190087, and AC186717] and used to transform either standard DH10B tonA (Std) or copy-control EPI400 (CC) E. coli strains. From each library, paired forward and reverse sequence reads were then generated from randomly selected subclones to produce assemblies that provided an average of eightfold sequence redundancy. Aligned representative regions of the assemblies that highlight differences in sequence redundancy encountered with the two E. coli strains are shown for each BAC. Yellow lines indicate sequence-read redundancies on the upper and lower strands of the indicated sequence contig (black/red line); the horizontal orange lines depict a redundancy value of 10.
