Comparative Study
Genome Res. 1998 Oct;8(10):1074-84.
doi: 10.1101/gr.8.10.1074.

Analysis of the quality and utility of random shotgun sequencing at low redundancies

J Bouck et al. Genome Res. 1998 Oct.

Abstract

The currently favored approach for sequencing the human genome involves selecting representative large-insert clones (100-200 kb), randomly shearing this DNA to construct shotgun libraries, and then sequencing many different isolates from the library. This method, termed directed random shotgun sequencing, requires highly redundant sequencing to obtain a complete and accurate finished consensus sequence. Recently it has been suggested that a rapidly generated lower-redundancy sequence might be of use to the scientific community. Low-redundancy sequencing has been examined previously using simulated data sets. Here we utilize trace data from a number of projects submitted to GenBank to perform reconstruction experiments that mimic low-redundancy sequencing. These low-redundancy sequences have been examined for the completeness and quality of the consensus product, information content, and usefulness for interspecies comparisons. The data presented here suggest three different sequencing strategies, each with different utilities. (1) Nearly complete sequence data can be obtained by sequencing a random shotgun library at sixfold redundancy. This may therefore represent a good point to switch from a random to a directed approach. (2) Sequencing can be performed with as little as twofold redundancy to find most of the information about exons, EST hits, and putative exon similarity matches. (3) To obtain contiguity of coding regions, sequencing at three- to fourfold redundancy would be appropriate. From these results, we suggest that a useful intermediate product for genome sequencing might be obtained by three- to fourfold redundancy. Such a product would allow a large amount of biologically useful data to be extracted while postponing the majority of work involved in producing a high-quality consensus sequence.
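As a rough point of reference for the redundancy levels discussed above, the idealized Lander-Waterman model predicts how coverage and contig count change with fold redundancy. This is a minimal sketch and not the method used in the study, which relied on real trace data; the 150 kb clone length (within the 100-200 kb range quoted above) and the 500-base read length are assumed values chosen only for illustration, and the function names are hypothetical.

```python
import math


def expected_fraction_covered(redundancy: float) -> float:
    """Expected fraction of the clone covered by at least one read.

    Idealized Lander-Waterman estimate: 1 - exp(-c), where c is the
    fold redundancy. Assumes uniformly random read placement.
    """
    return 1.0 - math.exp(-redundancy)


def expected_contigs(redundancy: float, clone_length: int, read_length: int) -> float:
    """Expected number of contigs, ignoring any minimum-overlap requirement.

    Uses N * exp(-c) with N = c * clone_length / read_length reads.
    """
    n_reads = redundancy * clone_length / read_length
    return n_reads * math.exp(-redundancy)


if __name__ == "__main__":
    CLONE_LENGTH = 150_000  # assumed, within the 100-200 kb range
    READ_LENGTH = 500       # assumed usable read length
    for c in (2, 3, 4, 6):
        covered = expected_fraction_covered(c)
        contigs = expected_contigs(c, CLONE_LENGTH, READ_LENGTH)
        print(f"{c}x: {covered:.1%} covered, ~{contigs:.0f} contigs expected")
```

Under these idealized assumptions the uncovered fraction shrinks exponentially with redundancy, which is in line with the abstract's observation that sixfold redundancy yields nearly complete sequence. Real shotgun libraries deviate from uniform sampling, so the numbers are only indicative.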

Figures

Figure 1
Contig formation at lower redundancy of sequencing. The number of contigs that were larger than 2 kb was calculated for each low redundancy simulation. The fold redundancy of each clone was calculated based on the number of bases that had a Phred value >20. The projects that were examined are listed at right.
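The fold-redundancy calculation described in this legend can be sketched as follows. This is one plausible reading, not the authors' code: it assumes that the count of bases with a Phred value above 20, summed over all reads, is divided by the length of the clone. The function name and the input layout (one list of per-base quality values per read) are hypothetical.

```python
from typing import Iterable, Sequence


def fold_redundancy(
    read_qualities: Iterable[Sequence[int]],
    clone_length: int,
    threshold: int = 20,
) -> float:
    """Estimate fold redundancy from high-quality bases only.

    Counts every base whose Phred value is strictly greater than
    `threshold` across all reads, then divides by the clone length
    (assumed here to be the normalizing quantity).
    """
    if clone_length <= 0:
        raise ValueError("clone_length must be positive")
    high_quality_bases = sum(
        1 for quals in read_qualities for q in quals if q > threshold
    )
    return high_quality_bases / clone_length
```

For example, two reads with qualities `[30, 30, 10]` and `[25, 20, 40]` contribute four bases above the threshold; over a toy clone of length 2 that gives a redundancy of 2.0.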
Figure 2
Areas containing shallow depth at lower-redundancy sequencing. The lower redundancy-simulated sequences were tested for the occurrence of areas of shallow depth of coverage (see text). The projects examined are listed at right.
Figure 3
Assessment of consensus base quality at lower redundancies of sequencing. The Phrap value generated for each consensus base was examined, and the total number of bases that had a value above 40 was counted. The number of bases with values >40 is represented as a percentage of the total number of bases in the project. Low values at the termini of contigs were excluded from the totals. A bar (dashes) at the 90% level is shown for reference purposes.
Figure 4
Alignment of contigs generated from low-redundancy sequencing to the known sequence. The lower-redundancy consensus sequences were compared to the completed sequence that was submitted to GenBank. The distance from the start of the project is indicated at the bottom, and the percent identity is indicated at left. Exons (numbered solid boxes) and repeats [SINEs other than mammalian interspersed repeats (MIRs) are light gray triangles pointing toward the A-rich 3′ end; LINE1s are open triangles; MIR and LINE2 elements are solid triangles] are indicated at the top.
Figure 5
Contiguity of genes on project J19 at different levels of redundancy. The number of contigs at each coverage that described a gene was counted. (Top) The genes that contain 10 or more exons and are therefore termed complex; (bottom) the genes that contain <10 exons and are therefore termed simple. The names of the genes are indicated at right; the number of exons in each gene follows the name.
Figure 6
Contiguity of genes on project P7 at different levels of redundancy. The number of contigs at each coverage that described a gene was counted. (Top) The genes that contain 10 or more exons; (bottom) the genes that contain <10 exons. The names of the genes are indicated at right; the number of exons in each gene follows the name.
Figure 7
Interspecies comparisons at different levels of redundancy. The consensus sequences generated from low-redundancy sequencing simulations were aligned to the completed sequence. (A) Percent identity between the simulated consensus and the finished sequence is shown for a series of coverages. The fold redundancy is indicated at left, the percent identity is indicated at right, and the distance from the beginning of the project is indicated at the bottom. (Top) The exons and repeats are indicated (see legend to Fig. 4 for description). (B) The number of highly homologous regions that exist between the human and mouse projects was counted. The percent of these regions that were sequenced is indicated (Sequenced), as is the percent of these regions that were identified by a homology search (Identified).
