Genome Res. 2008 Dec;18(12):2024-33.
doi: 10.1101/gr.080200.108. Epub 2008 Sep 25.

Sequencing of natural strains of Arabidopsis thaliana with short reads

Stephan Ossowski et al. Genome Res. 2008 Dec.

Abstract

Whole-genome hybridization studies have suggested that the nuclear genomes of accessions (natural strains) of Arabidopsis thaliana can differ by several percent of their sequence. To examine this variation, and as a first step in the 1001 Genomes Project for this species, we produced 15- to 25-fold coverage in Illumina sequencing-by-synthesis (SBS) reads for the reference accession, Col-0, and two divergent strains, Bur-0 and Tsu-1. We aligned reads to the reference genome sequence to assess data quality metrics and to detect polymorphisms. Alignments revealed 823,325 unique single nucleotide polymorphisms (SNPs) and 79,961 unique 1- to 3-bp indels in the divergent accessions at a specificity of >99%, and over 2000 potential errors in the reference genome sequence. We also identified >3.4 Mb of the Bur-0 and Tsu-1 genomes as being either extremely dissimilar, deleted, or duplicated relative to the reference genome. To obtain sequences for these regions, we incorporated the Velvet assembler into a targeted de novo assembly method. This approach yielded 10,921 high-confidence contigs that were anchored to flanking sequences and harbored indels as large as 641 bp. Our methods are broadly applicable for polymorphism discovery in moderate to large genomes even at highly diverged loci, and we established by subsampling the Illumina SBS coverage depth required to inform a broad range of functional and evolutionary studies. Our pipeline for aligning reads and predicting SNPs and indels, SHORE, is available for download at http://1001genomes.org.

Figures

Figure 1.
Errors by read position and prb values. (A) Distribution of observed errors by position in reads. (B) Relationship between prb values and observed errors with the frequency spectrum for prb values based on all called bases. All data are from uniquely aligned Col-0 reads.
Figure 2.
Performance evaluation for sequence predictions from aligned reads. Specificity (top) and sensitivity (bottom) for Bur-0 by genome coverage depth (see legend) for a range of minimum read supports. Approximate genome coverage estimates for each subsample of the data are based on the data in Table 1 (see also Supplemental Methods).
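The evaluation described in this caption can be illustrated with a minimal sketch. It assumes a hypothetical table of candidate sites with read-support counts and a hypothetical validation set. It reports the fraction of calls that are confirmed and the fraction of validated sites that are recovered, for several minimum read supports. The names `call_snps` and `evaluate`, the toy data, and the exact metric definitions are assumptions for illustration, not the paper's SHORE implementation.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position); hypothetical representation


def call_snps(support: Dict[Site, int], min_support: int) -> Set[Site]:
    """Keep candidate sites backed by at least min_support reads."""
    return {site for site, n in support.items() if n >= min_support}


def evaluate(calls: Set[Site], truth: Set[Site]) -> Tuple[float, float]:
    """Return (fraction of calls confirmed, fraction of true sites recovered)."""
    true_pos = len(calls & truth)
    frac_confirmed = true_pos / len(calls) if calls else float("nan")
    sensitivity = true_pos / len(truth) if truth else float("nan")
    return frac_confirmed, sensitivity


# Toy data (invented): read support per candidate site and a validation set.
support = {("chr1", 100): 5, ("chr1", 250): 2, ("chr2", 40): 8, ("chr2", 90): 1}
truth = {("chr1", 100), ("chr1", 250), ("chr2", 40)}

for min_support in (1, 3):
    calls = call_snps(support, min_support)
    print(min_support, evaluate(calls, truth))
```

In this toy case, raising the minimum read support removes the one unconfirmed call but also loses a true site with low support, which is the kind of trade-off the figure plots across coverage depths.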
Figure 3.
Targeted de novo assembly. (A) Example of alignment of Bur-0 reads to the reference (“Ref.”) sequence. Columns of high-quality mismatches (red) identify SNPs. A stretch of nucleotides without overlapping reads defined a target for de novo assembly (gray shading). Masked mismatches are highlighted in yellow. (B) Targeted-assembly derived Bur-0 contig for the same region, with reads added from the pool of unmapped (leftover) reads (green). Flanking SNPs identified in the mapping were recovered in the assembly, as was a complex sequence, which included two adjacent insertions and four SNPs in Bur-0 compared with the reference. The Bur-0 sequence was validated by PCR amplification and dideoxy sequencing. Mismatches to the contig sequence are highlighted in light purple.
Figure 4.
Detection of duplicated sequences using read coverage and sequence criteria. (A) Expected vs. observed read coverage in Col-0 for a region on chromosome 4 (positions are given at bottom). (B) Analogous region in Bur-0 harboring a predicted duplication. (Vertical lines) Seven CVPs in the region with elevated observed-to-expected coverage. The relative support for each base at a CVP is indicated below (pie charts). (C) Pseudotraces (see Supplemental Fig. S1 in Clark et al. [2007]) from resequencing array data, for five of the seven CVPs for which data was available (see Supplemental Methods). Compared with Col-0, double peaks are apparent in Bur-0 that match, in sequence, bases identified in the short-read data.
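The coverage criterion mentioned in this caption can be sketched as a simple windowed comparison of observed and expected read depth. This is only an illustration under stated assumptions: the function name `elevated_windows`, the window size, the ratio threshold, and the toy depths are all hypothetical, and the sequence-based criterion (the CVPs) is not modeled here.

```python
from typing import List, Tuple


def elevated_windows(
    observed: List[float],
    expected: List[float],
    window: int,
    min_ratio: float,
) -> List[Tuple[int, int, float]]:
    """Return (start, end, ratio) for non-overlapping windows whose
    observed-to-expected coverage ratio is at least min_ratio."""
    hits = []
    for start in range(0, len(observed) - window + 1, window):
        obs = sum(observed[start:start + window])
        exp = sum(expected[start:start + window])
        if exp > 0:  # skip windows with no expected coverage
            ratio = obs / exp
            if ratio >= min_ratio:
                hits.append((start, start + window, ratio))
    return hits


# Toy per-position depths (invented): positions 4-7 have about twice the expected depth.
observed = [20, 22, 19, 21, 41, 39, 40, 42, 20, 18]
expected = [20.0] * 10

print(elevated_windows(observed, expected, window=2, min_ratio=1.5))
```

A ratio near 2 is what a simple two-copy duplication collapsed onto one reference locus would be expected to produce; real data would need a more careful model of expected coverage than the flat value assumed here.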
Figure 5.
Distribution of premature stop and frameshift mutations within coding regions. The frequency of premature stop codons and frameshift mutations was increased toward the 3′ ends of coding regions. In addition, frameshift mutations were overrepresented at the 5′ end, potentially compatible with alternative splicing or alternative use of initiation codons.

