BMC Genomics. 2011 Jan 25;12:59.
doi: 10.1186/1471-2164-12-59.

Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence

Frank M You et al. BMC Genomics. 2011.

Abstract

Background: Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence.

Results: An annotation-based, genome-wide SNP discovery pipeline is reported that uses NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii, the diploid source of the wheat D genome, which has a genome size of 4.02 Gb, of which 90% is repetitive sequence. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 were sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730 xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated.

Conclusion: An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the 497,118 discovered Ae. tauschii SNPs are available at http://avena.pw.usda.gov/wheatD/agsnp.shtml.
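The three per-class SNP counts given in the Results can be tallied against the total quoted in the Conclusion. The short sketch below only re-does that arithmetic; the dictionary keys are labels chosen here for illustration and are not AGSNP output fields.

```python
# Putative SNP counts per annotation class, as reported in the abstract.
snp_counts = {
    "gene sequences": 195_631,
    "uncharacterized single-copy regions": 155_580,
    "repeat junctions": 145_907,
}

# Sum of the three classes; the abstract quotes 497,118 SNPs in total.
total = sum(snp_counts.values())

# Fraction of all putative SNPs contributed by each class.
shares = {name: count / total for name, count in snp_counts.items()}

for name, count in snp_counts.items():
    print(f"{name}: {count:,} ({shares[name]:.1%})")
print(f"total: {total:,}")
```

The sum of the three classes equals the 497,118 SNPs stated in the Conclusion.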

Figures

Figure 1
Annotation pipeline of Roche 454 reads from the Ae. tauschii accession AL8/78 (genotype 1). Predicted single-copy gene-related sequences, uncharacterized sequences and repeat junction sequences will be used for SNP discovery. The processes with dashed boxes are optional depending on whether or not cDNA short reads are available.
Figure 2
Frequency distributions of the depths of NGS reads of AS75 mapped to annotated Roche 454 reads. Except for the characterized repeat reads, the distributions can be approximated by an extreme value distribution. (A) Frequency distributions of the depths of AS75 SOLiD genomic reads (total ~26X genome equivalents) mapped to Roche 454 characterized gene reads, repeat junction reads, characterized repeat reads, and uncharacterized reads. Because most gene-related reads are single copy, the frequency distribution of reads mapped to gene-related reads is used as the single-copy read distribution. The estimated population mean (X̄) plus two standard deviations (s) of this distribution (a depth of 53X) was used as the cut-off depth for considering AS75 SOLiD genomic reads mapped on Roche 454 AL8/78 gene reads, repeat junctions, and uncharacterized reads as single-copy. (B) Frequency distributions of read depths and X̄ + 2s cut-off values for Solexa AS75 genomic reads (~1.56X genome equivalent) and Roche 454 AS75 genomic reads (~0.11X genome equivalent) mapped to characterized gene reads of Roche 454. The distributions were skewed to the left because of low coverage but still could be fitted to an extreme value distribution (a Weibull distribution) [34].
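The mean-plus-two-standard-deviations cut-off described in this caption can be sketched as follows. This is a minimal illustration, not the AGSNP code: it uses the sample mean and sample standard deviation directly instead of estimates from a fitted extreme value distribution, the function names and the toy depths are hypothetical, and treating a depth equal to the cut-off as single-copy is an assumption.

```python
from statistics import mean, stdev


def single_copy_cutoff(gene_read_depths, n_sd=2.0):
    """Return mean + n_sd * sample standard deviation of the mapping depths
    observed on gene-related reads (used here as the single-copy proxy)."""
    return mean(gene_read_depths) + n_sd * stdev(gene_read_depths)


def is_single_copy(depth, cutoff):
    """Assumption: a reference read counts as single-copy when its mapping
    depth does not exceed the cut-off."""
    return depth <= cutoff


# Hypothetical toy depths on gene-related reference reads (not paper data).
toy_depths = [18, 22, 25, 27, 30, 24, 21, 35, 28, 26]
cutoff = single_copy_cutoff(toy_depths)

# Reference reads covered more deeply than the cut-off would be set aside
# as likely repetitive or paralogous.
kept = [d for d in (20, 40, 120) if is_single_copy(d, cutoff)]
print(round(cutoff, 2), kept)
```

With the depth of 53X reported for the SOLiD data, the same comparison would be `is_single_copy(depth, 53)`.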
Figure 3
SNP discovery pipeline using Roche 454 reads of the Ae. tauschii accession AL8/78 as a reference.
Figure 4
Roche 454 sequencing errors in relation to base quality score of reads. Over 70% of base substitution errors (arrows) can be filtered out if the SNP base quality score is ≥ 30 and the neighbourhood quality standard (NQS) 11 base score is ≥ 20.
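A quality filter of the kind named in this caption might look like the sketch below. It assumes the common reading of the neighbourhood quality standard: the candidate base must reach the SNP threshold, and each of the five bases on either side (an 11-base window) must reach the neighbourhood threshold. The function name, the rejection of windows that run off the end of a read, and the toy quality values are assumptions made here, not details taken from the paper.

```python
def passes_nqs(quals, pos, snp_min=30, flank_min=20, flank=5):
    """Check a candidate SNP base against an NQS-style filter.

    quals: per-base Phred quality scores of one read.
    pos:   0-based index of the candidate SNP base in that read.
    """
    start, end = pos - flank, pos + flank
    # Assumption: reject candidates whose 11-base window is incomplete.
    if start < 0 or end >= len(quals):
        return False
    # The candidate base itself must meet the SNP quality threshold.
    if quals[pos] < snp_min:
        return False
    # Every flanking base in the window must meet the neighbourhood threshold.
    return all(
        q >= flank_min
        for i, q in enumerate(quals[start:end + 1], start)
        if i != pos
    )


# Hypothetical read qualities: SNP base at index 5 with clean flanks.
good = [25] * 5 + [35] + [25] * 5
noisy = list(good)
noisy[3] = 15  # one low-quality neighbour

print(passes_nqs(good, 5), passes_nqs(noisy, 5))
```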
Figure 5
The relationship of Roche 454 sequencing errors to relative error locations in reads.
Figure 6
The abundance and distribution of rice genes homologous to Ae. tauschii genes bearing SNPs across the 12 rice chromosomes. Each heat map track represents a rice chromosome from R01 to R12. The number of rice genes homologous to Ae. tauschii genes with SNPs in a bin of width 0.1 Mb ranges from 0 (white) to 11 (deepest blue).
Figure 7
The number of gene SNPs discovered is significantly correlated with the genome coverage of reads mapped to reference sequences. Random samples of SOLiD genomic reads with ~2X (1 run of SOLiD sequencing), ~4X (2 runs), ~6X (3 runs), ~8X (4 runs) and ~10X (5 runs) genome equivalents were used for SNP discovery. The values of genome coverage were estimated based on AS75 reads mapped to the annotated Roche 454 gene reads (Figure 2). The genome coverage of 10.7X (Table 6) estimated with this method corresponds to the 26.57X (Table 1) estimated from the 4.02 Gb genome size of Ae. tauschii.
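The two coverage scales in this caption can be related with simple arithmetic. The sketch below takes the 10.7X and 26.57X figures and the 4.02 Gb genome size from the text; the helper names are hypothetical, and applying the ratio to other sequencing depths assumes that mapped coverage scales linearly with sequenced coverage, which the caption does not state.

```python
GENOME_SIZE_BP = 4.02e9      # Ae. tauschii genome size from the abstract
MAPPED_COVERAGE = 10.7       # coverage estimated from reads mapped to gene reads
SEQUENCED_COVERAGE = 26.57   # coverage estimated from the genome size

# Ratio of mapped to sequenced coverage for the full SOLiD data set.
ratio = MAPPED_COVERAGE / SEQUENCED_COVERAGE


def genome_equivalents(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Sequenced bases expressed as multiples of the genome size."""
    return total_bases / genome_size_bp


def mapped_from_sequenced(sequenced_x, mapped_ratio=ratio):
    """Assumption: mapped coverage is a fixed fraction of sequenced coverage."""
    return sequenced_x * mapped_ratio


print(f"mapped/sequenced ratio: {ratio:.3f}")
print(f"one genome equivalent: {genome_equivalents(4.02e9):.2f}X")
```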
