PeerJ. 2020 Jun 10;8:e9291.
doi: 10.7717/peerj.9291. eCollection 2020.

DiscoSnp-RAD: de novo detection of small variants for RAD-Seq population genomics

Jérémy Gauthier et al. PeerJ.

Abstract

Restriction site Associated DNA Sequencing (RAD-Seq) is a technique that sequences specific loci along the genome. It is widely employed in evolutionary biology because it makes it possible to exploit variant information (mainly Single Nucleotide Polymorphisms, SNPs) from entire populations at a reduced cost. Common RAD-dedicated tools, such as STACKS or IPyRAD, are based on all-vs-all read alignments, which require substantial time and computing resources. We present an original method, DiscoSnp-RAD, that avoids this pitfall: variants are detected by exploiting specific parts of the assembly graph built from the reads, so no all-vs-all read alignment is needed. We tested the implementation on simulated datasets of increasing size, up to 1,000 samples, and on real RAD-Seq data from 259 specimens of Chiastocheta flies, morphologically assigned to seven species. All individuals were successfully assigned to their species using both STRUCTURE and Maximum Likelihood phylogenetic reconstruction. Moreover, the identified variants revealed a within-species genetic structure linked to geographic distribution. Furthermore, our results show that DiscoSnp-RAD is significantly faster than state-of-the-art tools. Overall, the results show that DiscoSnp-RAD is suitable for identifying variants from RAD-Seq data, does not require time-consuming parameterization steps, and stands out from other tools because of its completely different principle, which makes it substantially faster, in particular on large datasets.

Keywords: Deletions; Insertions; RAD-seq; Reference-free; SNPs; Variants.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Examples of bubbles detected by SNPs in a toy de Bruijn graph, with k = 4.
In (A) the bubble is complete: this corresponds to a bubble detected by DiscoSnp++. In (B), the bubble is symmetrically truncated: it is composed of a branching node (“ACTG”) whose two successors lead to two distinct paths that both have the same length and such that their last two nodes have no successor. The graph (C) shows an example of two bubbles from the same locus. The leftmost bubble contains two symmetrically branching crossroads.
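
The complete bubble of panel (A) can be illustrated with a minimal sketch. This is not the DiscoSnp++ or DiscoSnp-RAD implementation: the two toy sequences and the function names `build_de_bruijn`, `follow` and `find_snp_bubbles` are assumptions made for illustration, and the sketch only handles an isolated SNP in an error-free graph, where each allele contributes exactly k nodes between the branching node and the node where the paths rejoin. Truncated bubbles as in (B) and branching crossroads as in (C) are not covered.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Map every k-mer to the set of k-mers that directly follow it."""
    graph = defaultdict(set)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
        if kmers:
            # Register the last k-mer as a node even if it has no successor.
            graph.setdefault(kmers[-1], set())
    return dict(graph)


def follow(graph, node, max_nodes):
    """Walk forward from `node` while each node has exactly one successor."""
    path = [node]
    while len(path) < max_nodes and len(graph[path[-1]]) == 1:
        path.append(next(iter(graph[path[-1]])))
    return path


def find_snp_bubbles(graph, k):
    """Return (start, path_a, path_b, end) for each complete isolated-SNP bubble."""
    bubbles = []
    for start, successors in graph.items():
        if len(successors) != 2:
            continue
        first, second = sorted(successors)
        # k nodes carry the variant position, then one shared closing node.
        path_a = follow(graph, first, k + 1)
        path_b = follow(graph, second, k + 1)
        closes = (
            len(path_a) == k + 1
            and len(path_b) == k + 1
            and path_a[-1] == path_b[-1]
        )
        if closes:
            bubbles.append((start, path_a[:-1], path_b[:-1], path_a[-1]))
    return bubbles


# Hypothetical toy alleles that differ by a single A/T substitution.
alleles = ["GGACTGACATTCG", "GGACTGTCATTCG"]
graph = build_de_bruijn(alleles, k=4)
bubbles = find_snp_bubbles(graph, k=4)
for start, path_a, path_b, end in bubbles:
    print(start, path_a, path_b, end)
```

With these two toy sequences, the only branching node is "ACTG" and the two four-node paths rejoin at "CATT".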
Figure 2. Recall (A), precision (B), time (C) and space (D) evolution on simulated data with different sampling sizes.
For the sampling of 100 samples, five parameter sets were tested for IPyRAD and STACKS (see “Material and Methods” for details).
Figure 3. Recall and precision on simulated data of 100 samples using DiscoSnp-RAD with respect to (A) k-mer size, (B) maximal number of authorized SNPs per bubble, (C) maximal number of authorized substitutions when mapping reads on predicted variant sequences, and (D) maximal number of symmetrically branching crossroads. On each plot, the dashed vertical line marks the chosen default value.
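
For readers unfamiliar with the two metrics plotted in Figures 2 and 3, here is a minimal sketch of how recall and precision are computed from a set of simulated (true) variants and a set of predicted variants. The evaluation procedure actually used in the paper is not described on this page; matching variants by exact `(locus, position, ref, alt)` tuples, the function name `recall_precision`, and the example data are all assumptions.

```python
def recall_precision(true_variants, predicted_variants):
    """Return (recall, precision) for two collections of hashable variants."""
    true_set = set(true_variants)
    predicted_set = set(predicted_variants)
    true_positives = len(true_set & predicted_set)
    # Recall: fraction of true variants that were predicted.
    recall = true_positives / len(true_set) if true_set else 0.0
    # Precision: fraction of predicted variants that are true.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    return recall, precision


# Hypothetical data: 4 simulated variants, 5 predictions, 3 of them correct.
truth = [
    ("locus1", 12, "A", "G"),
    ("locus1", 40, "C", "T"),
    ("locus2", 7, "G", "A"),
    ("locus3", 55, "T", "C"),
]
predicted = [
    ("locus1", 12, "A", "G"),
    ("locus1", 40, "C", "T"),
    ("locus2", 7, "G", "A"),
    ("locus2", 30, "A", "C"),
    ("locus4", 3, "G", "T"),
]
recall, precision = recall_precision(truth, predicted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this made-up example, three of the four true variants are recovered (recall 0.75) and three of the five predictions are correct (precision 0.60).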
Figure 4. (A) RAxML phylogeny built from all variants predicted by DiscoSnp-RAD.
Bootstrap node supports > 80 are denoted by gray dots and supports > 90 by black dots. (B) STRUCTURE results obtained with SNPs only and with all variants on the seven Chiastocheta species. (C) Plot of the first two principal components from a multivariate analysis of C. lophota samples and (D) their geographic distribution (figure made with Natural Earth Contributors (2020)).

References

1. Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. Harnessing the power of RADseq for ecological and evolutionary genomics. Nature Reviews Genetics. 2016;17(2):81–92. doi: 10.1038/nrg.2015.28.
2. Catchen J, Hohenlohe PA, Bassham S, Amores A, Cresko WA. Stacks: an analysis tool set for population genomics. Molecular Ecology. 2013;22(11):3124–3140. doi: 10.1111/mec.12354.
3. Eaton DAR. PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics. 2014;30(13):1844–1849. doi: 10.1093/bioinformatics/btu121.
4. Eaton DAR, Overcast I. ipyrad: interactive assembly and analysis of RADseq datasets. Bioinformatics. 2020;36(8):2592–2594. doi: 10.1093/bioinformatics/btz966.
5. Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLOS ONE. 2011;6(5):1–10.