Nucleic Acids Res. 2015 Jan;43(2):e11.

doi: 10.1093/nar/gku1187. Epub 2014 Nov 17.

Reference-free detection of isolated SNPs

Raluca Uricaru¹, Guillaume Rizk², Vincent Lacroix³, Elsa Quillery⁴, Olivier Plantard⁴, Rayan Chikhi⁵, Claire Lemaitre⁶, Pierre Peterlongo⁷

Affiliations

¹ University of Bordeaux, CNRS/LaBRI, F-33405 Talence, France; University of Bordeaux, CBiB, F-33000 Bordeaux, France; INRA, UMR1349 IGEPP, Le Rheu, France. ruricaru@labri.fr
² GenScale, INRIA Rennes Bretagne-Atlantique, IRISA, Rennes, France.
³ BAMBOO, INRIA Grenoble Rhone-Alpes, Lyon, France; Laboratoire de Biométrie et Biologie Évolutive, Université Lyon 1 UMR CNRS 5558, Lyon, France.
⁴ INRA, UMR1300 Biology, Epidemiology and Risk Analysis in Animal Health, Nantes, France; LUNAM University, Oniris, Nantes Atlantic College of Veterinary Medicine and Food Sciences and Engineering, UMR BioEpAR, Nantes, France.
⁵ Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, 16802, USA.
⁶ GenScale, INRIA Rennes Bretagne-Atlantique, IRISA, Rennes, France. claire.lemaitre@inria.fr
⁷ GenScale, INRIA Rennes Bretagne-Atlantique, IRISA, Rennes, France. pierre.peterlongo@inria.fr

PMID: 25404127
PMCID: PMC4333369
DOI: 10.1093/nar/gku1187

Raluca Uricaru et al. Nucleic Acids Res. 2015 Jan.

Abstract

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly fewer computational resources, shows similar precision/recall values, and its highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.

Figures

**Figure 1.**
discoSnp method diagram. discoSnp is composed of two modules, KisSnp2 and KissReads, which are called by the *run_disco.sh* script.

**Figure 2.**
Toy example of a *bubble* in the *de Bruijn Graph* (k = 4). Bubble generated by a single nucleotide polymorphism. The two polymorphic sequences are …*CTGACCT*… and …*CTGTCCT*…
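As a minimal sketch of the toy example above (an illustration only, not the KisSnp2 implementation), the following Python snippet lists the k-mers of the two polymorphic sequences for k = 4. It assumes that each sequence has length 2k - 1 and is centered on the SNP, which matches the two sequences in the caption; the helper names `kmers` and `diff_positions` are hypothetical, and reverse complements and flanking k-mers are ignored.

```python
def kmers(seq, k):
    """Return the k-mers of seq in left-to-right order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diff_positions(a, b):
    """Return the positions at which two equal-length sequences differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


k = 4
upper = "CTGACCT"  # first polymorphic sequence from the caption
lower = "CTGTCCT"  # second polymorphic sequence from the caption

upper_path = kmers(upper, k)  # k-mers on one path of the bubble
lower_path = kmers(lower, k)  # k-mers on the other path

# The sequences differ at one central position, so every k-mer of one
# path contains the variant and none is shared with the other path.
snp_positions = diff_positions(upper, lower)
shared = set(upper_path) & set(lower_path)

print(upper_path)     # ['CTGA', 'TGAC', 'GACC', 'ACCT']
print(lower_path)     # ['CTGT', 'TGTC', 'GTCC', 'TCCT']
print(snp_positions)  # [3]
```

Under these assumptions each path holds k distinct k-mers, and the two paths would rejoin only at the flanking k-mers that the caption elides.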

**Figure 3.**
Examples of non-symmetrically branching bubbles (a and b) and symmetrically branching bubbles (c and d). Path divergences in bubbles a and b create branching bubbles, but the branching is not symmetric. Divergence is only present in one path (a) or in both paths but with distinct (circled) characters (b). Path divergences in bubbles c and d create symmetrically branching bubbles. Both paths of these bubbles can be right extended (c) or left extended (d) with the same two (circled) characters. With option −b 0 (default), none of these bubbles would be reported as a SNP; with option −b 1, bubbles a and b would be reported as SNPs; with option −b 2, all of them would be output.

**Figure 4.**
Read-coherency and k-read-coherency example. With coverage threshold = 2, schematic example where a sequence is *read-coherent* but not *k-read-coherent*. The leftmost represented k-mer (green) on the sequence is an example where the k-mer starting at this position is covered with three mapped reads. On the other hand, the rightmost represented k-mer (red) is covered by no read, thus illustrating why the sequence is not *k-read-coherent*.
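The distinction in Figure 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the KissReads implementation: it assumes that a sequence is read-coherent when every position is covered by at least `threshold` mapped reads, and k-read-coherent when every k-mer is fully spanned by at least `threshold` mapped reads. Reads are modeled as hypothetical half-open intervals `(start, end)` on the sequence, and the toy coordinates are invented for the example.

```python
def position_coverage(seq_len, reads):
    """Number of reads covering each position of the sequence."""
    cov = [0] * seq_len
    for start, end in reads:
        for i in range(max(start, 0), min(end, seq_len)):
            cov[i] += 1
    return cov


def kmer_coverage(seq_len, reads, k):
    """Number of reads fully spanning each k-mer [i, i + k)."""
    return [sum(1 for s, e in reads if s <= i and i + k <= e)
            for i in range(seq_len - k + 1)]


def is_read_coherent(seq_len, reads, threshold):
    return all(c >= threshold for c in position_coverage(seq_len, reads))


def is_k_read_coherent(seq_len, reads, k, threshold):
    return all(c >= threshold for c in kmer_coverage(seq_len, reads, k))


# Toy data: three reads on the left half, two on the right half,
# and no read spanning the junction between the two halves.
reads = [(0, 6), (0, 6), (0, 6), (6, 12), (6, 12)]

print(is_read_coherent(12, reads, threshold=2))         # True
print(is_k_read_coherent(12, reads, k=4, threshold=2))  # False
print(kmer_coverage(12, reads, k=4))  # [3, 3, 3, 0, 0, 0, 2, 2, 2]
```

Every position is covered at least twice, yet the k-mers straddling the junction are spanned by no read, so the toy sequence is read-coherent without being k-read-coherent.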

**Figure 5.**
discoSnp, cortex and hybrid strategy (soap + gatk) results, depending on the number of input haploid individuals. Soap and gatk were launched with default parameters. For discoSnp and cortex, k-mers having three or fewer occurrences in all datasets were removed. (a) Precision and recall: filled symbols represent the precision and empty symbols represent the recall. (b) Time and memory performances for two (left part) and 30 (right part) individuals.

**Figure 6.**
Distribution of SNPs detected by discoSnp according to their phi coefficient. FP are false positives and TP are true positives. Homo (resp. hetero) stands for homozygous (resp. heterozygous) SNPs. True positive SNPs are then classified according to their genotype in the two individuals.

**Figure 7.**
Comparative results of discoSnp, cortex, bubbleparse and the *hybrid* SOAPdenovo2 + Bowtie2 + GATK approaches on the two diploid human chromosome 1 dataset. Precision versus recall curves are obtained by ranking the predicted SNPs. Each data point is obtained at a given rank threshold, where precision and recall values are computed for all SNPs with better ranks than this threshold. In this framework, cortex does not rank the predicted SNPs; its results are thus represented by a single point. Solid lines for discoSnp and bubbleparse were obtained while discarding all branching bubbles (options −b 0 and depth = 0 respectively), whereas dotted lines were obtained when allowing for some branchings (options −b 1 and depth = 1 respectively).
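The way each data point of such a curve is obtained can be sketched as follows. This is a minimal Python illustration, not the evaluation code used for the figure: the function name `precision_recall_curve`, the toy calls, and the choice of including the call at the threshold rank itself are assumptions made for the example.

```python
def precision_recall_curve(ranked_calls, n_true):
    """Compute one (precision, recall) point per rank threshold.

    ranked_calls: booleans ordered from best to worst rank,
                  True when the predicted SNP is a true positive.
    n_true:       number of true SNPs in the reference set (must be > 0).
    """
    points = []
    true_positives = 0
    for rank, is_true_positive in enumerate(ranked_calls, start=1):
        true_positives += is_true_positive
        precision = true_positives / rank   # among calls ranked so far
        recall = true_positives / n_true    # among all true SNPs
        points.append((precision, recall))
    return points


# Toy ranking: five predictions, four true SNPs in the reference set.
calls = [True, True, False, True, False]
curve = precision_recall_curve(calls, n_true=4)

for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

A tool that does not rank its predictions contributes only the last point of such a list, which is why an unranked result appears as a single point.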

**Figure 8.**
Comparative memory and time performances on the human chromosome 1 dataset. Time values are given with options *depth* = 1 for bubbleparse and −b 1 for discoSnp.

**Figure 9.**
Venn diagrams of isolated SNPs detected by Wong *et al.* (18) (IS set) and by discoSnp. **Left**: Raw discoSnp results. **Right**: Filtered discoSnp results, i.e. SNPs with Phi ≥0.2.

References

1. Xu X., Liu X., Ge S., Jensen J.D., Hu F., Li X., Dong Y., Gutenkunst R.N., Fang L., Huang L., et al. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 2012;30:105–111.
2. Quillery E., Quenez O., Peterlongo P., Plantard O. Development of genomic resources for the tick Ixodes ricinus: isolation and characterization of single nucleotide polymorphisms. Mol. Ecol. Resour. 2014;14:393–400.
3. DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., Hanna M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498.
4. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
5. Li H. Exploring single-sample SNP and indel calling with whole-genome de novo assembly. Bioinformatics. 2012;28:1838–1844.
