Nucleic Acids Res. 2015 Jan;43(2):e11.
doi: 10.1093/nar/gku1187. Epub 2014 Nov 17.

Reference-free detection of isolated SNPs

Raluca Uricaru et al. Nucleic Acids Res. 2015 Jan.

Abstract

Detecting single nucleotide polymorphisms (SNPs) between genomes is becoming a routine task with next-generation sequencing. Generally, SNP detection methods use a reference genome. As non-model organisms are increasingly investigated, the need for reference-free methods has been amplified. Most of the existing reference-free methods have fundamental limitations: they can only call SNPs between exactly two datasets, and/or they require a prohibitive amount of computational resources. The method we propose, discoSnp, detects both heterozygous and homozygous isolated SNPs from any number of read datasets, without a reference genome, and with very low memory and time footprints (billions of reads can be analyzed with a standard desktop computer). To facilitate downstream genotyping analyses, discoSnp ranks predictions and outputs quality and coverage per allele. Compared to finding isolated SNPs using a state-of-the-art assembly and mapping approach, discoSnp requires significantly fewer computational resources and shows similar precision/recall values, and its highly ranked predictions are less likely to be false positives. An experimental validation was conducted on an arthropod species (the tick Ixodes ricinus) on which de novo sequencing was performed. Among the predicted SNPs that were tested, 96% were successfully genotyped and truly exhibited polymorphism.

Figures

Figure 1.
discoSnp method diagram. discoSnp is composed of two modules, KisSnp2 and KissReads, which are called by the run_disco.sh script.
Figure 2.
Toy example of a bubble in the de Bruijn graph (k = 4). The bubble is generated by a single nucleotide polymorphism. The two polymorphic sequences are …CTGACCT… and …CTGTCCT….
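The toy bubble of Figure 2 can be reproduced in a few lines. This is a minimal sketch, not discoSnp's implementation: the helper names `kmers` and `is_path` are hypothetical, and only the seven characters shown in the caption are used (the elided flanking sequence is ignored).

```python
def kmers(seq, k):
    """Return the successive k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def is_path(nodes):
    """True if each k-mer overlaps the next one by k-1 characters,
    i.e. the k-mers form a path in the de Bruijn graph."""
    return all(a[1:] == b[:-1] for a, b in zip(nodes, nodes[1:]))


K = 4
path_a = kmers("CTGACCT", K)  # allele carrying A
path_b = kmers("CTGTCCT", K)  # allele carrying T

print(path_a)                      # ['CTGA', 'TGAC', 'GACC', 'ACCT']
print(path_b)                      # ['CTGT', 'TGTC', 'GTCC', 'TCCT']
print(set(path_a) & set(path_b))   # set(): the two paths share no k-mer
print(is_path(path_a), is_path(path_b))  # True True
```

Each allele yields k = 4 k-mers that contain the variable position, so the two paths are disjoint. They start after the shared (k−1)-mer CTG and end on the shared (k−1)-mer CCT, which is where the paths diverge and rejoin in the full graph.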
Figure 3.
Examples of non-symmetrically branching bubbles (a and b) and symmetrically branching bubbles (c and d). Path divergences in bubbles a and b create branching bubbles, but the branching is not symmetric. Divergence is only present in one path (a) or in both paths but with distinct (circled) characters (b). Path divergences in bubbles c and d create symmetrically branching bubbles. Both paths of these bubbles can be right extended (c) or left extended (d) with the same two (circled) characters. With option −b 0 (default), none of these bubbles would be considered a SNP; with option −b 1, bubbles a and b would be considered SNPs; with option −b 2, all of them would be output.
Figure 4.
Read-coherency and k-read-coherency example. With coverage threshold = 2, schematic example where a sequence is read-coherent but not k-read-coherent. The leftmost represented k-mer (green) on the sequence is an example where the k-mer starting at this position is covered with three mapped reads. On the other hand, the rightmost represented k-mer (red) is covered by no read, thus illustrating why the sequence is not k-read-coherent.
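The distinction in Figure 4 can be made concrete with a small sketch. The definitions used here are assumptions inferred from the caption, not taken from the discoSnp code: a sequence is treated as read-coherent when every position is covered by at least the threshold number of mapped reads, and as k-read-coherent when every k-mer is fully spanned by at least that many reads. Reads are modelled as hypothetical half-open intervals `(start, end)` on the sequence.

```python
def is_read_coherent(seq_len, reads, min_cov):
    """Assumed definition: every position is covered by >= min_cov reads."""
    return all(
        sum(1 for start, end in reads if start <= pos < end) >= min_cov
        for pos in range(seq_len)
    )


def is_k_read_coherent(seq_len, reads, k, min_cov):
    """Assumed definition: every k-mer is fully spanned by >= min_cov reads."""
    return all(
        sum(1 for start, end in reads if start <= i and i + k <= end) >= min_cov
        for i in range(seq_len - k + 1)
    )


# Illustrative data only: three reads on the left, two on the right,
# overlapping by a single position.
reads = [(0, 6), (0, 6), (0, 6), (5, 10), (5, 10)]

print(is_read_coherent(10, reads, min_cov=2))         # True
print(is_k_read_coherent(10, reads, k=4, min_cov=2))  # False
```

In this illustrative layout every position has at least two reads over it, but the 4-mer starting at position 3 is not fully contained in any read, so the sequence passes the first check and fails the second.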
Figure 5.
discoSnp, cortex and hybrid strategy (soap + gatk) results, depending on the number of input haploid individuals. Soap and gatk were launched with default parameters. For discoSnp and cortex, k-mers having three or fewer occurrences in all datasets were removed. (a) Precision and recall: filled symbols represent the precision and empty symbols represent the recall. (b) Time and memory performances for two (left part) and 30 (right part) individuals.
Figure 6.
Distribution of SNPs detected by discoSnp according to their phi coefficient. FP are false positives and TP are true positives. Homo (resp. hetero) stands for homozygous (resp. heterozygous) SNPs. True positive SNPs are then classified according to their genotype in the two individuals.
Figure 7.
Comparative results of discoSnp, cortex, bubbleparse and the hybrid SOAPdenovo2 + Bowtie2 + GATK approaches on the dataset of two diploid human chromosome 1 individuals. Precision versus recall curves are obtained by ranking the predicted SNPs. Each data point is obtained at a given rank threshold, where precision and recall values are computed for all SNPs with better ranks than this threshold. In this framework, cortex does not rank the predicted SNPs, so its results are represented by a single point. Plain lines for discoSnp and bubbleparse were obtained while discarding all branching bubbles (options −b 0 and depth = 0 respectively), whereas dotted lines were obtained when allowing for some branchings (options −b 1 and depth = 1 respectively).
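The rank-threshold curves described for Figure 7 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: the function name `precision_recall_by_rank` is hypothetical, predictions are assumed to be already sorted from best to worst rank, and each threshold is taken to include the top r predictions.

```python
def precision_recall_by_rank(ranked_is_true, n_true_total):
    """Return one (precision, recall) point per rank threshold.

    ranked_is_true: booleans sorted from best to worst rank, True when
                    the predicted SNP is a real one.
    n_true_total:   number of real SNPs in the reference set.
    """
    points = []
    true_positives = 0
    for rank, is_true in enumerate(ranked_is_true, start=1):
        true_positives += int(is_true)
        precision = true_positives / rank        # among the top `rank` calls
        recall = true_positives / n_true_total   # among all real SNPs
        points.append((precision, recall))
    return points


# Made-up labels for four ranked predictions, with four real SNPs in total.
curve = precision_recall_by_rank([True, True, False, True], n_true_total=4)
for precision, recall in curve:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

A tool that does not rank its predictions contributes only the last point of such a curve, which is why an unranked method appears as a single point in this kind of plot.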
Figure 8.
Comparative memory and time performances on the human chromosome 1 dataset. Time values are given with options depth = 1 for bubbleparse and −b 1 for discoSnp.
Figure 9.
Venn diagrams of isolated SNPs detected by Wong et al. (18) (IS set) and by discoSnp. Left: Raw discoSnp results. Right: Filtered discoSnp results, i.e. SNPs with Phi ≥0.2.

References

    1. Xu X., Liu X., Ge S., Jensen J. D., Hu F., Li X., Dong Y., Gutenkunst R. N., Fang L., Huang L., et al. Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat. Biotechnol. 2012;30:105–111.
    2. Quillery E., Quenez O., Peterlongo P., Plantard O. Development of genomic resources for the tick Ixodes ricinus: isolation and characterization of single nucleotide polymorphisms. Mol. Ecol. Resour. 2014;14:393–400.
    3. DePristo M.A., Banks E., Poplin R., Garimella K.V., Maguire J.R., Hartl C., Philippakis A.A., del Angel G., Rivas M.A., Hanna M., et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 2011;43:491–498.
    4. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
    5. Li H. Exploring single-sample SNP and indel calling with whole-genome de novo assembly. Bioinformatics. 2012;28:1838–1844.
