PLoS One. 2013;8(3):e58815.
doi: 10.1371/journal.pone.0058815. Epub 2013 Mar 26.

Development of strategies for SNP detection in RNA-seq data: application to lymphoblastoid cell lines and evaluation using 1000 Genomes data

Emma M Quinn et al. PLoS One. 2013.

Abstract

Next-generation RNA sequencing (RNA-seq) maps and analyzes transcriptomes and generates data on sequence variation in expressed genes. Few studies have reported analysis strategies that maximize the yield of high-quality RNA-seq SNP data. We evaluated the performance of different SNP-calling methods following alignment to both the genome and the transcriptome by applying them to RNA-seq data from a HapMap lymphoblastoid cell line sample and comparing the results with sequence variation data from 1000 Genomes. We determined that the best method to achieve high specificity and sensitivity, and the greatest number of SNP calls, is to remove duplicate sequence reads after alignment to the genome and to call SNPs using SAMtools. The accuracy of SNP calls depends on the sequence coverage available. In terms of specificity, 89% of RNA-seq SNP calls were true variants where coverage was >10X. In terms of sensitivity, at >10X coverage 92% of all expected SNPs in expressed exons could be detected. Overall, the results indicate that RNA-seq SNP data are a very useful by-product of sequence-based transcriptome analysis. If RNA-seq is applied to disease tissue samples, and assuming that genes carrying mutations relevant to disease biology are being expressed, a very high proportion of these mutations can be detected.
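The two accuracy measures quoted in the abstract can be illustrated with a small sketch. It follows the wording above: specificity is taken as the fraction of RNA-seq SNP calls that are true variants, and sensitivity as the fraction of expected SNPs that were detected, both restricted to sites above a coverage threshold. The function name, the `(chromosome, position)` site representation, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position); hypothetical representation


def evaluate_calls(
    called: Set[Site],
    truth: Set[Site],
    depth: Dict[Site, int],
    min_depth: int = 10,
) -> Tuple[float, float]:
    """Return (specificity, sensitivity) for sites with coverage > min_depth.

    specificity: share of called SNPs that are in the truth set
    sensitivity: share of truth-set SNPs that were called
    """
    # Keep only sites whose read depth exceeds the threshold (e.g. >10X).
    covered = {site for site, d in depth.items() if d > min_depth}
    called_cov = called & covered
    expected_cov = truth & covered
    true_pos = called_cov & truth

    specificity = len(true_pos) / len(called_cov) if called_cov else float("nan")
    sensitivity = len(true_pos) / len(expected_cov) if expected_cov else float("nan")
    return specificity, sensitivity


# Toy example with made-up sites and depths.
depth = {("1", 100): 20, ("1", 200): 15, ("1", 300): 5, ("2", 50): 30, ("2", 80): 12}
called = {("1", 100), ("1", 200), ("1", 300), ("2", 50)}
truth = {("1", 100), ("1", 200), ("1", 300), ("2", 80)}
spec, sens = evaluate_calls(called, truth, depth)
```

In this toy case the site at depth 5 is excluded from both measures, which mirrors how the reported figures are conditioned on coverage.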

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Analysis strategies and methods for RNA-seq SNP detection.
This figure outlines the analysis strategies and methods used to identify the best-performing methods of RNA-seq SNP detection. We analyzed our data by removing duplicates pre-alignment (strategy A) and post-alignment (strategy B). Reads were aligned to either the genome or the transcriptome, and SNP calls were generated using SAMtools and GATK. This produced 8 sets of calls for analysis (see Table S2a).
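The 8 call sets arise from crossing three two-way choices: when duplicates are removed, which reference is used for alignment, and which caller is used. A minimal sketch of that enumeration follows. The labels `post_genome_gatk` and `post_genome_sam` appear in the figure captions; the remaining labels are extrapolated from that pattern and are an assumption, as is the function name.

```python
from itertools import product


def method_labels():
    """Enumerate the 2 x 2 x 2 combinations of analysis choices."""
    dedup = ("pre", "post")                 # duplicates removed pre- or post-alignment
    reference = ("genome", "transcriptome") # alignment target
    caller = ("sam", "gatk")                # SAMtools or GATK
    return ["_".join(combo) for combo in product(dedup, reference, caller)]


labels = method_labels()
for label in labels:
    print(label)
```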
Figure 2. Number of SNPs per method in RNA-seq data.
This figure displays the number of SNPs called by each of the 8 methods used. The proportions of heterozygous (grey) and homozygous (black) SNP calls are also displayed. The numbers of SNPs called are listed in Table S2a.
Figure 3. Specificity and sensitivity of the SNP calls from RNA-seq data.
This figure displays the specificity (A) and sensitivity (B) of the SNP calls for each of the 8 methods at a range of coverage depths. Solid lines represent calls made where duplicate reads were removed pre-alignment, and broken lines represent calls made where duplicate reads were removed post-alignment.
Figure 4. Specificity and sensitivity of heterozygous and homozygous SNP calls from RNA-seq data.
This figure displays the specificity (A) and sensitivity (B) for heterozygous and homozygous SNP calls for the post_genome_gatk calling method at a range of coverage depths.
Figure 5. Specificity and sensitivity of the SNP calls from RNA-seq data for all three samples.
This figure displays the specificity (A) and sensitivity (B) of the SNP calls for each of the three samples (in colour) at a range of coverage depths using the post_genome_sam method. The black lines plot the averages of all three samples plus 95% confidence intervals.
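One common way to obtain a mean with a 95% confidence interval from three samples is a Student's t interval with 2 degrees of freedom. The sketch below shows that calculation; the caption does not state how the intervals were computed, so the choice of a t interval is an assumption, and the input values are made up for illustration.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 2 degrees of freedom (n = 3).
T_CRIT_DF2 = 4.303


def mean_with_ci(values):
    """Return (mean, lower, upper) for exactly three per-sample values."""
    if len(values) != 3:
        raise ValueError("this sketch hard-codes the t value for n = 3")
    mean = statistics.mean(values)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    half_width = T_CRIT_DF2 * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical per-sample specificities at one coverage depth.
mean, lower, upper = mean_with_ci([0.88, 0.90, 0.89])
```

With only three samples the critical value is large, so the interval is wide relative to the spread of the individual values.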
