Nucleic Acids Res. 2009 Sep;37(16):e106.
doi: 10.1093/nar/gkp507. Epub 2009 Jun 15.

Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq

Iouri Chepelev et al. Nucleic Acids Res. 2009 Sep.

Abstract

Whole-genome resequencing remains a costly way to detect genetic mutations that lead to altered forms of proteins and may be associated with disease development. Since the majority of disease-related single nucleotide variations (SNVs) are found in protein-coding regions, we propose to identify SNVs in expressed exons of the human genome using the recently developed RNA-Seq technique. We identify 12 176 SNVs in Jurkat T cells and 10 621 SNVs in CD4+ T cells from a healthy donor. Interestingly, our data show that one copy of the TAL-1 proto-oncogene has a point mutation in the 3' UTR and only the mutant allele is expressed in Jurkat cells. We provide a comprehensive dataset for further understanding the cancer biology of Jurkat cells. Our results indicate that RNA-Seq is a cost-effective and efficient strategy to systematically identify SNVs in the expressed regions of the human genome.

Figures

Figure 1.
Flow chart of single nucleotide variation identification in expressed exons using RNA-Seq.
Figure 2.
Redundant Reads Filter and SNV probability calculation examples. (A) There are nine reads that map uniquely to the same genomic location (top box). Nucleotide mismatches with reference sequence are highlighted in red. Filter 1 retains a single copy of each read. Thus, only five reads remain after Filter 1 is applied (middle box). There are two U1 reads, two U2 reads and one U0 read in the middle box. Filter 2 randomly selects one U1, one U2 and one U0 read. This leaves three reads at the same genomic location (bottom box). (B) Example of SNV probability calculation. Colored in red is a candidate SNV site. Seven short reads map uniquely to that site. The reference nucleotide is T. Five reads have nucleotides that differ from the reference nucleotide and two reads have nucleotide T at the candidate SNV site. Let the error rate estimated from the total number of U0, U1 and U2 nonredundant reads be q = 0.02. The binomial (random chance) probability to observe two matches and five mismatches at the same location is proportional to q^5 (1−q)^2. The P-value is given by the binomial probability of observing five or more mismatches in a seven-read alignment and it is equal to 6.5 × 10^−8.
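The P-value in panel (B) can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `binomial_tail` and its parameters are hypothetical, and it assumes each read is an independent trial with per-base error rate q.

```python
from math import comb


def binomial_tail(n_reads: int, n_mismatch: int, q: float) -> float:
    """Upper-tail probability P(X >= n_mismatch) for X ~ Binomial(n_reads, q)."""
    return sum(
        comb(n_reads, k) * q**k * (1 - q) ** (n_reads - k)
        for k in range(n_mismatch, n_reads + 1)
    )


# Values from the caption: seven reads, five mismatches, q = 0.02.
p_value = binomial_tail(7, 5, 0.02)
print(f"{p_value:.1e}")  # 6.5e-08
```

The k = 5 term, 21 × q^5 × (1−q)^2, contributes almost all of the total; the k = 6 and k = 7 terms add less than 1%.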
Figure 3.
Demonstration that Redundant Reads Filter is necessary. (A) As described in ‘Materials and methods’ section, application of the redundant reads filter (Filter 1 + Filter 2) to uniquely mapped reads leaves at most three reads at a given genomic location: one U0, one U1 and one U2 read. By restricting the number of reads that can map to the same genomic location, we reduce the false-positive rate of SNV detection. The evidence for the presence of an SNV comes mainly from overlapping but noncoincident reads. There are many overlapping but noncoincident reads that can cover a single SNV. In fact, there can still be as many as 90 reads of length 30 bp that cover a single SNV after the filtering step (30 possible start positions, each retaining up to three reads). Thus, the statistical power to detect the SNV is not reduced by the filtering procedure. (B) The number of detected (P-value = 10^−9) known, i.e. SNPs from dbSNP database, and unknown (novel) SNVs using reads filtered using four different filters: Filter A is the Redundant Reads Filter; Filter B is Filter 1 followed by randomly selecting two reads each from U1 and U2 categories; Filter C is Filter 1 followed by randomly selecting three reads each from U1 and U2 categories; the last filter is an empty filter, i.e. no filtering of unique reads is done. The number of detected known SNVs is not sensitive to the filtering method used, confirming a very low false-positive rate among detected known SNVs. However, the number of detected unknown SNVs is much higher for Filters B, C and No filter than for Filter A, demonstrating the high false-positive rates that result from these alternative filters. Thus, Filter A is the best of the four filters.
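The two-step filter described in the captions (Filter 1 followed by Filter 2) might be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the `Read` record and its fields are hypothetical, U0, U1 and U2 are taken to mean unique alignments with 0, 1 and 2 mismatches, strand is ignored, and the toy sequences are made up to mirror the nine-read example of Figure 2.

```python
import random
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    start: int       # genomic start of the unique alignment (hypothetical field)
    sequence: str    # read sequence as aligned
    mismatches: int  # mismatches against the reference: 0, 1 or 2


def filter_1(reads):
    """Keep a single copy of each distinct read (same start and sequence)."""
    return sorted(set(reads))


def filter_2(reads, rng):
    """At each start position keep one randomly chosen read per mismatch class."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.start, read.mismatches)].append(read)
    return [rng.choice(group) for _, group in sorted(groups.items())]


def redundant_reads_filter(reads, seed=0):
    """Apply Filter 1, then Filter 2, with a seeded random generator."""
    return filter_2(filter_1(reads), random.Random(seed))


# Toy data: (sequence, mismatches, number of copies), nine reads at one location.
copies = [("ACGTACGTAC", 0, 3), ("ACGTTCGTAC", 1, 2), ("ACGAACGTAC", 1, 1),
          ("TCGTACGTAA", 2, 2), ("ACCTACGGAC", 2, 1)]
reads = [Read(100, seq, mm) for seq, mm, n in copies for _ in range(n)]

after_1 = filter_1(reads)
kept = redundant_reads_filter(reads)
print(len(reads), len(after_1), len(kept))  # 9 5 3
```

Seeding the generator keeps the random choice in Filter 2 reproducible between runs; which U1 and U2 read survives depends on the seed, but the count per location does not.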
Figure 4.
Read coverage analysis and cost analysis of SNV detection. (A) Percentage of exonic sequences passing coverage threshold. Three curves correspond to different numbers of uniquely mapped nonredundant reads: 13 million (Jurkat), 7 million (random subsample of 50% of Jurkat reads) and 26 million (Jurkat + CD4). For example, about 30% of exonic regions are covered at least 5-fold by nonredundant uniquely mapped reads in the Jurkat sample. In the combined Jurkat and CD4 sample, about 40% of exonic regions are covered at least 5-fold. (B) Two curves correspond to estimates of sequencing costs for homozygous (red curve) and heterozygous (blue dotted curve) SNV detection in the CD4+ sample. About 80% of all homozygous SNVs in expressed (RPKM ≥ 1) exons can be detected using 67 million 30-bp nonredundant unique reads (∼2000 Mbp). At this sequencing depth, about 55% of all heterozygous SNVs in expressed exons can be detected. (See ‘Materials and methods’ section for details on derivation of cost curves.)
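The quantity plotted in panel (A), the percentage of exonic bases covered at least k-fold, can be computed from per-base depths as in the sketch below. The function name `percent_passing` and the depth values are hypothetical toy data, not results from the paper; the final lines only check the caption's read-count-to-Mbp conversion.

```python
def percent_passing(depths, thresholds):
    """Percentage of positions whose read depth is at least each threshold."""
    if not depths:
        raise ValueError("depths must not be empty")
    total = len(depths)
    return {t: 100.0 * sum(d >= t for d in depths) / total for t in thresholds}


# Toy per-base depths over ten exonic positions (made up for illustration).
toy_depths = [0, 1, 2, 5, 8, 12, 3, 0, 6, 5]
curve = percent_passing(toy_depths, thresholds=[1, 5, 10])
print(curve)  # {1: 80.0, 5: 50.0, 10: 10.0}

# Sequenced bases for 67 million reads of 30 bp, in Mbp.
total_mbp = 67_000_000 * 30 / 1_000_000
print(total_mbp)  # 2010.0
```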
Figure 5.
Summary of results. (A) Venn diagram of single nucleotide variants (SNVs) detected in Jurkat and CD4 samples. (B) Summary table of SNVs detected in Jurkat and CD4 samples. Numbers in brackets are SNVs that are novel, i.e. not present in the dbSNP Build 126 database.
