Nucleic Acids Res. 2012 Sep;40(16):e127.
doi: 10.1093/nar/gks425. Epub 2012 May 14.

A new strategy to reduce allelic bias in RNA-Seq readmapping

Ravi Vijaya Satya et al. Nucleic Acids Res. 2012 Sep.

Abstract

Accurate estimation of expression levels from RNA-Seq data entails precise mapping of the sequence reads to a reference genome. Because the standard reference genome contains only one allele at any given locus, reads overlapping polymorphic loci that carry a non-reference allele are at least one mismatch away from the reference and, hence, are less likely to be mapped. This bias in read mapping leads to inaccurate estimates of allele-specific expression (ASE). To address this read-mapping bias, we propose the construction of an enhanced reference genome that includes the alternative alleles at known polymorphic loci. We show that mapping to this enhanced reference reduced the read-mapping biases, leading to more reliable estimates of ASE. Experiments on simulated data show that the proposed strategy reduced the number of loci with mapping bias by ≥ 63% when compared with a previous approach that relies on masking the polymorphic loci and by ≥ 18% when compared with the standard approach that uses an unaltered reference. When we applied our strategy to actual RNA-Seq data, we found that it mapped up to 15% more reads than the previous approaches and identified many seemingly incorrect inferences made by them.
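The source of the bias described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' pipeline: the sequences, the SNP position, and the helper `mismatches` are hypothetical, and a simple mismatch count stands in for a real aligner's scoring.

```python
def mismatches(read, reference, start):
    """Count mismatches between a read and the reference window at `start`."""
    window = reference[start:start + len(read)]
    return sum(a != b for a, b in zip(read, window))

# Hypothetical toy reference with a SNP at 0-based position 7 (T in the reference, C as alternative).
reference = "ACGTACGTACGTACGT"
snp_pos = 7
alt_haplotype = reference[:snp_pos] + "C" + reference[snp_pos + 1:]

# Error-free reads of length 8 drawn from each haplotype at the same offset.
start, read_len = 3, 8
ref_read = reference[start:start + read_len]
alt_read = alt_haplotype[start:start + read_len]

ref_cost = mismatches(ref_read, reference, start)            # read with the reference allele
alt_cost = mismatches(alt_read, reference, start)            # read with the non-reference allele
alt_cost_enhanced = mismatches(alt_read, alt_haplotype, start)  # same read against the alternative allele

print(ref_cost, alt_cost, alt_cost_enhanced)
```

Under these assumptions, the read carrying the non-reference allele starts one mismatch behind the reference-allele read, so any sequencing error pushes it past a fixed mismatch budget sooner. Against a sequence that contains the alternative allele, that handicap disappears.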

Figures

Figure 1.
Schematic representation of the enhanced segments added for two SNPs, S1 and S2. The read length is indicated by r. (A) Enhanced segments added when the distance between two adjacent SNPs S1 and S2 is ≥r. No read can overlap both SNPs in this scenario. A single enhanced segment with the non-reference allele is added for each SNP. The enhanced segment extends r − 1 bases on either side of the SNP to ensure an exact match with any read carrying the non-reference allele. (B) Scenario when the distance between S1 and S2 is <r − 1. Because there can be reads that overlap both SNPs, we need to add three segments to cover all possible haplotypes formed by S1 and S2. It is also necessary that none of the added segments is identical to another enhanced segment (or the reference) in any window of length ≥r. This ensures that the read uniquely maps to the reference or one of the enhanced segments. Multiple solutions satisfying these conditions are possible. The figure shows one such possible solution.
Figure 2.
Mapping results of simulated 35-bp reads for the three approaches. (A) The enhanced reference approach was able to map a much higher percentage of the input reads, especially for higher error rates. (B) Approximately 50% of the mapped reads carried the reference allele, both for the masked reference and the enhanced reference approaches.
Figure 3.
Histograms of the proportions of mapped reads for the different mapping approaches. (A) Mapping against the unaltered reference showed a clear bias toward the reference allele. (B) Mapping against the masked reference showed that there was no systematic bias, but a significant percentage of the loci were still biased. (C) Mapping against the enhanced reference eliminated the bias at the majority of the loci.
Figure 4.
Overlaps between loci reported to be allele specific in different methods. Venn diagrams show the overlaps between the number of loci that were positive for ASE at 1% FDR in (A) GM19238 and (B) GM19239. The masked reference approach missed many loci reported by the other two methods. Loci unique to the masked approach were the result of its inability to map reads with one of the alleles and, hence, were false positives. The enhanced reference approach identified many loci that were missed by the other two approaches.
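The isolated-SNP case in Figure 1A (one enhanced segment per SNP, extending r − 1 bases on either side) can be sketched as follows. This is an illustrative reading of the caption only: the function name `enhanced_segment`, the toy sequence, and the clipping at sequence ends are assumptions, and the multi-SNP case in Figure 1B is not handled.

```python
def enhanced_segment(reference, snp_pos, alt_allele, read_len):
    """Return (start, segment) for one isolated SNP.

    The segment carries the non-reference allele and extends read_len - 1
    bases on either side of the SNP (clipped at the sequence ends), so any
    read of length read_len that overlaps the SNP fits inside it.
    """
    flank = read_len - 1
    left = max(0, snp_pos - flank)
    right = min(len(reference), snp_pos + flank + 1)
    segment = reference[left:snp_pos] + alt_allele + reference[snp_pos + 1:right]
    return left, segment

# Hypothetical toy data: SNP at 0-based position 10 (G in the reference, A as alternative).
reference = "ACGTACGTACGTACGTACGT"
snp_pos, alt_allele, read_len = 10, "A", 4

seg_start, segment = enhanced_segment(reference, snp_pos, alt_allele, read_len)

# Every error-free read overlapping the SNP on the alternative haplotype
# should match the enhanced segment exactly.
alt_haplotype = reference[:snp_pos] + alt_allele + reference[snp_pos + 1:]
overlapping_reads = [
    alt_haplotype[s:s + read_len]
    for s in range(snp_pos - read_len + 1, snp_pos + 1)
]
all_match = all(read in segment for read in overlapping_reads)
print(seg_start, segment, all_match)
```

Away from the sequence ends the segment has length 2r − 1, which is the shortest window that contains every read overlapping the SNP. In practice such segments would be appended to the reference as extra sequences and hits translated back to genomic coordinates using the returned start offset.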
