Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression

Narayanan Raghupathy¹, Kwangbom Choi¹, Matthew J Vincent¹, Glen L Beane¹, Keith S Sheppard¹, Steven C Munger¹, Ron Korstanje¹, Fernando Pardo-Manuel de Villena², Gary A Churchill¹

PMID: 29444201
PMCID: PMC6022640
DOI: 10.1093/bioinformatics/bty078

Narayanan Raghupathy et al. Bioinformatics. 2018 Jul 1;34(13):2177-2184.

Affiliations

¹ The Jackson Laboratory, Bar Harbor, USA.
² Department of Genetics, The University of North Carolina, Chapel Hill, USA.

Abstract

Motivation: Allele-specific expression (ASE) refers to the differential abundance of the allelic copies of a transcript. RNA sequencing (RNA-seq) can provide quantitative estimates of ASE for genes with transcribed polymorphisms. When short-read sequences are aligned to a diploid transcriptome, read-mapping ambiguities confound our ability to directly count reads. Multi-mapping reads aligning equally well to multiple genomic locations, isoforms or alleles can comprise the majority (>85%) of reads. Discarding them can result in biases and substantial loss of information. Methods have been developed that use weighted allocation of read counts but these methods treat the different types of multi-reads equivalently. We propose a hierarchical approach to allocation of read counts that first resolves ambiguities among genes, then among isoforms, and lastly between alleles. We have implemented our model in EMASE software (Expectation-Maximization for Allele Specific Expression) to estimate total gene expression, isoform usage and ASE based on this hierarchical allocation.

Results: Methods that align RNA-seq reads to a diploid transcriptome incorporating known genetic variants improve estimates of ASE and total gene expression compared to methods that use reference genome alignments. Weighted allocation methods outperform methods that discard multi-reads. Hierarchical allocation of reads improves estimation of ASE even when data are simulated from a non-hierarchical model. Analysis of RNA-seq data from F1 hybrid mice using EMASE reveals widespread ASE associated with cis-acting polymorphisms and a small number of parent-of-origin effects.

Availability and implementation: EMASE software is available at https://github.com/churchill-lab/emase.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Multi-read proportions in hybrid mouse data. For each read, we determined whether it aligns to multiple genomic locations, multiple isoforms of a gene or multiple alleles. If, for example, a read is a genomic multi-read and is also an isoform multi-read for at least one of its genomic alignments, the read is counted as an isoform multi-read. Complex multi-reads are shown at the intersections of the Venn diagram. The proportion of reads that align uniquely at all levels is 14.1%, as shown.
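The classification described in this caption can be sketched in a few lines of Python. This is only an illustration, not the EMASE implementation: the `(gene, allele, isoform)` tuple representation, the function name `classify_read` and the level labels are assumptions made for the example.

```python
from collections import defaultdict


def classify_read(alignments):
    """Return the set of levels at which one read is a multi-read.

    `alignments` holds (gene, allele, isoform) tuples, a hypothetical
    representation chosen for this sketch.
    """
    alignments = list(alignments)
    genes = set()
    isoforms_by_gene = defaultdict(set)
    alleles_by_isoform = defaultdict(set)
    for gene, allele, isoform in alignments:
        genes.add(gene)
        isoforms_by_gene[gene].add(isoform)
        alleles_by_isoform[(gene, isoform)].add(allele)

    levels = set()
    if len(genes) > 1:
        levels.add("genomic")
    # Isoform multi-read: more than one isoform for at least one gene.
    if any(len(s) > 1 for s in isoforms_by_gene.values()):
        levels.add("isoform")
    # Allele multi-read: more than one allele for at least one isoform.
    if any(len(s) > 1 for s in alleles_by_isoform.values()):
        levels.add("allele")
    return levels
```

A read that returns an empty set aligns uniquely at every level; a read that returns more than one label would fall in an intersection of the Venn diagram.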

**Fig. 2.**
Hierarchical allocation of multi-reads. (a) The EMASE model hierarchies are illustrated for a gene ($g$) with two alleles ($a_1$, $a_2$) and three isoforms ($i_1$, $i_2$, $i_3$). The model hierarchy determines the order in which the alignments of a multi-read are resolved. For example, under EMASE model 1 ($M_1$), we first resolve genomic multi-read alignments, then allele alignments, and finally isoform alignments. Under EMASE model 4 ($M_4$), all alignments of a multi-read are treated equally and are resolved without any order. (b) Probabilistic allocation of a complex multi-read. The alignment profile (left) is an indicator matrix with ‘1’ set at the aligned positions of a multi-read in a diploid transcriptome. Dark gray lines indicate levels of hierarchy within which weights are being allocated. Light gray lines distinguish items in each level of hierarchy. In EMASE, a multi-read is allocated along four different hierarchies. For example, in $M_1$ a read with the given alignment profile is sequentially allocated at the level of gene, then allele and finally isoform. Note that for models $M_1$, $M_2$ and $M_3$, the presence of three alignments to gene $g_1$ is counted as a single event and thus the weight allocated to each gene is $\frac{1}{2}$. Under $M_4$, each alignment is weighted equally; gene $g_1$ receives $\frac{3}{4}$ of the total weight and gene $g_2$ receives $\frac{1}{4}$. (c) The EMASE parameter estimation algorithm is carried out iteratively. Each read alignment profile (1) is assigned weights in proportion to the current estimates of transcript proportion (2). Then weights are summed to obtain the expected read counts (3). Counts are normalized by their effective transcript length to obtain new estimates of transcript proportions. This cycle is repeated until the transcript proportion parameters converge.
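The arithmetic in panel (b) and the cycle in panel (c) can be illustrated with a minimal Python sketch. It is not the EMASE code. The alignment profile below is hypothetical (three alignments in one gene, one in another, matching the caption's counts but not necessarily the figure's exact cells), the starting weights are uniform, and `em_flat` covers only the non-hierarchical case. All function names are assumptions made for the example.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical profile for one read: (gene, allele, isoform) targets.
PROFILE = [
    ("g1", "a1", "i1"),
    ("g1", "a1", "i2"),
    ("g1", "a2", "i1"),
    ("g2", "a1", "i1"),
]


def allocate_flat(profile):
    """M4-style: every alignment of the read gets the same weight."""
    share = Fraction(1, len(profile))
    return {target: share for target in profile}


def allocate_hierarchical(profile, weight=Fraction(1), level=0):
    """M1-style: split evenly among genes, then alleles, then isoforms."""
    if level == 3:
        return {profile[0]: weight}  # one fully specified target remains
    groups = defaultdict(list)
    for target in profile:
        groups[target[level]].append(target)
    result = {}
    for members in groups.values():
        share = weight / len(groups)
        result.update(allocate_hierarchical(members, share, level + 1))
    return result


def gene_totals(weights):
    """Sum the allocated weights per gene."""
    totals = defaultdict(Fraction)
    for (gene, _, _), w in weights.items():
        totals[gene] += w
    return dict(totals)


def em_flat(profiles, eff_len, max_iter=1000, tol=1e-10):
    """Panel (c) cycle without hierarchy: weight, sum, normalize, repeat."""
    theta = {t: 1.0 / len(eff_len) for t in eff_len}
    for _ in range(max_iter):
        counts = dict.fromkeys(eff_len, 0.0)
        for profile in profiles:
            total = sum(theta[t] for t in profile)
            for t in profile:
                # Weight in proportion to the current estimate.
                counts[t] += theta[t] / total
        # Normalize expected counts by effective length, then rescale.
        depth = {t: counts[t] / eff_len[t] for t in eff_len}
        norm = sum(depth.values())
        new_theta = {t: d / norm for t, d in depth.items()}
        delta = max(abs(new_theta[t] - theta[t]) for t in eff_len)
        theta = new_theta
        if delta < tol:
            break
    return theta
```

With this profile, `gene_totals(allocate_flat(PROFILE))` gives gene `g1` three quarters of the read, whereas `gene_totals(allocate_hierarchical(PROFILE))` gives each gene one half, which is the contrast the caption draws between $M_4$ and the hierarchical models.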
