PLoS Comput Biol. 2022 Sep 26;18(9):e1010552.
doi: 10.1371/journal.pcbi.1010552. eCollection 2022 Sep.

Towards mouse genetic-specific RNA-sequencing read mapping

Nastassia Gobet et al. PLoS Comput Biol.

Abstract

Genetic variations affect behavior and cause disease, but understanding how these variants drive complex traits is still an open question. A common approach is to link the genetic variants to intermediate molecular phenotypes such as the transcriptome using RNA-sequencing (RNA-seq). Paradoxically, these variants between the samples are usually ignored at the beginning of RNA-seq analyses of many model organisms. This can skew the transcriptome estimates that are used later for downstream analyses, such as expression quantitative trait locus (eQTL) detection. Here, we assessed the impact of reference-based analysis on the transcriptome and eQTLs in a widely used mouse genetic population: the BXD panel of recombinant inbred lines. We highlight existing reference bias in the transcriptome data analysis and propose practical solutions that combine available genetic variants, genotypes, and the genome reference sequence. The use of custom BXD line references improved downstream analysis compared to the classical genome reference. These insights would likely benefit genetic studies with a transcriptomic component and demonstrate that genome references need to be reassessed and improved.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overview of strategies to utilize genomic variants in transcriptome read mapping in inbred mouse lines.
A. BXD mouse recombinant inbred panel. Samples came from BXD advanced recombinant inbred lines, their parental inbred strains, i.e., C57BL/6J (B6) and DBA/2J (D2), and the first-generation cross between the parental strains (F1). B. The 3 RNA-seq read mapping strategies used in this study. In the ‘two parental assemblies’ strategy (left), the reads of all samples were mapped to the classical mouse genome assembly (GRCm38 or mm10) and to the D2 assembly. The ‘BXD-specific references’ (middle) were made from GRCm38 and BXD-specific variants. There is one reference for each BXD line, and the reads of each sample were mapped to the corresponding reference. The ‘two parental references’ strategy (right) is an intermediate strategy in which the D2-specific reference was built from the GRCm38 assembly and D2-specific variants. C. BXD genotypes available from GeneNetwork (genotypes) and D2-specific genomic variants (SNVs, indels, SVs) available from dbSNP. D. Genotype imputation workflow. D2 haplotype blocks were delineated based on available genotypes in the BXD lines. D2-specific variants within these D2 blocks were included in the BXD-specific references. B6 regions or alleles are in black, D2 regions or alleles are in brown.
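
The imputation step in panel D can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 'B'/'D' genotype coding, and the choice to bound each block by its first and last D2 marker are assumptions made for illustration, and the positions are toy values.

```python
from typing import List, Tuple

def d2_blocks(markers: List[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Return (start, end) positions of runs of consecutive 'D' genotypes.

    `markers` holds (position, genotype) pairs for one chromosome of one
    BXD line, sorted by position. Any genotype other than 'D' (for example
    'B', or a heterozygous/unknown call) closes the current block.
    """
    blocks = []
    start = end = None
    for pos, genotype in markers:
        if genotype == "D":
            if start is None:
                start = pos          # first D2 marker opens a block
            end = pos                # extend the block to this marker
        elif start is not None:
            blocks.append((start, end))
            start = end = None
    if start is not None:            # block running to the last marker
        blocks.append((start, end))
    return blocks

def variants_in_blocks(variant_positions: List[int],
                       blocks: List[Tuple[int, int]]) -> List[int]:
    """Keep the D2-specific variants that fall inside a D2 block."""
    return [p for p in variant_positions
            if any(start <= p <= end for start, end in blocks)]

# Toy example (hypothetical positions and genotypes).
markers = [(100, "B"), (200, "D"), (300, "D"), (400, "B"), (500, "D")]
blocks = d2_blocks(markers)
selected = variants_in_blocks([150, 250, 300, 450, 500], blocks)
print(blocks)
print(selected)
```

Bounding blocks by the outermost D2 markers is a conservative choice: variants lying between a B6 marker and the neighbouring D2 marker are left out, because the recombination breakpoint in that interval is unknown.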
Fig 2. Two parental assemblies strategy.
A. Mappability of all samples on the 2 parental assemblies (samples mapped on GRCm38: black symbols; on the D2 assembly: brown symbols) using a permissive mapping setting (STAR default) in cortex (left) and liver (right). Mappability was estimated as the number of uniquely mapped reads expressed as the % of all reads. B. Mappability in samples from the parental strains and their reciprocal F1 offspring (BxD and DxB) on the 2 parental assemblies using a restrictive mapping setting allowing 0 mismatches. Same legend as in A. C. Mappability of parental and F1 samples on the 2 parental assemblies using a restrictive mapping setting but allowing up to 10 mismatches. Same legend as in A. D. Differential mapping (DM) analysis of the D2 assembly compared to GRCm38 in the cortex (left) or in the liver (right). Genes are classified as DM genes if the FDR-adjusted p-value < 0.05 (red) or as non-DM genes otherwise (black).
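
As a minimal sketch, the two quantities in this caption can be computed as follows in Python. The mappability formula follows the caption's definition. The use of the Benjamini-Hochberg procedure for the FDR adjustment is an assumption, since the caption does not name the method, and the read counts and p-values are toy values.

```python
from typing import List

def mappability(uniquely_mapped: int, total_reads: int) -> float:
    """Uniquely mapped reads expressed as a percentage of all reads."""
    return 100.0 * uniquely_mapped / total_reads

def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def classify_dm(pvalues: List[float], threshold: float = 0.05) -> List[bool]:
    """True for DM genes (adjusted p-value below the threshold)."""
    return [q < threshold for q in benjamini_hochberg(pvalues)]

# Toy example (hypothetical numbers).
print(mappability(8_500_000, 10_000_000))
print(classify_dm([0.01, 0.04, 0.03, 0.20]))
```

In the toy example only the first gene is classified as DM: the raw p-values 0.04 and 0.03 both exceed 0.05 after adjustment.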
Fig 3. Line-specific references strategy.
A. Relative mappability of the customized D2-specific reference (GRCm38 modified with D2-specific indels and SNVs from dbSNP) compared to GRCm38 on parental and F1 samples with exact matches. Samples are all NSD. Colors indicate the genetic background of the samples: B6 (black), D2 (light brown), and F1 (white) between the B6 and D2 strains. The F1 samples are BxD if the mother is B6 and the father is D2 (as for the BXD lines), or DxB for the reverse. B. Differential mapping (DM) analysis of BXD-specific references compared to GRCm38, in the cortex (left) or in the liver (right). Genes are classified as DM genes if the FDR-adjusted p-value < 0.05 (red) or as non-DM genes otherwise (black). C. Relative mappability of BXD-specific references (GRCm38 modified for each BXD line with GeneNetwork genotypes and imputed variants) compared to GRCm38 on BXD samples with exact matches.
Fig 4. Consequences of mapping reference at local eQTL level.
A. Percentage of significant (FDR 5%) local eQTLs over all expressed genes with GRCm38 or BXD-specific references. B. Percentage of skewness of significant (FDR 5%) local eQTL slopes over all expressed genes with GRCm38 or BXD-specific references. C. For all expressed genes, the best local genetic marker to explain gene expression was selected. The Venn diagrams represent the overlap of this analysis between GRCm38 and BXD-specific references for the three criteria in cortex NSD (left) or in liver NSD (right). The marker criterion (in green) indicates that changing the reference results in the same genetic marker being associated with gene expression. The slope (in blue) is the direction and strength of allele-specific gene expression; it is considered to be overlapping between the references if it varies by less than 5%. The qvalue (in pink) is the statistical significance of the marker-to-gene-expression association; it is considered to be overlapping between the references if it varies by less than 5%.
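
The three overlap criteria of panel C could be checked per gene along the following lines. This Python sketch rests on assumptions: the caption does not say how the 5% variation is measured, so it is taken here as the relative change with respect to the GRCm38 value, and the dictionary keys, marker name, and numbers are hypothetical.

```python
from typing import Dict, Union

Result = Dict[str, Union[str, float]]

def varies_less_than(reference: float, custom: float,
                     tolerance: float = 0.05) -> bool:
    """True if `custom` differs from `reference` by less than `tolerance`,
    measured relative to the reference value (assumed definition)."""
    if reference == 0:
        return custom == 0           # avoid division by zero
    return abs(custom - reference) / abs(reference) < tolerance

def overlap_criteria(grcm38: Result, bxd_specific: Result) -> Dict[str, bool]:
    """Compare the best local eQTL of one gene between the two references."""
    return {
        "marker": grcm38["marker"] == bxd_specific["marker"],
        "slope": varies_less_than(grcm38["slope"], bxd_specific["slope"]),
        "qvalue": varies_less_than(grcm38["qvalue"], bxd_specific["qvalue"]),
    }

# Toy example (hypothetical marker name and values).
grcm38_hit = {"marker": "rs1", "slope": 1.00, "qvalue": 0.010}
bxd_hit = {"marker": "rs1", "slope": 1.03, "qvalue": 0.020}
print(overlap_criteria(grcm38_hit, bxd_hit))
```

In the toy example the gene keeps the same marker and a similar slope, but its qvalue doubles, so it would count toward the marker and slope sets of the Venn diagram and not toward the qvalue set.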
Fig 5. Evaluating mapping parameters.
A. The performance on local eQTLs of selected mapping settings on cortex samples (average of the NSD and SD conditions) is measured by the percentage of expressed genes that have a significant local eQTL. The BXD-specific references were used. C. As in A but for liver samples.
