Bioinform Adv. 2025 Apr 28;5(1):vbaf098.
doi: 10.1093/bioadv/vbaf098. eCollection 2025.

bamSliceR: a Bioconductor package for rapid, cross-cohort variant and allelic bias analysis


Yizhou Peter Huang et al. Bioinform Adv. 2025.

Abstract

Motivation: The National Cancer Institute Genomic Data Commons (GDC) provides controlled access to sequencing data from thousands of subjects, enabling large-scale study of impactful genetic alterations such as simple and complex germline and structural variants. However, efficient analysis requires significant computational resources and expertise, especially when calling variants from raw sequence reads. To solve these problems, we developed bamSliceR, an R/Bioconductor package that builds upon the GenomicDataCommons package to extract aligned sequence reads from cross-GDC meta-cohorts, followed by targeted analysis of variants and effects (including transcript-aware variant annotation from transcriptome-aligned GDC RNA data).

Results: Here, we demonstrate population-scale genomic and transcriptomic analyses with minimal compute burden using bamSliceR, identifying recurrent, clinically relevant sequence and structural variants in the TARGET acute myeloid leukemia (AML) and BEAT-AML cohorts. We then validate the results in the (non-GDC) Leucegene cohort, demonstrating how the bamSliceR pipeline can be applied to replicate findings in non-GDC cohorts. These variants directly yield clinically impactful and biologically testable hypotheses for mechanistic investigation.

Availability and implementation: bamSliceR has been submitted to the Bioconductor project, where it is presently under review, and is available on GitHub at https://github.com/trichelab/bamSliceR.

Conflict of interest statement

None declared.

Figures

Figure 1.
Overview of the bamSliceR scheme. (A) This schematic illustrates the bamSliceR pipeline, designed to efficiently retrieve metrics of variants from target regions across various data types, including DNA-seq genomic BAMs, RNA-seq genomic BAMs, and RNA-seq transcriptome BAMs. (B) bamSliceR utility to retrieve metrics of variants from RNA-Seq transcriptome BAMs.
Figure 2.
bamSliceR workflow and functionality for querying, downloading, tallying, and annotating variants from BAM files. Steps (A) and (B) query data from the GDC; step (C) can also be applied to locally stored BAM files. (A) Identify data available via the GDC that suit your query. (B) Download sliced BAM files from the GDC. (C) Tally and annotate variants.
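Step (B) relies on the GDC BAM slicing service. The following is a minimal sketch, not bamSliceR's own code, of how such a request could be assembled in Python. It assumes the public GDC slicing endpoint (`https://api.gdc.cancer.gov/slicing/view/<file UUID>`), repeated `region` query parameters, and an `X-Auth-Token` header for controlled-access data. The function name `build_slice_request` and the placeholder UUID and token are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumed base URL of the GDC BAM slicing service.
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def build_slice_request(file_uuid, regions, token):
    """Return (url, headers) for one BAM slicing request.

    file_uuid: GDC file identifier of the BAM (placeholder in this sketch).
    regions:   strings such as "chr17:44210679-44211356".
    token:     contents of a GDC authentication token (placeholder here).
    """
    if not regions:
        raise ValueError("at least one region is required")
    # One "region=" parameter per target; keep ':' unescaped for readability.
    query = urlencode([("region", r) for r in regions], safe=":")
    url = f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"
    headers = {"X-Auth-Token": token}
    return url, headers


url, headers = build_slice_request(
    "FILE-UUID", ["chr17:44210679-44211356"], "TOKEN"
)
print(url)
```

Keeping request construction separate from the download makes it easy to inspect or batch the URLs for a whole cohort before any controlled-access data are transferred.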
Figure 3.
Scheme to calculate transcript coordinates of genomic features in a GFF3 file.
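As an illustration of the kind of calculation this scheme involves, the sketch below maps a genomic position to a transcript coordinate. It is a generic approach, not necessarily the one bamSliceR uses. It assumes 1-based inclusive exon intervals (as in GFF3) for a single transcript; the function name `genomic_to_transcript` is hypothetical.

```python
def genomic_to_transcript(pos, exons, strand):
    """Map a 1-based genomic position to a 1-based transcript position.

    exons:  (start, end) genomic intervals of one transcript, inclusive.
    strand: "+" or "-".
    Returns None when the position falls outside every exon.
    """
    # Walk exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # transcript bases contributed by exons already passed
    for start, end in ordered:
        if start <= pos <= end:
            within = pos - start if strand == "+" else end - pos
            return offset + within + 1
        offset += end - start + 1
    return None


exons = [(100, 199), (300, 349)]
print(genomic_to_transcript(300, exons, "+"))  # first base of the second exon
print(genomic_to_transcript(349, exons, "-"))  # first transcribed base on "-"
```

On the minus strand the transcript starts at the highest genomic coordinate, so the exon order and the within-exon offset are both reversed.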
Figure 4.
Schematic for determining equivalence classes of transcripts for INDEL variants. To identify equivalence classes of transcripts for INDEL variants, the process begins by disjoining the exon regions of each gene in the GFF3 file based on their genomic coordinates. As a result, a single exon may be split into multiple disjoint bins. In the provided example, the first exon generates three bins: Bin1 [T1, T2], Bin2 [T1, T2, T3], and Bin3 [T2, T3]. For each bin, the genomic and transcript-level coordinates of both the bin and the corresponding exon are tabulated (e.g. see the detailed example for Bin2). When analyzing an INDEL, all bins overlapping the variant's genomic coordinates are identified. Finally, the transcripts containing the INDEL are determined by intersecting the sets of transcripts associated with the overlapping bins.
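The disjoin-then-intersect procedure described in this caption can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not bamSliceR's code: coordinates are 1-based and inclusive, the exon coordinates are invented so that they reproduce the three bins named in the caption, the transcript-level coordinate table is omitted, and the function names are hypothetical.

```python
def disjoin_exons(exons):
    """Split overlapping exons into disjoint bins.

    exons: (start, end, transcript_id) tuples, inclusive coordinates.
    Returns (start, end, frozenset_of_transcripts) tuples in genomic order.
    """
    # Every exon start, and every position just past an exon end, is a cut.
    cuts = sorted({s for s, _, _ in exons} | {e + 1 for _, e, _ in exons})
    bins = []
    for lo, nxt in zip(cuts, cuts[1:]):
        hi = nxt - 1
        members = frozenset(t for s, e, t in exons if s <= lo and hi <= e)
        if members:  # skip gaps that no exon covers
            bins.append((lo, hi, members))
    return bins


def transcripts_containing(var_start, var_end, bins):
    """Intersect the transcript sets of all bins the variant overlaps."""
    hit = [tx for s, e, tx in bins if s <= var_end and var_start <= e]
    return frozenset.intersection(*hit) if hit else frozenset()


# Invented exons chosen to give Bin1 [T1, T2], Bin2 [T1, T2, T3], Bin3 [T2, T3].
exons = [(100, 200, "T1"), (100, 300, "T2"), (150, 300, "T3")]
bins = disjoin_exons(exons)
for b in bins:
    print(b[0], b[1], sorted(b[2]))
print(sorted(transcripts_containing(190, 210, bins)))
```

An INDEL spanning 190-210 overlaps Bin2 and Bin3, so only the transcripts present in both bins (T2 and T3) contain the whole variant.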
Figure 5.
Detection of UBTF-ITD in the Leucegene project. Top: Plot of soft-clipped read counts (sc-counts) within chr17:44210679-44211356 (GRCh38) for 452 patients in the Leucegene project. Two patients (04H039 and 07H019) show high soft-clipped read counts at hotspots of UBTF-ITD. 04H039 has 57, 190, and 67 sc-counts at chr17:44210812, chr17:44210816, and chr17:44210853, respectively (highlighted as blue dots). 07H019 has 161, 65, 66, and 168 sc-counts at chr17:44210803, chr17:44210806, chr17:44210841, and chr17:44210849, respectively (highlighted as yellow dots). Manual examination of the RNA-seq BAM files of 18 patients with a maximum sc-count >25 within the range shows no UBTF-ITD. Bottom: Integrative Genomics Viewer (IGV) visualization showing UBTF-ITD with soft-clipped reads and increased coverage in UBTF exon 13 in patients 04H039 and 07H019.
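A per-position soft-clip tally of the kind plotted here can be approximated from read start positions and CIGAR strings alone. The sketch below is a simplified stand-in, not bamSliceR's implementation. It assumes that a soft clip is attributed to the aligned base adjacent to the clip, which may differ from how the package assigns positions. The threshold of 25 is the one the caption uses to select patients for manual review; the function names and the toy reads are hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def softclip_sites(pos, cigar):
    """Reference positions (1-based) flanking the soft clips of one read."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips sit outside S
    if not ops:
        return []
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    sites = []
    if ops[0][1] == "S":
        sites.append(pos)  # first aligned base
    if len(ops) > 1 and ops[-1][1] == "S":
        sites.append(pos + ref_len - 1)  # last aligned base
    return sites


def tally_softclips(reads, start, end):
    """Count soft-clip sites per position within [start, end]."""
    counts = Counter()
    for pos, cigar in reads:
        for site in softclip_sites(pos, cigar):
            if start <= site <= end:
                counts[site] += 1
    return counts


def flag_samples(sample_counts, threshold=25):
    """Samples whose maximum per-position count exceeds the threshold."""
    return sorted(
        s for s, c in sample_counts.items() if c and max(c.values()) > threshold
    )


# Toy reads, not real data: (leftmost aligned position, CIGAR).
reads = [(100, "10S90M")] * 3 + [(100, "90M10S"), (100, "100M")]
print(tally_softclips(reads, 90, 150))
```

A high count at one position only marks a candidate breakpoint; as the caption notes, most samples above the threshold did not carry UBTF-ITD on manual review.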
Figure 6.
Summary of key features of tools for sequencing-based variant analysis. ProteinPaint is an interactive web-based tool designed by St. Jude (https://proteinpaint.stjude.org/) for visualizing validated genetic variants across databases. It uses the BAM slicing API from GDC to visualize sequencing reads in defined genomic regions. Km is an alignment-free targeted variant detection tool that rapidly identifies known mutations from sequencing data using k-mer matching. HaplotypeCaller from GATK is a genome-wide variant caller that uses local de novo assembly to identify SNPs and indels with high accuracy. bamSliceR integrates the BAM slicing API with downstream variant pileup and annotation functions in the R environment. It aims to provide a relatively straightforward and efficient workflow for sequence-read-based variant detection in defined genomic regions across databases. More importantly, bamSliceR offers flexibility to users who want to apply more rigorous variant callers to sliced BAM files.
