This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Nov 27:2023.09.15.558026.

doi: 10.1101/2023.09.15.558026.

bamSliceR: a Bioconductor package for rapid, cross-cohort variant and allelic bias analysis

Yizhou Peter Huang¹,², Lauren Harmon², Eve Deering-Gardner², Xiaotu Ma³, Josiah Harsh², Zhaoyu Xue², Hong Wen², Marcel Ramos⁴, Sean Davis⁵, Timothy J Triche Jr²,¹

Affiliations

¹ Michigan State University, East Lansing, MI, US.
² Van Andel Institute, Grand Rapids, MI, US.
³ St. Jude Children's Research Hospital, Memphis, TN, US.
⁴ Roswell Park Cancer Institute, Buffalo, NY, US.
⁵ University of Colorado Anschutz Medical Campus, Aurora, CO, US.

PMID: 37745420
PMCID: PMC10516001
DOI: 10.1101/2023.09.15.558026

Yizhou Peter Huang et al. bioRxiv. 2024.

Update in

bamSliceR: a Bioconductor package for rapid, cross-cohort variant and allelic bias analysis.
Huang YP, Harmon L, Deering-Gardner E, Ma X, Harsh J, Xue Z, Wen H, Ramos M, Davis S, Triche TJ Jr. Bioinform Adv. 2025 Apr 28;5(1):vbaf098. doi: 10.1093/bioadv/vbaf098. eCollection 2025. PMID: 40395503.

Abstract

The NCI Genomic Data Commons (GDC) provides controlled access to sequencing data from thousands of subjects, enabling large-scale study of impactful genetic alterations such as simple and complex germline and structural variants. However, efficient analysis requires significant computational resources and expertise, especially when recalling variants from raw sequence reads. We thus developed bamSliceR, an R/Bioconductor package that builds upon the GenomicDataCommons package to extract aligned sequence reads from cross-GDC meta-cohorts, followed by targeted analysis of variants and effects (including transcript-aware variant annotation from transcriptome-aligned GDC RNA data). Here we demonstrate population-scale genomic and transcriptomic analyses with minimal compute burden via bamSliceR, identifying recurrent, clinically relevant sequence and structural variants in the TARGET AML and BEAT-AML cohorts. We then validate results in the (non-GDC) Leucegene cohort, demonstrating how the bamSliceR pipeline can be seamlessly applied to replicate findings in non-GDC cohorts. These variants directly yield clinically impactful and biologically testable hypotheses for mechanistic investigation. bamSliceR has been submitted to the Bioconductor project, where it is presently under review, and is available on GitHub at https://github.com/trichelab/bamSliceR.
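To make the slicing idea concrete, the following is a minimal Python sketch of how a request for a region-restricted BAM could be assembled. It is not bamSliceR's own code (bamSliceR is an R package built on GenomicDataCommons). The endpoint path and the repeated `region` query parameter are assumed from the public GDC API documentation, and the helper name `build_slice_url` and the file UUID shown are hypothetical placeholders. Controlled-access files would additionally need an authentication token, which is not shown here.

```python
from urllib.parse import urlencode

# Assumption: BAM slicing endpoint as described in the public GDC API docs.
GDC_SLICING_ENDPOINT = "https://api.gdc.cancer.gov/slicing/view"


def build_slice_url(file_uuid, regions):
    """Build a slicing URL for one BAM file and a list of regions.

    regions: strings such as "chr17:44210679-44211356" or "chr2".
    Hypothetical helper for illustration; it only builds the URL and
    performs no network request.
    """
    if not regions:
        raise ValueError("at least one region is required")
    # One 'region' parameter per requested interval; keep ':' readable.
    query = urlencode([("region", r) for r in regions], safe=":")
    return f"{GDC_SLICING_ENDPOINT}/{file_uuid}?{query}"


# Placeholder UUID, not a real GDC file identifier.
print(build_slice_url("FILE-UUID-PLACEHOLDER", ["chr17:44210679-44211356"]))
```

Requesting only the listed intervals is what keeps downloads and storage small compared with retrieving whole BAM files.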

Conflict of interest statement

Conflict of interest: None declared.

Figures

**Figure 1.**
**A) Overview of the *bamSliceR* scheme.** This schematic illustrates the bamSliceR pipeline, designed to efficiently retrieve metrics of variants from target regions across various data types, including DNA-Seq genomic BAMs, RNA-Seq genomic BAMs, and RNA-Seq transcriptome BAMs. **B) *bamSliceR* utility to retrieve metrics of variants from RNA-Seq transcriptome BAMs.**

**Figure 2.**
*bamSliceR* Workflow and Functionality for querying, downloading, tallying and annotating variants from BAM files. A) Identify Data Available via GDC Suitable for your Query. B) Download Sliced-BAM files from GDC. C) Variants Tallying & Annotation.

**Figure 3. Scheme to Calculate Transcript Coordinates of Genomic Features in GFF3 file.**

**Figure 4. Schematic for Determining Equivalence Classes of Transcripts for INDEL variants.**
To identify equivalence classes of transcripts for INDEL variants, the process begins by disjoining exon regions for each gene in the GFF3 file based on their genomic coordinates. For instance, a single exon may be split into multiple disjoint bins. In the provided example, the first exon generates three bins: Bin1 [T1, T2], Bin2 [T1, T2, T3], and Bin3 [T2, T3]. For each bin, the genomic and transcript-level coordinates of both the bin and the corresponding exon are tabulated (e.g., see the detailed example for Bin2). When analyzing an INDEL, all bins overlapping the variant's genomic coordinates are identified. Finally, the transcripts containing the INDEL are determined by intersecting the sets of transcripts associated with the overlapping bins.
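The disjoin-and-intersect procedure in this caption can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the package's R implementation. The function names, the toy exon coordinates, and the use of 1-based inclusive intervals are all assumptions, and the sketch handles a single gene and omits the transcript-level coordinates that the caption also tabulates.

```python
def disjoin_exons(exons):
    """Split overlapping exons of one gene into disjoint bins.

    exons: list of (start, end, transcript_id), 1-based inclusive
    (assumed convention). Returns a list of
    (start, end, frozenset_of_transcripts), sorted by start.
    """
    # Breakpoints: every exon start, and the position just past every exon end.
    points = sorted({s for s, _, _ in exons} | {e + 1 for _, e, _ in exons})
    bins = []
    for lo, nxt in zip(points, points[1:]):
        hi = nxt - 1
        # Transcripts whose exon fully covers this bin.
        covering = frozenset(t for s, e, t in exons if s <= lo and hi <= e)
        if covering:  # skip intronic gaps between exons
            bins.append((lo, hi, covering))
    return bins


def transcripts_containing(bins, var_start, var_end):
    """Intersect the transcript sets of all bins overlapping the variant."""
    overlapping = [tx for s, e, tx in bins if s <= var_end and var_start <= e]
    if not overlapping:
        return frozenset()
    return frozenset.intersection(*overlapping)


# Toy exons (invented coordinates) that reproduce the caption's three bins.
exons = [(100, 200, "T1"), (100, 300, "T2"), (150, 300, "T3")]
bins = disjoin_exons(exons)
for start, end, txs in bins:
    print(start, end, sorted(txs))

# A hypothetical INDEL spanning positions 195-205 overlaps Bin2 and Bin3.
print(sorted(transcripts_containing(bins, 195, 205)))
```

In this toy case the INDEL crosses the end of the T1 exon, so intersecting the transcript sets of Bin2 and Bin3 leaves only T2 and T3, which is the behaviour the caption describes.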

**Figure 5. bamSliceR is designed to significantly reduce the space required for analyzing BAM files by slicing them at specified genomic regions**

**Figure 6. Detection of UBTF-ITD in the Leucegene Project**
Top) Plot of soft-clip read counts (sc-counts) within chr17:44210679–44211356 (GRCh38) for 452 patients in the Leucegene Project. Two patients (04H039 and 07H019) show high soft-clipped read counts at hotspots of UBTF-ITD. 04H039 has 57, 190, and 67 sc-counts at chr17:44210812, chr17:44210816, and chr17:44210853, respectively (highlighted as blue dots). 07H019 has 161, 65, 66, and 168 sc-counts at chr17:44210803, chr17:44210806, chr17:44210841, and chr17:44210849, respectively (highlighted as yellow dots). Manual examination of RNA-seq BAM files from 18 patients with a maximum sc-count > 25 within this range showed no UBTF-ITD. Bottom) Integrative Genomics Viewer (IGV) visualization showing UBTF-ITD with soft-clipped reads and increased coverage in UBTF exon 13 in patients 04H039 and 07H019.
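A per-position soft-clip tally of the kind plotted in the top panel could be computed along the following lines. This Python sketch is only an illustration under stated assumptions: reads are supplied as (leftmost aligned position, CIGAR string) pairs, a clip is attributed to the aligned base adjacent to it, and the function names and toy reads are invented. The convention bamSliceR itself uses for assigning soft-clip positions may differ.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def softclip_sites(pos, cigar):
    """Return 1-based reference positions adjacent to soft-clipped read ends.

    pos: leftmost aligned reference position of the read (1-based).
    Assumed convention: a leading clip is attributed to the first aligned
    base, a trailing clip to the last aligned base.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # hard clips sit outside soft clips
    sites = []
    if ops and ops[0][1] == "S":
        sites.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        sites.append(pos + ref_len - 1)
    return sites


def tally_softclips(reads, start, end):
    """Count soft-clip sites per position within [start, end]."""
    counts = Counter()
    for pos, cigar in reads:
        for site in softclip_sites(pos, cigar):
            if start <= site <= end:
                counts[site] += 1
    return counts


# Toy reads with invented coordinates, not taken from any cohort.
reads = [(1200, "20S80M"), (1200, "30S70M"), (1150, "60M5D20M20S"), (5000, "10S90M")]
print(tally_softclips(reads, 1000, 2000))
```

Positions with unusually high tallies would then be candidates for manual inspection, as described for the two highlighted patients.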
