[Preprint]. 2024 Nov 27:2023.09.15.558026.
doi: 10.1101/2023.09.15.558026.

bamSliceR: a Bioconductor package for rapid, cross-cohort variant and allelic bias analysis

Yizhou Peter Huang et al. bioRxiv.

Abstract

The NCI Genomic Data Commons (GDC) provides controlled access to sequencing data from thousands of subjects, enabling large-scale study of impactful genetic alterations such as simple and complex germline and structural variants. However, efficient analysis requires significant computational resources and expertise, especially when recalling variants from raw sequence reads. We thus developed bamSliceR, an R/Bioconductor package that builds upon the GenomicDataCommons package to extract aligned sequence reads from cross-GDC meta-cohorts, followed by targeted analysis of variants and effects (including transcript-aware variant annotation from transcriptome-aligned GDC RNA data). Here we demonstrate population-scale genomic and transcriptomic analyses with minimal compute burden via bamSliceR, identifying recurrent, clinically relevant sequence and structural variants in the TARGET AML and BEAT-AML cohorts. We then validate results in the (non-GDC) Leucegene cohort, demonstrating how the bamSliceR pipeline can be seamlessly applied to replicate findings in non-GDC cohorts. These variants directly yield clinically impactful and biologically testable hypotheses for mechanistic investigation. bamSliceR has been submitted to the Bioconductor project, where it is presently under review, and is available on GitHub at https://github.com/trichelab/bamSliceR.

Conflict of interest statement

None declared.

Figures

Figure 1. Overview of the bamSliceR Scheme.
A) Schematic of the bamSliceR pipeline, designed to efficiently retrieve variant metrics from target regions across various data types, including DNA-Seq genomic BAMs, RNA-Seq genomic BAMs, and RNA-Seq transcriptome BAMs. B) bamSliceR utility to retrieve variant metrics from RNA-Seq transcriptome BAMs.
Figure 2. bamSliceR Workflow and Functionality for querying, downloading, tallying and annotating variants from BAM files.
A) Identify Data Available via GDC Suitable for your Query. B) Download Sliced-BAM files from GDC. C) Variant Tallying & Annotation.
Figure 3. Scheme to Calculate Transcript Coordinates of Genomic Features in a GFF3 file.
Figure 4. Schematic for Determining Equivalence Classes of Transcripts for INDEL variants.
To identify equivalence classes of transcripts for INDEL variants, the process begins by disjoining the exon regions of each gene in the GFF3 file based on their genomic coordinates, so that a single exon may be split into multiple disjoint bins. In the provided example, the first exon generates three bins: Bin1 [T1, T2], Bin2 [T1, T2, T3], and Bin3 [T2, T3]. For each bin, the genomic and transcript-level coordinates of both the bin and the corresponding exon are tabulated (e.g., see the detailed example for Bin2). When analyzing an INDEL, all bins overlapping the variant's genomic coordinates are identified. Finally, the transcripts containing the INDEL are determined by intersecting the sets of transcripts associated with the overlapping bins.
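The bin-and-intersect procedure described in this caption can be illustrated with a small sketch. This is not bamSliceR code (the package is written in R); it is a minimal Python illustration under stated assumptions: coordinates are 1-based and inclusive, each transcript contributes a list of exon intervals, and the function names `disjoin_exons` and `transcripts_with_indel` as well as the coordinates are hypothetical, chosen only to reproduce the three-bin example above.

```python
from typing import Dict, FrozenSet, List, Tuple

Interval = Tuple[int, int]            # 1-based, inclusive (assumption)
Bin = Tuple[int, int, FrozenSet[str]]  # (start, end, transcripts covering the bin)


def disjoin_exons(exons: Dict[str, List[Interval]]) -> List[Bin]:
    """Split overlapping exons into disjoint bins, each tagged with its transcripts."""
    # Every exon start and every (exon end + 1) is a potential bin boundary.
    cuts = sorted({p for ivs in exons.values() for s, e in ivs for p in (s, e + 1)})
    bins: List[Bin] = []
    for lo, hi in zip(cuts, cuts[1:]):
        start, end = lo, hi - 1
        # Because all exon boundaries are cut points, an exon either fully
        # contains the bin or does not touch it at all.
        members = frozenset(
            tx for tx, ivs in exons.items()
            if any(s <= start and end <= e for s, e in ivs)
        )
        if members:  # skip intronic/intergenic gaps
            bins.append((start, end, members))
    return bins


def transcripts_with_indel(bins: List[Bin], start: int, end: int) -> FrozenSet[str]:
    """Intersect the transcript sets of all bins overlapping the INDEL span."""
    hits = [members for s, e, members in bins if s <= end and start <= e]
    if not hits:
        return frozenset()
    return frozenset.intersection(*hits)


# Hypothetical coordinates giving Bin1 [T1, T2], Bin2 [T1, T2, T3], Bin3 [T2, T3].
example = {"T1": [(100, 200)], "T2": [(100, 300)], "T3": [(150, 300)]}
example_bins = disjoin_exons(example)
# An INDEL spanning 195-205 overlaps Bin2 and Bin3, so only T2 and T3 contain it.
print(sorted(transcripts_with_indel(example_bins, 195, 205)))
```

Intersecting rather than uniting the transcript sets is what keeps T1 out of the result here: T1 covers only part of the INDEL span, so it cannot carry the full variant.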
Figure 5. bamSliceR is designed to significantly reduce the space required for analyzing BAM files by slicing them at specified genomic regions.
Figure 6. Detection of UBTF-ITD in the Leucegene Project.
Top) Plot of soft-clipped read counts (sc-counts) within chr17:44210679–44211356 (GRCh38) for 452 patients in the Leucegene Project. Two patients (04H039 and 07H019) show high soft-clipped read counts at hotspots of UBTF-ITD. 04H039 has 57, 190, and 67 sc-counts at chr17:44210812, chr17:44210816, and chr17:44210853, respectively (highlighted as blue dots). 07H019 has 161, 65, 66, and 168 sc-counts at chr17:44210803, chr17:44210806, chr17:44210841, and chr17:44210849, respectively (highlighted as yellow dots). Manual examination of the RNA-seq BAM files of 18 patients with a maximum sc-count > 25 within this range showed no UBTF-ITD. Bottom) Integrative Genomics Viewer (IGV) visualization showing UBTF-ITD with soft-clipped reads and increased coverage in UBTF exon 13 in patients 04H039 and 07H019.
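As a rough illustration of how per-position soft-clip counts like those plotted above could be derived, the sketch below tallies soft-clip breakpoints from alignment start positions and CIGAR strings. It is not the bamSliceR implementation: the helper names are hypothetical, the reads are synthetic, and the convention that a leading soft clip is assigned to the first aligned base and a trailing one to the last aligned base is an assumption made for this example.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def soft_clip_positions(pos: int, cigar: str) -> List[int]:
    """Return reference positions (1-based) adjacent to soft-clipped read ends."""
    # Hard clips (H) can flank soft clips, so drop them before inspecting the ends.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    out: List[int] = []
    if ops and ops[0][1] == "S":
        out.append(pos)  # leading clip: first aligned base (assumed convention)
    if len(ops) > 1 and ops[-1][1] == "S":
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        out.append(pos + ref_len - 1)  # trailing clip: last aligned base
    return out


def tally_soft_clips(reads: Iterable[Tuple[int, str]], start: int, end: int) -> Counter:
    """Count soft-clip breakpoints per position within [start, end]."""
    counts: Counter = Counter()
    for pos, cigar in reads:
        for p in soft_clip_positions(pos, cigar):
            if start <= p <= end:
                counts[p] += 1
    return counts


# Synthetic reads as (leftmost aligned position, CIGAR); not real patient data.
reads = [(100, "20S80M"), (100, "80M20S"), (100, "100M"), (21, "80M20S")]
sc_counts = tally_soft_clips(reads, 90, 200)
# Flag a sample for manual review when its maximum sc-count exceeds a threshold.
needs_review = max(sc_counts.values(), default=0) > 25
```

With real data, the `(pos, cigar)` pairs would come from a BAM reader restricted to the region of interest, and the threshold of 25 mirrors the cutoff mentioned in the caption.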

