Bioinformatics. 2024 Dec 26;41(1):btae751.
doi: 10.1093/bioinformatics/btae751.

Rnalib: a Python library for custom transcriptomics analyses


Niko Popitsch et al. Bioinformatics.

Abstract

Motivation: The efficient and reproducible analysis of high-throughput sequencing datasets necessitates the development of methodical and robust computational pipelines that integrate established and bespoke bioinformatics analysis tools, often written in high-level programming languages such as Python. Despite the increasing availability of programming libraries for genomics, there is a noticeable lack of tools specifically focused on transcriptomics. Key tasks in this area include the association of gene features (e.g. transcript isoforms, introns or untranslated regions) with relevant subsections of (large) genomics datasets across diverse data formats, as well as efficient querying of these data based on genomic locations and annotation attributes.

Results: To address the needs of transcriptomics data analyses, we developed rnalib, a Python library designed for creating custom bioinformatics analysis methods. Built on existing Python libraries such as pysam and pyBigWig, rnalib offers random access support, enabling efficient access to relevant subregions of large, genome-wide datasets. Rnalib extends the filtering and access capabilities of these libraries and includes additional checks to prevent common errors when integrating genomics datasets. The library is centred on an object-oriented Transcriptome class that provides methods for stepwise annotation of gene features with both local and remote data sources. The rnalib Application Programming Interface cleanly separates immutable genomic locations from associated, mutable data, and offers a wide range of methods for iterating, querying, and exporting collated datasets. Rnalib establishes a fast, readable, reproducible, and robust framework for developing novel transcriptomics data analysis tools and methods.
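
The separation of immutable genomic locations from mutable data described above can be illustrated with a minimal Python sketch. It does not use rnalib and does not reproduce its API: the `Interval` class, its field names, and the `annotations` dict are hypothetical names chosen for this illustration, and the coordinates and values are made up.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Interval:
    """Hypothetical immutable genomic interval (1-based, inclusive).

    This is an illustration only, not rnalib's own location class.
    """
    chromosome: str
    start: int
    end: int

    def overlaps(self, other: "Interval") -> bool:
        # Two intervals overlap if they share a chromosome and
        # neither one ends before the other starts.
        return (
            self.chromosome == other.chromosome
            and self.start <= other.end
            and other.start <= self.end
        )


# Mutable data live in a separate dict keyed by the (hashable) interval.
annotations = {}

gene = Interval("chr1", 1000, 2000)
annotations[gene] = {"name": "geneA"}
annotations[gene]["read_count"] = 42  # the data can change freely

try:
    gene.start = 1  # the location itself cannot be modified
except FrozenInstanceError:
    location_is_immutable = True
```

Because a frozen dataclass is hashable, an equal interval created elsewhere retrieves the same annotation record, which is one reason to keep locations immutable and store the changing data next to them rather than inside them.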

Availability and implementation: Source code, documentation, and tutorials are available at https://github.com/popitsch/rnalib.


Figures

Figure 1.
A typical application scenario for rnalib: (1) A Transcriptome object is instantiated from a filtered GFF3 gene annotation file. The hierarchical relationships between genes, transcripts, and sub-features (exons, introns, CDS, etc.) are explicitly modelled as Python objects. (2) Sequences for the instantiated gene intervals are loaded from a reference genome FASTA file and stored in an annotation dict. Sequences of sub-features are sliced from there on request. Thereby, rnalib avoids storing redundant information (i.e. explicit sub-feature sequences). The same mechanism can also be used for sliceable custom annotations. (3) Transcriptome features are sequentially annotated with genomics data using rnalib iterators that support a wide range of common data formats as well as any tabix-indexed data file. Gene annotations can be queried directly from MyGene.info, which collates up-to-date annotation data from many large public databases including Ensembl, UniProt, or PharmGKB. (4) Users can conveniently access the data, e.g. via interactive Jupyter notebooks. Efficient querying via standard Python list comprehensions or (location-based) intervaltree queries is supported. (5) Finally, data can be exported in various formats for downstream processing. This includes BED/GFF3 files but also pandas DataFrames with customisable columns that can then be further processed, e.g. plotted with matplotlib/seaborn or analysed with bioframe.
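
Steps (2) and (4) of the caption can be sketched in plain Python. The sketch below is an assumption-laden illustration, not rnalib code: the `Feature` class, the `slice_sequence` helper, the coordinates, and the sequence are all hypothetical, and only forward-strand features are handled. It shows how a sub-feature sequence can be sliced from the stored gene sequence instead of being stored separately, followed by a list-comprehension query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical feature with 1-based, inclusive coordinates."""
    feature_id: str
    feature_type: str
    start: int
    end: int


def slice_sequence(gene: Feature, gene_sequence: str, sub: Feature) -> str:
    """Return the sequence of `sub` by slicing the parent gene sequence.

    Assumes a forward-strand feature that lies fully inside the gene.
    """
    if sub.start < gene.start or sub.end > gene.end:
        raise ValueError(f"{sub.feature_id} is not contained in {gene.feature_id}")
    offset = sub.start - gene.start      # 0-based offset into the gene sequence
    length = sub.end - sub.start + 1     # inclusive coordinates
    return gene_sequence[offset:offset + length]


# Made-up example data: one gene with a 20 nt sequence and two exons.
gene = Feature("gene1", "gene", 101, 120)
gene_sequence = "ACGTACGTACGTACGTACGT"
exons = [
    Feature("exon1", "exon", 101, 104),
    Feature("exon2", "exon", 113, 120),
]

# Only the gene sequence is stored; exon sequences are derived on request.
exon_sequences = {e.feature_id: slice_sequence(gene, gene_sequence, e) for e in exons}

# Attribute-based query with a standard list comprehension.
long_exons = [e.feature_id for e in exons if e.end - e.start + 1 > 4]
```

Storing one sequence per gene and deriving the rest by offset arithmetic keeps memory use proportional to the number of genes rather than the number of sub-features, which is the redundancy the caption refers to.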


