Review
Front Genet. 2023 Mar 13:14:997383.
doi: 10.3389/fgene.2023.997383. eCollection 2023.

RNA-seq data science: From raw data to effective interpretation

Dhrithi Deshpande et al. Front Genet. 2023.

Abstract

RNA sequencing (RNA-seq) has become an exemplary technology in modern biology and clinical science. Its immense popularity is due in large part to the continuous efforts of the bioinformatics community to develop accurate and scalable computational tools to analyze the enormous amounts of transcriptomic data that it produces. RNA-seq analysis enables genes and their corresponding transcripts to be probed for a variety of purposes, such as detecting novel exons or whole transcripts, assessing expression of genes and alternative transcripts, and studying alternative splicing structure. It can be a challenge, however, to obtain meaningful biological signals from raw RNA-seq data because of the enormous scale of the data as well as the inherent limitations of different sequencing technologies, such as amplification bias or biases of library preparation. The need to overcome these technical challenges has pushed the rapid development of novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. These tools, combined with the diverse computational skill sets of biomedical researchers, help to unlock the full potential of RNA-seq. The purpose of this review is to explain basic concepts in the computational analysis of RNA-seq data and define discipline-specific jargon.

Keywords: RNA sequencing; bioinformatics; differential gene expression; high throughput sequencing; read alignment; transcriptome quantification.

Conflict of interest statement

ET was employed by the company Pacific Biosciences (United States). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Overview of RNA-seq. RNA-seq is a process of creating short sequencing reads from RNA molecules. The steps consist of first converting the RNA (A) into cDNA (B), then (optionally) amplifying the cDNA by PCR (C), and finally fragmenting the cDNA into short pieces (known as fragments). After the sequencing library (D) is prepared, the fragments are used as input for next-generation sequencing (E). The resulting sequence reads contained in FASTQ files are then aligned to a reference sequence (F). Modern high-throughput sequencing machines can generate up to 150 million reads per run. The reference sequence, shown as a pink line, is known. The goal of the alignment is to find the locus in the reference sequence with the greatest match to each read. Reads are shown to align to the specific positions/locations and these mapped locations are recorded.
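
The alignment goal stated in the caption (finding the reference locus with the greatest match to each read) can be illustrated with a deliberately naive sketch. It assumes ungapped placement and scores each candidate position by mismatch count. The function name `best_alignment` and the toy sequences are hypothetical. Production aligners rely on indexing and gapped scoring rather than this exhaustive scan.

```python
def best_alignment(read: str, reference: str) -> tuple[int, int]:
    """Return (position, mismatches) for the best ungapped placement of `read`.

    Assumption: no insertions or deletions; ties go to the leftmost position.
    """
    if not read or len(read) > len(reference):
        raise ValueError("read must be non-empty and no longer than the reference")

    best_pos, best_mismatches = -1, len(read) + 1
    # Slide the read along the reference and count mismatches in each window.
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


if __name__ == "__main__":
    reference = "ACGTTGCAAGCT"  # toy reference sequence (hypothetical)
    print(best_alignment("TGCA", reference))  # exact match
    print(best_alignment("TGGA", reference))  # one mismatch tolerated
```

The exhaustive scan costs time proportional to the reference length times the read length for every read, which is one reason it would not scale to the read volumes mentioned in the caption.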
FIGURE 2
Alternative splicing and RNA-seq technologies. The flow of genetic information begins with DNA, in which genes consist of introns and exons. DNA is transcribed into pre-mRNA and then further processed into mature mRNA by splicing out the introns and joining the exons together. The mRNA is then translated into a protein. Transcripts with different arrangements of exons can be formed in a process called alternative splicing, of which exon skipping is one form. An RNA-seq read is a short sequence sampled from a transcript. Reads are generated using sequencing technologies such as (A) the Illumina platform, which produces short reads, and (B) the Nanopore and PacBio platforms, which produce long reads. The figure depicts two scenarios in which uniquely mapped reads are aligned to a reference transcriptome (C) and a reference genome (D), respectively. A few of the reads are multicolored, indicating that when aligned, they span an exon-exon junction. Some of the shorter reads (single-colored) align to a single exon only and do not span a junction. TSS, transcription start site; TES, transcription end site.
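
The relationship between the transcriptome alignment (C) and the genome alignment (D) can be sketched as a coordinate conversion. The example below assumes a forward-strand transcript described by half-open genomic exon intervals listed in transcript order. The function name `transcript_to_genome_blocks` and the exon coordinates are hypothetical. A read that converts to more than one genomic block spans an exon-exon junction.

```python
def transcript_to_genome_blocks(exons, t_start, length):
    """Map a read placed on a spliced transcript to its genomic blocks.

    exons   : list of (genome_start, genome_end) half-open intervals,
              in transcript order (assumption: forward strand).
    t_start : 0-based start of the read in transcript coordinates.
    length  : read length in bases.
    """
    blocks = []
    offset = 0  # transcript coordinate of the current exon's first base
    t_end = t_start + length
    for g_start, g_end in exons:
        exon_len = g_end - g_start
        # Overlap between the read and this exon, in transcript coordinates.
        lo = max(t_start, offset)
        hi = min(t_end, offset + exon_len)
        if lo < hi:
            blocks.append((g_start + lo - offset, g_start + hi - offset))
        offset += exon_len
    if sum(end - start for start, end in blocks) != length:
        raise ValueError("read does not fit inside the transcript")
    return blocks


if __name__ == "__main__":
    exons = [(100, 150), (300, 340), (500, 560)]  # toy gene model (hypothetical)
    for t_start in (10, 40):
        blocks = transcript_to_genome_blocks(exons, t_start, 20)
        print(t_start, blocks, "spans junction:", len(blocks) > 1)
```

In the toy model, the read starting at transcript position 40 covers the last 10 bases of the first exon and the first 10 bases of the second, so on the genome it is split across the intron between them.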
FIGURE 3
Measuring allele-specific expression with RNA-seq. RNA-seq can be used to generate allele-specific expression (ASE) data for genes with a heterozygous single-nucleotide polymorphism in the transcribed region (aseSNP). The aseSNP enables sequencing reads to be mapped to the haplotype from which they originate. Imbalance in ASE data is a functional indicator of a cis-regulatory difference between the two haplotypes that is driven by heterozygous regulatory variants. Data from multiple aseSNPs can be aggregated to improve ASE data quality. The non-coding regulatory variant depicted here has two alleles inducing higher (H) and lower (L) expression of the target gene.
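
One simple way to quantify the imbalance described in the caption is an exact two-sided binomial test against a 50:50 expectation, applied to read counts summed over phased aseSNPs. This is a sketch under stated assumptions, not the method used by the review's authors: the function names, the counts, and the choice of a plain binomial model are all illustrative. A plain binomial model ignores overdispersion and reference mapping bias.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def aggregate_ase(snp_counts):
    """Sum (haplotype_1_reads, haplotype_2_reads) over aseSNPs of one gene.

    Assumption: counts are already phased, so the first value in every
    pair refers to the same haplotype.
    """
    hap1 = sum(a for a, _ in snp_counts)
    hap2 = sum(b for _, b in snp_counts)
    return hap1, hap2


if __name__ == "__main__":
    # Hypothetical read counts at three aseSNPs in one gene.
    counts = [(30, 12), (25, 10), (18, 9)]
    hap1, hap2 = aggregate_ase(counts)
    p_value = binomial_two_sided_p(hap1, hap1 + hap2)
    print(f"haplotype 1: {hap1}, haplotype 2: {hap2}, p = {p_value:.3g}")
```

Aggregating over several aseSNPs increases the total read count, which is the sense in which the caption says data quality improves.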
