mSystems. 2021 Jan 12;6(1):e00917-20.
doi: 10.1128/mSystems.00917-20.

FADU: a Quantification Tool for Prokaryotic Transcriptomic Analyses

Matthew Chung et al. mSystems.

Abstract

Quantification tools for RNA sequencing (RNA-Seq) analyses are often designed and tested using human transcriptomics data sets, in which full-length transcript sequences are well annotated. For prokaryotic transcriptomics experiments, full-length transcript sequences are seldom known, and coding sequences must instead be used for quantification steps in RNA-Seq analyses. However, operons confound accurate quantification of coding sequences since a single transcript does not necessarily equate to a single gene. Here, we introduce FADU (Feature Aggregate Depth Utility), a quantification tool designed specifically for prokaryotic RNA-Seq analyses. FADU assigns partial count values proportional to the length of the fragment overlapping the target feature. To assess the ability of FADU to quantify genes in prokaryotic transcriptomics analyses, we compared its performance to those of eXpress, featureCounts, HTSeq, kallisto, and Salmon across three paired-end read data sets of (i) Ehrlichia chaffeensis, (ii) Escherichia coli, and (iii) the Wolbachia endosymbiont wBm. Across each of the three data sets, we find that FADU can more accurately quantify operonic genes by deriving proportional counts for multigene fragments within operons. FADU is available at https://github.com/IGS/FADU.

Importance

Most currently available quantification tools for transcriptomics analyses have been designed for human data sets, in which full-length transcript sequences, including the untranslated regions, are well annotated. In most prokaryotic systems, full-length transcript sequences have yet to be characterized, leading to prokaryotic transcriptomics analyses being performed based on only the coding sequences. In contrast to eukaryotes, prokaryotes contain polycistronic transcripts, and when genes are quantified based on coding sequences instead of transcript sequences, this leads to an increased abundance of improperly assigned ambiguous multigene fragments, specifically those mapping to multiple genes in operons. Here, we describe FADU, a quantification tool for prokaryotic RNA-Seq analyses designed to assign proportional counts with the purpose of better quantifying operonic genes while minimizing the pitfalls associated with improperly assigning fragment counts from ambiguous transcripts.

Keywords: bacteria; differential expression; operon; polycistronic transcripts; read count; software; transcriptome; transcriptomics.

Figures

FIG 1
Implementation of FADU. (A) The workflow of FADU uses a BAM file and a GFF annotation file to identify proportional read counts for prokaryotic RNA-Seq analyses. (B) The implementation of FADU differs from those of other similar genome alignment-based quantification tools primarily in the quantification of ambiguous multigene fragments. The three sets of pie charts above a paired-end fragment display how the counts from the fragment are proportionally assigned to its overlapping genes. In the case of two overlapping genes, FADU accounts for only the unique portions of each gene and assigns a proportional count based on the length that the fragment overlaps the feature. (C) In the case of operons, FADU will assign proportional counts to the different genes based on the overlap between the mapping coordinates of the fragment and any overlapping genic features.
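The proportional assignment described in this caption can be illustrated with a minimal Python sketch. It is not FADU's actual implementation: the function names, the closed 1-based coordinates, and the toy two-gene operon are assumptions made for illustration, and the special handling of overlapping genes (counting only the unique portion of each gene) is left out.

```python
def overlap_length(frag_start, frag_end, feat_start, feat_end):
    """Number of bases shared by two closed, 1-based intervals."""
    return max(0, min(frag_end, feat_end) - max(frag_start, feat_start) + 1)


def proportional_counts(fragment, features):
    """Assign each feature the fraction of the fragment that overlaps it.

    fragment: (start, end) mapping coordinates of one paired-end fragment.
    features: {name: (start, end)}; assumed here not to overlap each other.
    """
    frag_start, frag_end = fragment
    frag_len = frag_end - frag_start + 1
    counts = {}
    for name, (feat_start, feat_end) in features.items():
        shared = overlap_length(frag_start, frag_end, feat_start, feat_end)
        if shared > 0:
            counts[name] = shared / frag_len
    return counts


# Hypothetical operon: two genes separated by a 10-bp intergenic gap.
operon = {"geneA": (1, 300), "geneB": (311, 600)}
counts = proportional_counts((201, 400), operon)
print(counts)
```

In this toy case the 200-bp fragment contributes 0.5 to geneA and 0.45 to geneB; the 10 intergenic bases are assigned to neither gene, so the partial counts sum to less than 1.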
FIG 2
Performance of FADU in a simulated differential expression analysis. A 100-bootstrap dendrogram was generated using the counts obtained from 9 different RNA-Seq quantification methods on a simulated E. coli data set. The points at the edges of the dendrogram are colored to represent the different tools corresponding to each method. The colored bar at the edge of the dendrogram represents the different clusters of quantification tools.
FIG 3
Accuracy of FADU in a simulated RNA-Seq data set. (A) For each quantification method, a log2 ratio was calculated for the counts obtained for each gene in the simulated data set versus the actual counts expected from the simulated data set. The distributions for each of these methods would ideally be normal and centered at zero. (B) Zoomed-in version of the distribution generated from the log2 ratios. (C) The interquartile ranges for each of the distributions show the precision of each method.
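The accuracy and precision measures in this caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' analysis code: the function names and toy values are hypothetical, genes with a zero observed or expected count are simply skipped, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ from the method used in the paper.

```python
import math
from statistics import quantiles


def log2_ratios(observed, expected):
    """log2(observed / expected) per gene, skipping zero counts."""
    return [
        math.log2(obs / exp)
        for obs, exp in zip(observed, expected)
        if obs > 0 and exp > 0
    ]


def interquartile_range(values):
    """Spread of the middle 50% of the values (needs at least 2 values)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Toy data: counts reported by a tool versus the simulated truth.
ratios = log2_ratios([8, 4, 2], [4, 4, 4])
iqr = interquartile_range(ratios)
print(ratios, iqr)
```

A distribution centered at zero indicates no systematic over- or undercounting, and a smaller interquartile range indicates a more precise method.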
FIG 4
MA plots comparing wBm fragment counts obtained using FADU against those obtained using other quantification tools. The fragment counts for the stranded wBm data set obtained using FADU were compared to the counts obtained using methods representative of the different clusters of quantification methods. The x axis denotes the mean of the two compared counts (A), while the y axis denotes the log2 ratio of the two compared counts (M) as a scatterplot (left) and a density plot (right). The horizontal orange dotted lines on each plot are drawn at log2 ratio values of 2 and −2. Points with a log2 ratio greater than 2 or less than −2 were defined as genes counted differently by FADU relative to its counterpart tool.
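A minimal sketch of how M and A values and the ±2 cutoff could be computed is given below. The function names and toy counts are hypothetical. A is computed here as the mean of the two log2 counts, which is the usual MA-plot convention and an assumption on our part, since the caption does not state the scale; genes with a zero count in either tool are skipped.

```python
import math


def ma_values(counts_a, counts_b):
    """Return {gene: (A, M)} for genes with nonzero counts in both tools."""
    result = {}
    for gene in counts_a.keys() & counts_b.keys():
        a, b = counts_a[gene], counts_b[gene]
        if a > 0 and b > 0:
            m = math.log2(a / b)                      # log2 ratio
            avg = (math.log2(a) + math.log2(b)) / 2   # assumed log2-scale mean
            result[gene] = (avg, m)
    return result


def counted_differently(ma, threshold=2.0):
    """Genes whose log2 ratio lies above +threshold or below -threshold."""
    return sorted(gene for gene, (_, m) in ma.items() if abs(m) > threshold)


# Toy counts for FADU and one other (unnamed) tool.
fadu = {"g1": 64, "g2": 8, "g3": 10}
other = {"g1": 8, "g2": 8, "g3": 0}
ma = ma_values(fadu, other)
flagged = counted_differently(ma)
print(ma, flagged)
```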
FIG 5
Performance of quantification tools for deriving counts for operons. (A) Using the wBm stranded RNA-Seq data set, the log2 read depth (orange) was plotted for part of an operon containing 11 genes. Genes labeled and marked in blue are all significantly undercounted (log2 count ratio of less than −1), as assessed in Fig. 4. (B) For each quantification tool, the read depth per base pair was calculated for all 11 genes displayed and divided by the median read depth per base pair across the operon for each quantification mode to obtain a normalized relative count value. The log2-transformed normalized relative count values are displayed in the individual cells of the table. -Inf is used to denote when the quantification tool returned “0” for the gene such that the ratio cannot be log transformed. Because the 11 genes are transcribed together, we would expect the normalized values obtained for each of the 11 genes to be ∼0. Normalized values in red cells indicate that the gene has a higher count value than the other operonic genes, while blue cells indicate that the gene has a lower count value than the other operonic genes. Tools that discard ambiguous reads spanning two features in close proximity have a tendency to undercount the smaller genes in operons, such as Wbm7023, Wbm7024, and Wbm7025.
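The normalization in panel B can be sketched as below, under the assumption that read depth per base pair is approximated by a gene's count divided by its length. The function name, gene names, and values are hypothetical and are not taken from the wBm data set.

```python
import math
from statistics import median


def normalized_relative_counts(counts, lengths):
    """log2 of each gene's depth per bp relative to the operon median.

    Returns -inf for genes with a count of 0, mirroring the "-Inf"
    cells described in the caption.
    """
    depth = {gene: counts[gene] / lengths[gene] for gene in counts}
    operon_median = median(depth.values())
    if operon_median <= 0:
        raise ValueError("median depth is zero; ratio is undefined")
    return {
        gene: math.log2(d / operon_median) if d > 0 else float("-inf")
        for gene, d in depth.items()
    }


# Toy operon with three genes of equal length.
counts = {"g1": 200, "g2": 100, "g3": 0}
lengths = {"g1": 100, "g2": 100, "g3": 100}
normalized = normalized_relative_counts(counts, lengths)
print(normalized)
```

Values near 0 indicate a gene counted in line with the rest of the operon; positive values correspond to the red cells and negative values to the blue cells.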
FIG 6
Improper quantification of fragments that span multiple features. (A) Using the wBm stranded RNA-Seq data set, the log2 read depth (orange) was plotted across the wBm gene Wbm7023 and its adjacent genes Wbm0608 and Wbm0609. (B) The counts for the three genes as determined by each of the quantification methods were calculated. The counts assigned to Wbm7023 are all derived from fragments that also map to the 3′ end of Wbm0608. Despite the wBm annotation lacking UTRs, these reads likely originate from the 3′ UTR of Wbm0608, indicating that most if not all the reads assigned to Wbm7023 are erroneous.
FIG 7
FADU timing and memory benchmarking. For the unstranded E. coli and E. chaffeensis and the stranded wBm data sets, the wall clock time for the indexing and/or alignment steps and quantification steps using 4 threads (A) along with the maximum memory (max Vmem) used (green) (B) were recorded for all quantification methods analyzed.
