F1000Res. 2021 Jan 4;10:1.
doi: 10.12688/f1000research.27868.3. eCollection 2021.

WIND (Workflow for pIRNAs aNd beyonD): a strategy for in-depth analysis of small RNA-seq data

Konstantinos Geles et al. F1000Res.

Abstract

Current bioinformatics workflows for PIWI-interacting RNA (piRNA) analysis focus primarily on germline-derived piRNAs and piRNA clusters. Frequently, they suffer from outdated piRNA databases, questionable quantification methods, and lack of reproducibility. Often, pipelines specific to miRNA analysis are used for in silico piRNA research. Furthermore, the absence of a well-established database for piRNA annotation, such as exists for miRNA, leads to uniformity issues between studies and generates confusion for data analysts and biologists. For these reasons, we have developed WIND (Workflow for pIRNAs aNd beyonD), a bioinformatics workflow that addresses the crucial issue of piRNA annotation, thereby allowing a reliable analysis of small RNA sequencing data for the identification of piRNAs and other small non-coding RNAs (sncRNAs) that in the past have been incorrectly classified as piRNAs. WIND allows the creation of a comprehensive annotation track of sncRNAs, combining information available in RNAcentral with piRNA sequences from piRNABank, the first database dedicated to piRNA annotation. WIND was built with Docker containers for reproducibility and integrates widely used bioinformatics tools for sequence alignment and quantification. In addition, it includes Bioconductor packages for exploratory data analysis and differential expression analysis. Moreover, WIND implements a "dual" approach for evaluating sncRNA expression levels: quantifying the reads aligned to the annotated genome, and carrying out an alignment-free transcript quantification using reads mapped to the transcriptome. Therefore, a broader range of piRNAs can be annotated, improving their quantification and easing the subsequent downstream analysis. WIND performance has been tested with several small RNA-seq datasets, demonstrating how our approach can be a useful and comprehensive resource to analyse piRNAs and other classes of sncRNAs.

Keywords: ncRNA-expression; piRNA; small RNA sequencing; workflow.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Workflow schematic representation.
The Annotation forging step, represented in blue, creates a GTF file in which the two input databases (piRNABank and RNAcentral) are merged to produce the new small RNA annotation track. This track, together with the FASTA files, constitutes the input of the following step. In the Pre-processing & Quantification step (light blue area), the user's FASTQ files undergo quality checking and adapter removal, followed by the two quantification approaches (Salmon, and STAR with featureCounts), which perform alignment and read quantification in parallel. The green box, representing the Exploratory data analysis phase, displays all the possible results produced by the workflow. The data analyst can also pursue differential expression analysis if desired.
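The merging idea in the Annotation forging step can be illustrated with a minimal Python sketch. It is not WIND's implementation: the `Feature` record, the `merge_annotations` and `to_gtf_line` helpers, the rule "drop a second-source entry only when its locus exactly matches one already present", the choice of which source takes precedence, and all identifiers and coordinates are assumptions made for illustration. Only the standard nine-column GTF layout is taken as given.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int    # 1-based, inclusive (GTF convention)
    end: int      # 1-based, inclusive
    strand: str
    gene_id: str  # hypothetical identifiers, for illustration only
    biotype: str
    source: str


def merge_annotations(primary, secondary):
    """Keep every primary feature; add secondary features whose exact locus is new."""
    seen = {(f.chrom, f.start, f.end, f.strand) for f in primary}
    merged = list(primary)
    for f in secondary:
        key = (f.chrom, f.start, f.end, f.strand)
        if key not in seen:
            seen.add(key)
            merged.append(f)
    return sorted(merged, key=lambda f: (f.chrom, f.start, f.end))


def to_gtf_line(f):
    """Format one feature as a tab-separated, nine-column GTF record."""
    attributes = f'gene_id "{f.gene_id}"; gene_type "{f.biotype}";'
    return "\t".join(
        [f.chrom, f.source, "exon", str(f.start), str(f.end), ".", f.strand, ".", attributes]
    )


# Toy records standing in for the two databases (invented values).
rnacentral = [
    Feature("chr1", 100, 121, "+", "URS_example_1", "miRNA", "RNAcentral"),
    Feature("chr1", 500, 528, "-", "URS_example_2", "piRNA", "RNAcentral"),
]
pirnabank = [
    Feature("chr1", 500, 528, "-", "hsa_piR_example_1", "piRNA", "piRNABank"),
    Feature("chr2", 40, 69, "+", "hsa_piR_example_2", "piRNA", "piRNABank"),
]

merged = merge_annotations(rnacentral, pirnabank)
gtf_lines = [to_gtf_line(f) for f in merged]
print("\n".join(gtf_lines))
```

In this toy input the piRNABank entry at chr1:500-528 duplicates an RNAcentral locus and is dropped, so three records are written. A real merge would likely also need to handle partially overlapping loci and differing genome builds, which this sketch ignores.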
Figure 2. Example of plots generated by WIND.
A) and B) Biodetection plots (genomic approach) from NOISeq reporting: the percentage of each sncRNA type (called "biotype") on the genome (grey bar) for one of the samples; the proportion detected in each sample (red striped bar); and the percentage of each biotype within the sample (red bar). The biotypes on the right side of the green dashed line are the least abundant, and their reference values are reported on the right Y axis. C) and D) CountsBio plots (genomic approach) from NOISeq showing the count distribution for each biotype as boxplots. Numbers on top of the plot show how many sncRNAs are detected per biotype in the entire dataset analysed. Different colours indicate different sncRNA classes. E) and F) Sequence logos (nucleotides 1-15) extracted from the sequences of the expressed piRNAs found in each group of samples (transcriptomic approach). A, C and E represent the results obtained for one representative testis sample, while B, D and F represent one representative COLO 205 sample.
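The data behind a sequence logo such as panels E and F is a table of per-position nucleotide frequencies. The sketch below computes that table for the first 15 positions in plain Python. The function name `position_frequencies` and the four toy sequences are invented for illustration, and the step that scales letter heights by information content, which logo tools normally apply, is omitted.

```python
from collections import Counter


def position_frequencies(sequences, length=15):
    """Return one dict per position mapping nucleotide -> relative frequency."""
    columns = []
    for pos in range(length):
        # Sequences shorter than the current position simply do not contribute.
        counts = Counter(seq[pos] for seq in sequences if len(seq) > pos)
        total = sum(counts.values())
        columns.append({nt: n / total for nt, n in counts.items()} if total else {})
    return columns


# Invented sequences, not taken from any database.
toy = [
    "TGAAATGCCAGGTTCAAGT",
    "TACCGGATTTGCAGGACCA",
    "TGCATTTAGGCATCAGGTA",
    "AGGTACCATTGCAAGGTCA",
]

freqs = position_frequencies(toy)
print(freqs[0])  # frequencies at the first nucleotide
```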
Figure 3. Exploratory data analysis plots generated by WIND.
A, B) Histograms of average log2 counts per million (CPM) among all samples before (A) and after (B) filtering with one of the selected methods (edgeR filtering in this case) for sncRNA data. C, D) Relative Log Expression (RLE) plots for each normalisation method, made with the plotRLE function for all the sncRNA data. As an example, only the first two plots are shown: without normalisation (C) and with TMM (D), for the filtered counts derived from featureCounts. E, F) Hierarchical clustering plots, exploiting all the sequenced sncRNA data, with multiple clustering methods and different normalisation methods. As an example, only the first two plots (with TMM and without normalisation, for the filtered counts derived from featureCounts) are shown. The two groups (monolayer and spheroid) are shown in black and brown.
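The two quantities plotted in panels A to D can be written out in a few lines. The sketch below is a simplified Python illustration, not the edgeR or plotRLE code that WIND calls: it uses a plain pseudocount of 1 on the CPM scale rather than edgeR's library-size-scaled prior count, and the function names and toy count table are assumptions.

```python
import math
from statistics import median


def log2_cpm(counts, pseudo=1.0):
    """counts: {feature: [raw count per sample]} -> {feature: [log2(CPM + pseudo)]}."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        feat: [math.log2(row[j] / lib_sizes[j] * 1e6 + pseudo) for j in range(n_samples)]
        for feat, row in counts.items()
    }


def relative_log_expression(log_expr):
    """Subtract each feature's median across samples from its log expression."""
    return {
        feat: [v - median(values) for v in values]
        for feat, values in log_expr.items()
    }


# Invented counts: three features in three samples, each library summing to 100.
counts = {
    "piR_a": [10, 20, 30],
    "piR_b": [0, 5, 10],
    "miR_c": [90, 75, 60],
}

lcpm = log2_cpm(counts)
rle = relative_log_expression(lcpm)
for feat, values in rle.items():
    print(feat, [round(v, 3) for v in values])
```

An RLE plot then draws one boxplot per sample from the columns of `rle`; boxes centred near zero with similar spread suggest that the normalisation has removed most sample-level shifts.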
Figure 4. Sample group clustering plots.
A) Correlation plot showing the distances between samples in the GSE68246 dataset. The darker the colour, the more correlated the samples are. B) Multidimensional Scaling (MDS) plot using all the sequenced data and one of the normalisation methods applied in the workflow (in this case, TMM), made with the plotMDS() function from edgeR. The two groups (monolayer and spheroid) are shown in black and brown. C) Principal Component Analysis (PCA) plot displaying the first two principal components using the data for all sncRNA molecules. Each sample is shown with a different colour (depending on the group) and a different symbol (depending on the batch).
Figure 5. Barplots of the length of piRNA classes with respect to each experimental group (in this case monolayer and spheroid MCF7).
The colours indicate the two different methods of quantification (genomic and transcriptomic).
Figure 6. Differential expression analysis.
A) Heatmap of the piRNAs differentially expressed in 3 MCF7 Spheroid samples versus 3 MCF7 Monolayer samples (GSE68246 public dataset) that were found by both approaches (genomic and transcriptomic). B) Heatmap of the piRNAs differentially expressed in 9 Primary Solid Tumour samples versus 9 Solid Tissue Normal samples from TCGA that were found by both approaches (genomic and transcriptomic).
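Selecting the piRNAs "found by both approaches" amounts to intersecting two result tables. The following Python sketch shows one way to do it under stated assumptions: the significance cut-offs (`alpha`, `min_abs_lfc`), the requirement that the fold change has the same sign in both tables, the function name `common_de`, and the toy values are all hypothetical, since the text here does not give WIND's criteria.

```python
def common_de(genomic, transcriptomic, alpha=0.05, min_abs_lfc=1.0):
    """Return features significant in both tables with the same direction of change.

    Each table maps feature name -> (log2 fold change, adjusted p-value).
    """
    def significant(results):
        return {
            name
            for name, (lfc, padj) in results.items()
            if padj < alpha and abs(lfc) >= min_abs_lfc
        }

    shared = significant(genomic) & significant(transcriptomic)
    return sorted(
        name
        for name in shared
        if (genomic[name][0] > 0) == (transcriptomic[name][0] > 0)
    )


# Invented results for four piRNAs under each quantification approach.
genomic = {
    "piR_1": (2.1, 0.001),
    "piR_2": (-1.5, 0.02),
    "piR_3": (1.2, 0.20),   # not significant in the genomic approach
    "piR_4": (1.8, 0.01),
}
transcriptomic = {
    "piR_1": (1.9, 0.004),
    "piR_2": (-1.1, 0.03),
    "piR_3": (1.4, 0.01),
    "piR_4": (-1.3, 0.02),  # significant, but in the opposite direction
}

consensus = common_de(genomic, transcriptomic)
print(consensus)
```

Requiring agreement in direction is a conservative choice; dropping that check would return every feature that passes the cut-offs in both tables regardless of sign.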
