A bioinformatic-assisted workflow for genome-wide identification of ncRNAs

Matthias Schmal¹, Crystal Girod², Debbie Yaver², Robert L Mach³, Astrid R Mach-Aigner¹

Affiliations

¹ Christian Doppler laboratory for optimized expression of carbohydrate-active enzymes, Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Gumpendorfer Str. 1A, Vienna A-1060, Austria.
² Production Strain Technology, Novozymes Inc., Davis, California, USA.
³ Institute of Chemical, Environmental and Bioscience Engineering, TU Wien, Gumpendorfer Str. 1A, Vienna A-1060, Austria.

PMID: 35979446
PMCID: PMC9376865
DOI: 10.1093/nargab/lqac059

Matthias Schmal et al. NAR Genom Bioinform. 2022 Aug 15;4(3):lqac059. doi: 10.1093/nargab/lqac059. eCollection 2022 Sep.

Abstract

With the advent of affordable next-generation sequencing technologies, the number of known non-protein-coding RNAs has increased drastically in recent years. Different types of non-coding RNAs (ncRNAs) have emerged as key players in the regulation of gene expression at the RNA-RNA, RNA-DNA and RNA-protein levels, with roles ranging from chromatin remodeling and transcription regulation to post-transcriptional modifications. Predicting ncRNAs requires several bioinformatics tools and can be a daunting task for researchers. This led to the development of analysis pipelines such as UClncR and LncPipe. However, these pipelines are limited to datasets from human, mouse, zebrafish or fruit fly and cannot analyze RNA sequencing data from other organisms. In this study, we developed the analysis pipeline Pinc (Pipeline for prediction of ncRNA) as an enhanced tool that predicts ncRNAs from sequencing data by removing transcripts that show protein-coding potential. Additionally, Pinc implements a feature for differential expression analysis of annotated genes as well as for identification of novel ncRNAs. Pinc uses Nextflow as a framework and is built with robust and well-established analysis tools. This will allow researchers to use sequencing data from any organism to reliably identify ncRNAs.

Figures

**Figure 1.**
Overview of the generation of training data for the test runs. (A) Coding sequences (CDS; blue boxes) of *H. sapiens*, *A. thaliana* and *S. cerevisiae* were taken directly from RefSeq. Sufficient ncRNA data are available for human and thale cress, whereas the *S. cerevisiae* dataset is too small to train CPAT on alone. Therefore, all RefSeq ncRNA entries of the phylum Ascomycota were used as non-coding training data for CPAT. CDS were split. Right part: a portion of the CDS was used to contaminate the set of ncRNAs (white box) with coding RNAs in a ratio of 5:1 (blue, hatched box). This was done to simulate unannotated coding transcripts, which might still be in the dataset after filtering out all annotated CDS of the genome. CPC2 was used to predict ncRNAs within this mixed RNA pool. 80% of the sequences predicted as non-coding were used as the non-coding training set (white box) for CPAT and 20% as the non-coding test set. Left part: the remaining CDS were split again: 80% for the coding training set, 20% for the coding test set. (B) The training datasets of ncRNAs (white and hatched boxes) and CDS (blue box) were combined in a ratio of 1:1. 10-fold stratified cross-validation (CV) was used to calculate the model-specific ‘optimal’ cut-off. In each iteration the weighted Youden's index was used to calculate the cut-off. The mean of the cut-offs from all 10 iterations is used to predict ncRNAs based on their coding probability calculated by CPAT.
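The cut-off selection described in panel (B) can be illustrated with a minimal sketch. It assumes that every sequence already carries a coding probability (in the real workflow these would come from CPAT), that the weighted Youden's index takes the common form J_w = 2(w·sensitivity + (1−w)·specificity) − 1, and that the cut-off is chosen on each fold separately. The weight `w`, the function names and the fold handling are illustrative assumptions, not taken from the Pinc source.

```python
import random
from statistics import mean

def weighted_youden(scores, labels, cutoff, w=0.5):
    """Weighted Youden's index at one cut-off (label 1 = coding, 0 = non-coding).

    Assumed form: J_w = 2 * (w * sensitivity + (1 - w) * specificity) - 1.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    pos = sum(labels)
    neg = len(labels) - pos
    return 2 * (w * tp / pos + (1 - w) * tn / neg) - 1

def best_cutoff(scores, labels, w=0.5):
    """Return the observed score that maximises the weighted Youden's index."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda c: weighted_youden(scores, labels, c, w))

def stratified_folds(labels, k=10, seed=0):
    """Distribute indices over k folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def mean_cv_cutoff(scores, labels, k=10, w=0.5, seed=0):
    """Average the per-fold optimal cut-offs over k stratified folds."""
    cutoffs = []
    for fold in stratified_folds(labels, k, seed):
        fold_scores = [scores[i] for i in fold]
        fold_labels = [labels[i] for i in fold]
        cutoffs.append(best_cutoff(fold_scores, fold_labels, w))
    return mean(cutoffs)
```

Each class needs at least `k` members so that every fold contains both coding and non-coding sequences; with `w = 0.5` the index reduces to the ordinary Youden's index (sensitivity + specificity − 1).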

**Figure 2.**
Overview of the generation of training data used in Pinc. (A) Right part: the filtered transfrags (white box) are subjected to CPC2 prediction of ncRNAs. 80% of the sequences predicted as non-coding are used as the non-coding training set for CPAT and 20% as the non-coding test set. Left part: coding sequences (CDS; blue boxes) are taken from the provided genome annotation and split: 80% for the coding training set, 20% for the coding test set. (B) The training set that consists of ncRNAs (white box) is combined with the CDS training set (blue box) in a ratio of 1:1. 10-fold stratified cross-validation (CV) is used to calculate the model-specific ‘optimal’ cut-off. In each iteration the weighted Youden's index is used to calculate the cut-off. The mean of the cut-offs from all 10 iterations is used to predict ncRNAs based on their coding probability calculated by CPAT.
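The 80/20 split and the 1:1 combination of coding and non-coding training sequences can be sketched as follows. The caption does not say how the 1:1 ratio is reached, so downsampling the larger class is an assumption here, as are the function names and the fixed random seed.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def balance_one_to_one(coding, noncoding, seed=0):
    """Downsample the larger class so both classes have equal size (assumed strategy)."""
    rng = random.Random(seed)
    n = min(len(coding), len(noncoding))
    return rng.sample(coding, n), rng.sample(noncoding, n)

# Hypothetical sequence identifiers standing in for CDS and CPC2-predicted ncRNAs.
cds_ids = [f"cds_{i}" for i in range(100)]
ncrna_ids = [f"ncrna_{i}" for i in range(50)]

cds_train, cds_test = split_train_test(cds_ids)
nc_train, nc_test = split_train_test(ncrna_ids)
cds_balanced, nc_balanced = balance_one_to_one(cds_train, nc_train)
```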

**Figure 3.**
Graphical overview of Pinc. Raw sequencing reads are filtered based on quality and length using fastp. Subsequently, HISAT2 aligns the reads against the reference genome. StringTie assembles the aligned reads into transfrags. Transfrags of already annotated features are removed by filtering for putative novel ncRNAs based on gffcompare's transfrag classification code. Together with the protein-coding RNAs from the reference annotation, an organism-specific model is trained using CPC2 and CPAT to assess the coding probability of all putative, novel, non-coding transfrags. As edgeR requires the total count of reads mapped to each transfrag for a differential expression analysis, HTSeq-count is used to count the reads.
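The class-code filtering step can be illustrated with a short sketch that reads the `class_code` attribute gffcompare writes to transcript records of its annotated GTF output. The caption does not state which class codes Pinc retains; the set used below (`u`, `x`, `i`) is only an assumed example, and the function name is hypothetical.

```python
import re

CLASS_CODE = re.compile(r'class_code "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def novel_transcript_ids(gtf_lines, keep_codes=frozenset({"u", "x", "i"})):
    """Return IDs of transcript records whose class code is in keep_codes.

    keep_codes is an assumed example set, not the set used by Pinc.
    """
    ids = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records carry the class code
        code = CLASS_CODE.search(fields[8])
        tid = TRANSCRIPT_ID.search(fields[8])
        if code and tid and code.group(1) in keep_codes:
            ids.append(tid.group(1))
    return ids

# Made-up records for illustration only.
example = [
    '# gffcompare output\n',
    'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\ttranscript_id "T1"; gene_id "G1"; class_code "=";\n',
    'chr1\tStringTie\ttranscript\t2000\t2900\t.\t-\t.\ttranscript_id "T2"; gene_id "G2"; class_code "u";\n',
    'chr1\tStringTie\texon\t2000\t2900\t.\t-\t.\ttranscript_id "T2"; gene_id "G2";\n',
    'chr2\tStringTie\ttranscript\t50\t700\t.\t+\t.\ttranscript_id "T3"; gene_id "G3"; class_code "x";\n',
]
novel = novel_transcript_ids(example)
```

Transfrags that match annotated features exactly (class code `=`) are dropped, while the retained IDs would be passed on to the coding-potential assessment.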

Cited by

AI-powered precision medicine: utilizing genetic risk factor optimization to revolutionize healthcare.
Alsaedi S, Ogasawara M, Alarawi M, Gao X, Gojobori T. NAR Genom Bioinform. 2025 May 5;7(2):lqaf038. doi: 10.1093/nargab/lqaf038. eCollection 2025 Jun. PMID: 40330081. Free PMC article. Review.
The expression landscape and pangenome of long non-coding RNA in the fungal wheat pathogen Zymoseptoria tritici.
Glad HM, Tralamazza SM, Croll D. Microb Genom. 2023 Nov;9(11):001136. doi: 10.1099/mgen.0.001136. PMID: 37991492. Free PMC article.

References

1. Christov C.P., Gardiner T.J., Szüts D., Krude T. Functional requirement of noncoding Y RNAs for human chromosomal DNA replication. Mol. Cell. Biol. 2006; 26:6993–7004.
2. Statello L., Guo C.-J., Chen L.-L., Huarte M. Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 2021; 22:96–118.
3. Sun Z., Nair A., Chen X., Prodduturi N., Wang J., Kocher J.-P. UClncR: ultrafast and comprehensive long non-coding RNA detection from RNA-seq. Sci. Rep. 2017; 7:14196.
4. Zhao Q., Sun Y., Wang D., Zhang H., Yu K., Zheng J., Zuo Z. LncPipe: a Nextflow-based pipeline for identification and analysis of long non-coding RNAs from RNA-Seq data. J. Genet. Genomics. 2018; 45:399–401.
5. Chen S., Zhou Y., Chen Y., Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018; 34:i884–i890.
