Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-seq datasets
- PMID: 34179780
- PMCID: PMC8221386
- DOI: 10.1093/nargab/lqab058
Abstract
The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a treasure trove of functional information, allowing the expression of known or novel transcripts to be quantified in tissues. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time, and that do not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify gene expression accurately in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling gene expression to be measured with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata, including library protocol, sample features or contamination, from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.
© The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
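The idea of a specific k-mer signature described in the abstract can be illustrated with a toy sketch. This is not the Kmerator implementation: the function names (`kmers`, `specific_kmers`, `count_signature`), the toy sequences and the choice of k = 5 are all assumptions made for illustration, and the sketch ignores reverse complements and any check against a reference genome. It only shows the principle: decompose sequences into overlapping k-mers, keep those found in a single gene, then count them in reads.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping k-mer of seq, in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def specific_kmers(transcripts, target, k):
    """Return the k-mers of `target` that occur in no other transcript.

    `transcripts` is a hypothetical dict mapping a gene name to its sequence.
    """
    others = set()
    for name, seq in transcripts.items():
        if name != target:
            others.update(kmers(seq, k))
    # dict.fromkeys removes duplicates while preserving order
    unique_in_target = dict.fromkeys(kmers(transcripts[target], k))
    return [km for km in unique_in_target if km not in others]


def count_signature(reads, signature, k):
    """Count occurrences of each signature k-mer across a list of reads."""
    wanted = set(signature)
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in wanted:
                counts[km] += 1
    return counts


# Toy data (assumed, not from the paper)
transcripts = {"geneA": "ACGTACGGA", "geneB": "TTACGTACC"}
signature = specific_kmers(transcripts, "geneA", k=5)
reads = ["GTACGGA", "TTACGTA"]
counts = count_signature(reads, signature, k=5)
```

With these toy inputs, the k-mers shared by both genes are discarded, and only the first read contributes to the counts for geneA. A real signature would be designed with a much larger k and checked against a full reference.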