BMC Bioinformatics. 2024 Oct 24;25(Suppl 2):335.
doi: 10.1186/s12859-024-05924-1.

DNA-protein quasi-mapping for rapid differential gene expression analysis in non-model organisms

Kyle Christian L Santiago et al. BMC Bioinformatics.

Abstract

Background: Conventional differential gene expression analysis pipelines for non-model organisms require computationally expensive transcriptome assembly. We recently proposed an alternative strategy of directly aligning RNA-seq reads to a protein database, and demonstrated drastic improvements in speed, memory usage, and accuracy in identifying differentially expressed genes.

Results: Here we report a further speed-up obtained by replacing DNA-protein alignment with quasi-mapping, making our pipeline more than 1000× faster than the assembly-based approach while remaining more accurate. We also compare quasi-mapping to other mapping techniques, and show that it is faster, but at the cost of sensitivity.

Conclusion: We provide a quick-and-dirty differential gene expression analysis pipeline for non-model organisms without a reference transcriptome, which directly quasi-maps RNA-seq reads to a reference protein database, avoiding computationally expensive transcriptome assembly.

Keywords: DNA-protein alignment; Differential gene expression analysis; Non-model organism; Quasi-mapping; RNA-seq.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Example of the reference index data structures for the set of sequences MVVAV, VAVNV, VAAVV. C is the concatenated string, B is the bit vector, SA is the suffix array, and HT is a hash table mapping each 2-mer to the suffix array interval containing the suffixes that start with that 2-mer.
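The four structures named in the Fig. 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `$` separator, the choice to set bits of B at separator positions, the naive sorting-based suffix array construction, and the function names `build_index` and `occurrences` are all hypothetical.

```python
from itertools import accumulate

def build_index(seqs, k=2, sep="$"):
    """Build C, B, SA and HT for a list of protein sequences (illustrative only)."""
    # C: all sequences concatenated, each followed by a separator (assumption).
    C = "".join(s + sep for s in seqs)
    # B: bit vector with a 1 at every separator position (assumption).
    B = [1 if c == sep else 0 for c in C]
    # SA: suffix array built by naive sorting; fine for a toy example.
    SA = sorted(range(len(C)), key=lambda i: C[i:])
    # HT: k-mer -> half-open SA interval [lo, hi) of suffixes starting with it.
    HT = {}
    for rank, pos in enumerate(SA):
        kmer = C[pos:pos + k]
        if len(kmer) < k or sep in kmer:
            continue  # skip k-mers that run past or across a sequence boundary
        lo, _ = HT.get(kmer, (rank, rank))
        HT[kmer] = (lo, rank + 1)
    return C, B, SA, HT

def occurrences(kmer, index):
    """Return (sequence id, offset within sequence) for every hit of kmer."""
    C, B, SA, HT = index
    lo, hi = HT.get(kmer, (0, 0))
    seps_up_to = list(accumulate(B))  # number of separators at or before i
    starts = [0] + [i + 1 for i, bit in enumerate(B) if bit]
    hits = []
    for pos in SA[lo:hi]:
        seq_id = seps_up_to[pos]      # separators before pos = sequence id
        hits.append((seq_id, pos - starts[seq_id]))
    return hits

index = build_index(["MVVAV", "VAVNV", "VAAVV"], k=2)
```

With this toy index, `sorted(occurrences("VA", index))` lists one hit in each of the three sequences. A single hash lookup returns the whole suffix array interval, which is what makes k-mer lookup cheap compared with alignment.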
Algorithm 1
Quasi-mapping of one translation of a read. This process is repeated for all six possible reading frames.
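The six frames mentioned in the Algorithm 1 caption are the three forward and three reverse-complement reading frames of a read. The sketch below shows one way to produce them, assuming the standard genetic code (NCBI translation table 1) and writing `X` for any codon that contains an ambiguous base; the function names are hypothetical and the quasi-mapping step itself is not reproduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI table 1).
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(dna):
    """Reverse-complement an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(read):
    """Return translations for frames +1, +2, +3, -1, -2, -3 (in that order)."""
    read = read.upper()
    rc = reverse_complement(read)
    return [translate(strand[offset:]) for strand in (read, rc)
            for offset in range(3)]

frames = six_frame_translations("ATGGTTGTTGCTGTT")
```

Each of the six strings would then be mapped independently against the protein index.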
Fig. 2
Mapping performance of the different aligners/mappers when using different reference proteomes. For DIAMOND, the points correspond to the available presets: normal, fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, and ultra-sensitive. For Kaiju, the points correspond to the greedy and MEM modes. For LAST, the maximum mismap probability was set to 0.95, 0.9, and 0.8. For our method, we used a k-mer size of 7 and varied the coverage threshold among 40, 50, and 60.
Fig. 3
Comparison of mapping run times for a typical sample containing roughly 21 million read pairs, each read 100 bp long. Each tool was run on a single thread with the following settings: normal for DIAMOND, greedy for Kaiju, a maximum mismap probability of 0.95 for LAST, and a k-mer size of 7 with a coverage threshold of 40 for our method.
Fig. 4
Precision-recall curves for differential gene expression analysis using different tools for the mapping step and different reference proteomes. The three shape markers on each curve correspond to setting the false discovery rate threshold in DESeq2 to 0.01, 0.05, and 0.1.
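To make the three markers per curve concrete, the sketch below computes one precision-recall point per false discovery rate threshold. It assumes a gene is called differentially expressed when its adjusted p-value is below the threshold, and that a ground-truth set of differentially expressed genes is available; the gene names and values are made up for illustration and are not results from the paper.

```python
def precision_recall(called, truth):
    """Precision and recall of a set of called genes against a truth set."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

def pr_points(padj, truth, thresholds=(0.01, 0.05, 0.1)):
    """One (threshold, precision, recall) tuple per FDR threshold."""
    points = []
    for alpha in thresholds:
        # Call a gene differentially expressed if its adjusted p-value < alpha.
        called = {gene for gene, p in padj.items() if p < alpha}
        points.append((alpha, *precision_recall(called, truth)))
    return points

# Hypothetical adjusted p-values and truth set, for illustration only.
toy_padj = {"g1": 0.001, "g2": 0.03, "g3": 0.08, "g4": 0.5}
toy_truth = {"g1", "g2", "g4"}
toy_points = pr_points(toy_padj, toy_truth)
```

Loosening the threshold can only add genes to the called set, so recall never decreases from one marker to the next, while precision may rise or fall.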
Fig. 5
Comparison of the performance of our method on the full amino acid alphabet versus a reduced alphabet of size 11. For each curve, the three points from left to right correspond to coverage thresholds of 60, 50, and 40, respectively.
