Nat Methods. 2024 May;21(5):793-797.

doi: 10.1038/s41592-024-02229-2. Epub 2024 Mar 20.

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms

Francisco J Pardo-Palacios^#^{1,2}, Angeles Arzalluz-Luque^#^{1,2}, Liudmyla Kondratova^{3,4}, Pedro Salguero², Jorge Mestre-Tomás¹, Rocío Amorín^{4,5}, Eva Estevan-Morió¹, Tianyuan Liu¹, Adalena Nanni⁶, Lauren McIntyre^{4,6}, Elizabeth Tseng⁷, Ana Conesa⁸

Affiliations

¹ Institute for Integrative Systems Biology, Spanish National Research Council, Paterna, Valencia, Spain.
² Department of Applied Statistics and Operational Research, and Quality, Universitat Politècnica de València, Valencia, Valencia, Spain.
³ Horticultural Sciences Department, University of Florida, Gainesville, FL, USA.
⁴ Genetics Institute, University of Florida, Gainesville, FL, USA.
⁵ Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA.
⁶ Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL, USA.
⁷ Pacific Biosciences, Menlo Park, CA, USA.
⁸ Institute for Integrative Systems Biology, Spanish National Research Council, Paterna, Valencia, Spain. ana.conesa@csic.es.

^# Contributed equally.

PMID: 38509328
PMCID: PMC11093726
DOI: 10.1038/s41592-024-02229-2

Francisco J Pardo-Palacios et al. Nat Methods. 2024 May.

Abstract

SQANTI3 is a tool designed for the quality control, curation and annotation of long-read transcript models obtained with third-generation sequencing technologies. Leveraging its annotation framework, SQANTI3 calculates quality descriptors of transcript models, junctions and transcript ends. With this information, potential artifacts can be identified and replaced with reliable sequences. Furthermore, the integrated functional annotation feature enables subsequent functional iso-transcriptomics analyses.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Overview of SQANTI3.**
a, SQANTI3 workflow. b, Main SQANTI structural categories for transcript models of known genes. c, SQANTI3 subcategories for FSM and ISM transcripts. d, Orthogonal data features processed by SQANTI3 QC. LR, long read; SJ, splice junction.

**Fig. 2. Validation of SQANTI3 features.**
a, Distribution of transcript models by SQANTI3 structural categories in the WTC11 IsoSeq3-defined transcriptome. b, Differences in TSS ratio between TSS supported and not supported by CAGE-seq data (****P = 2 × 10⁻¹⁶, two-sided Wilcoxon test). c, Uneven support of CAGE-seq data for FSM and ISM transcript models, with supported TSS usually having a TSS ratio > 1.5 (red dashed line), particularly if they are also known TSS. d, Variable importance of SQANTI3 descriptors in the machine learning (ML) filter for different input scenarios, obtained after training the random forest model on different true-positive (TP) sets. e, Distribution of values for the top machine learning filter variables (ranked by random forest classifier importance) for isoforms and artifacts across the FSM (35,446 isoforms, 21,079 artifacts), ISM (2,573 isoforms, 65,114 artifacts), NIC (15,504 isoforms, 33,367 artifacts) and NNC (1,556 isoforms, 34,182 artifacts) structural categories. f, Variation in the number of genes and transcripts after filter and rescue using sample-specific orthogonal data with rules and machine learning approaches. g, Performance metrics according to SIRV detection at each step of the SQANTI3 pipeline using rules and machine learning approaches. FDR, false discovery rate; NDR, novel detection rate; ODR, over-annotation detection rate. For all boxplots in this figure, the middle line represents the median, the ends of the box represent the 25th (quartile 1) and 75th (quartile 3) percentiles, and the whiskers represent the minimum (quartile 1 minus 1.5-fold the interquartile range (IQR)) and the maximum (quartile 3 plus 1.5-fold the IQR). The half-violin plots show the density distribution of values.
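The boxplot convention stated in the Fig. 2 legend (median, quartiles, and whiskers at 1.5-fold the IQR beyond the quartiles) can be illustrated with a minimal Python sketch. The function name `boxplot_summary`, the toy input values and the choice of the "inclusive" quantile method are assumptions made for illustration; the legend does not say how the quartiles were interpolated or which software drew the plots.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Compute the boxplot elements named in the legend for a list of numbers.

    Assumption: quartiles use the 'inclusive' interpolation method; the
    legend does not specify one.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whisker limits as defined in the legend: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
        "lower_whisker": q1 - 1.5 * iqr,
        "upper_whisker": q3 + 1.5 * iqr,
    }


# Toy values, not data from the paper.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

Values falling outside the two whisker limits would be treated as outliers under this convention.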

**Extended Data Fig. 1. SQANTI3 Rescue workflow.**
1) If an FSM-supported reference transcript is lost during filtering, the reference version of that transcript is automatically rescued. 2) The remaining LR-defined transcript models that were filtered out (rescue candidates) are mapped against the reference transcriptome combined with the accepted LR-defined isoforms (rescue targets), allowing several hits per candidate. 3) The reference transcriptome is previously evaluated and filtered with the same data and criteria as the LR-defined transcripts. 4) Rescue is completed by evaluating the targets, which must pass the filter and must not increase redundancy: if a target is an LR-defined transcript already present, or a reference transcript already represented as an FSM in the filtered transcriptome, it is not added to the final annotation. LR: Long-read, ML: Machine Learning, FSM: Full-Splice-Match, ISM: Incomplete-Splice-Match, NIC: Novel-In-Catalog, NNC: Novel-Not-In-Catalog.
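The decision logic in this legend can be sketched in a few lines of Python. This is not SQANTI3 code: the function `rescue`, the dictionary layout, and all transcript identifiers are hypothetical, and the mapping step is assumed to have already produced a list of hits per candidate. The sketch only mirrors the four steps as the legend describes them.

```python
def rescue(discarded, accepted, hits, reference_passing):
    """Return the set of reference transcripts to add back (illustrative only).

    discarded / accepted: {model_id: {"category": str, "reference": str or None}}
    hits: {candidate_id: [target_id, ...]} from the (assumed) mapping step
    reference_passing: reference transcript IDs that pass the same filter
    """
    # Reference transcripts already represented by an accepted FSM.
    represented = {
        m["reference"] for m in accepted.values() if m["category"] == "FSM"
    }
    rescued = set()

    # Step 1: automatic rescue of references whose FSM support was lost.
    for model in discarded.values():
        if model["category"] == "FSM" and model["reference"] not in represented:
            rescued.add(model["reference"])

    # Steps 2-4: evaluate the targets hit by the remaining candidates.
    for cand_id, model in discarded.items():
        if model["category"] == "FSM":
            continue  # FSM candidates are handled by automatic rescue only
        for target in hits.get(cand_id, []):
            if target in accepted:
                continue  # LR-defined transcript already present
            if target in represented:
                continue  # reference already represented as an FSM
            if target in reference_passing:
                rescued.add(target)  # passes the filter, adds no redundancy
    return rescued


# Hypothetical toy input.
accepted = {"lr_1": {"category": "FSM", "reference": "ref_A"}}
discarded = {
    "lr_2": {"category": "FSM", "reference": "ref_B"},
    "lr_3": {"category": "ISM", "reference": "ref_A"},
    "lr_4": {"category": "NIC", "reference": None},
}
hits = {"lr_3": ["ref_A", "lr_1"], "lr_4": ["ref_C", "ref_D"]}
result = rescue(discarded, accepted, hits, reference_passing={"ref_A", "ref_C"})
print(sorted(result))
```

In this toy case `ref_B` is recovered by automatic rescue, `ref_C` is recovered because it passes the filter and is not yet represented, and `ref_A`, `lr_1` and `ref_D` are not added.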

**Extended Data Fig. 2. Agreement in TSS validation using different data sources.**
Number of TSS identified using the TSS ratio (threshold = 1.5) based on matching short-read RNA-seq data, sample-specific CAGE-seq data and the refTSS database. TSS: Transcript Starting Site.

**Extended Data Fig. 3. Agreement in TTS validation using different data sources.**
Number of TTS identified using sample-specific Quant-seq data, presence of polyA motif and the PolyASite database. WTC11 PacBio lrRNA-seq data. TTS: Transcript Termination Site.

**Extended Data Fig. 4. Frequency distribution of the transcript model distances between their detected polyA motif and the closest reference polyA site.**
Data are stratified by SQANTI3 structural category and separated according to the existing Quant-seq data support. WTC11 PacBio lrRNA-seq data.

**Extended Data Fig. 5. Distribution of transcript model distances between their detected polyA motif and the closest reference polyA site.**
Data are broken down by SQANTI3 structural category and separated depending on whether transcript models were flagged as potential intrapriming artifacts. Boxes indicate median (middle line), 25th (Q1) and 75th (Q3) percentiles (box hinges); whiskers represent min = Q1 - 1.5 ⋅ Interquartile Range (IQR) and max = Q3 + 1.5 ⋅ IQR; dots constitute outliers. FSM: Full-Splice-Match, ISM: Incomplete-Splice-Match, NIC: Novel-In-Catalog, NNC: Novel-Not-In-Catalog.

**Extended Data Fig. 6. Relationship between the SQANTI3 structural categories of discarded transcripts (rescue candidates) and their rescue targets in the Machine Learning (ML) - High Input Sample filtering scenario.**
Rescue candidates are shown on the y-axis, stratified by structural category. Candidates correspond to transcripts discarded by the ML filter, that is, artifacts. Rescue targets are shown on the x-axis, spread across structural categories and including reference transcriptome hits. Targets correspond to transcripts mapped by artifacts during the rescue process. In this mapping process, each candidate can map to multiple targets, which are similar to the candidate in sequence and exon structure. Heatmap color therefore corresponds to the number of hits (log10) involving each possible pair of structural categories, indicating the amount of structural similarity among categories detected during rescue. Within the tiles, the total number of candidate-target pairs is shown, with the mean number of hits per candidate for each category pair in parentheses. FSM candidates only match reference targets, since they are only considered for automatic rescue. WTC11 PacBio lrRNA-seq data. FSM: Full-Splice-Match, ISM: Incomplete-Splice-Match, NIC: Novel-In-Catalog, NNC: Novel-Not-In-Catalog.

**Extended Data Fig. 7. Expression and functional properties of rescued transcripts.**
a, Distribution of expression values (TPM) of known transcripts detected as Full-Splice-Match or Incomplete-Splice-Match. b, TRIFID scores of known transcripts identified in each filtering and rescue scenario. Filtered transcripts (orange) did not pass the corresponding filter and were not eventually rescued. Transcripts filtered but recovered by introducing an isoform from the reference (dark blue) represent the rescue strategy’s fundamental purpose. In exceptional cases, transcript models not initially detected were included in the final transcriptome (yellow) via rescue.

Update of

SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms.
Pardo-Palacios FJ, Arzalluz-Luque A, Kondratova L, Salguero P, Mestre-Tomás J, Amorín R, Estevan-Morió E, Liu T, Nanni A, McIntyre L, Tseng E, Conesa A. bioRxiv [Preprint]. 2023 Jun 3:2023.05.17.541248. doi: 10.1101/2023.05.17.541248. Update in: Nat Methods. 2024 May;21(5):793-797. doi: 10.1038/s41592-024-02229-2. PMID: 37398077. Preprint.

Grants and funding

R21 HG011280/HG/NHGRI NIH HHS/United States
