Chin Med. 2022 Aug 9;17(1):94.

doi: 10.1186/s13020-022-00644-1.

Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction

Peng Zeng^#¹, Zunzhe Tian^#², Yuwei Han^#², Weixiong Zhang¹, Tinggan Zhou², Yingmei Peng², Hao Hu^#³, Jing Cai^#⁴

Affiliations

¹ State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau, China.
² School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, China.
³ State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Macau, China. haohu@um.edu.mo.
⁴ School of Ecology and Environment, Northwestern Polytechnical University, Xi'an, China. jingcai@nwpu.edu.cn.

^# Contributed equally.

PMID: 35945546
PMCID: PMC9364492
DOI: 10.1186/s13020-022-00644-1

Peng Zeng et al. Chin Med. 2022.

Abstract

Background: Many medicinal plants have complex genomes with high ploidy, high heterozygosity, and abundant repetitive content, which pose severe challenges for genome sequencing. Long reads from Oxford Nanopore Technologies (ONT) sequencing or Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing offer great advantages in de novo genome assembly, especially for such complex genomes. To date, the genomes of multiple allotetraploid species have been sequenced with long reads. However, we found that a considerable proportion of these genomes (7.9% on average, maximum 23.7%) could not be covered by next-generation sequencing (NGS) reads. These uncovered regions (UCRs) are either questionable, low-quality sequences or genomic regions that NGS cannot sequence because of sequencing bias. The underlying causes of UCRs in genome assemblies, and solutions to this problem, have not previously been studied.

Methods: In this study, we sequenced the tetraploid genome of Veratrum dahuricum (Turcz.) O. Loes (VDL), a Chinese medicinal plant, on the ONT platform and assembled it with three strategies in parallel. To explore the cause of the UCRs, we compared the quality, coverage, and heterozygosity of the three ONT assemblies with those of a previously released assembly of the same individual built from PacBio circular consensus sequencing (CCS) reads.

Results: By mapping the NGS reads against the three ONT assemblies and the CCS assembly, we found that NGS reads covered 49.15% to 76.31% of the ONT assemblies, far less than their coverage of the CCS assembly (99.53%). Alignment between the ONT assemblies and the CCS assembly showed that most UCRs can be aligned to the CCS assembly. We therefore conclude that the UCRs in the ONT assemblies are low-quality sequences whose high error rate prevents short reads from aligning, rather than genomic regions that NGS cannot sequence. Further comparison among intermediate versions of the ONT assemblies showed that the most probable origin of these errors is a combination of initial sequencing errors in the long reads and artificial errors introduced by "self-correction". We also found that polishing the ONT assembly with CCS reads corrects these errors efficiently.
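
The coverage figures above can be illustrated with a minimal sketch. It assumes that per-base NGS depth for each contig has already been extracted from the read alignments; the function names `ucr_intervals` and `coverage_fraction`, the depth threshold, and the toy contigs are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Dict, List, Tuple


def ucr_intervals(depth: List[int], min_depth: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals where depth < min_depth."""
    intervals = []
    start = None
    for pos, d in enumerate(depth):
        if d < min_depth:
            if start is None:
                start = pos  # open a new uncovered run
        elif start is not None:
            intervals.append((start, pos))  # close the current run
            start = None
    if start is not None:
        intervals.append((start, len(depth)))  # run reaches the contig end
    return intervals


def coverage_fraction(depths: Dict[str, List[int]], min_depth: int = 1) -> float:
    """Fraction of assembly bases covered by at least min_depth reads."""
    total = sum(len(d) for d in depths.values())
    if total == 0:
        raise ValueError("assembly has no bases")
    uncovered = sum(
        end - start
        for d in depths.values()
        for start, end in ucr_intervals(d, min_depth)
    )
    return (total - uncovered) / total


# Toy depth vectors (invented for illustration only).
toy = {"ctgA": [5, 7, 0, 0, 3, 4, 0, 9], "ctgB": [0, 0, 6, 8]}
print(ucr_intervals(toy["ctgA"]))
print(round(coverage_fraction(toy), 4))
```

Under this definition, the UCR ratio of an assembly is simply one minus the covered fraction.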

Conclusions: By analyzing genome features and read alignments, we found that the high proportion of UCRs in the ONT assembly of VDL is caused by sequencing errors together with additional errors introduced by self-correction. The high error rate of raw ONT reads makes them unsuitable for self-correction prior to allotetraploid genome assembly, because self-correction introduces artificial errors into > 5% of the UCR sequences. We suggest polishing the assembly with high-precision CCS reads to correct these errors effectively in polyploid genomes.

Keywords: Allotetraploid; Homozygous variants; Low-quality sequences; ONT-based assembly; Veratrum dahuricum.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Allotetraploid inference of VDL. a Phylogenetic tree of *Veratrum* based on the chloroplast trnL–trnF gene spacer sequence. The trnL–trnF data set of 15 *Veratrum* plants was used to build a representative family-level tree. Nucleotide sequences were aligned with MUSCLE, and the best tree was generated by the command “raxml-ng --msa Veratrum.trnL–trnF.fa.muscle --msa-format FASTA --data-type DNA --all --model GTR+G --threads 1 --bs-trees 100 --redo”. The phylogenetic tree is consistent with the tree constructed by Pellicer, et al. [25], and VDL falls in the “2n = 4x = 32” clade, suggesting that it is tetraploid. b Dot-plot of VDL orthologs; collinearity analysis of the CCS-hifiasm assembly was conducted using the WGDI pipeline [28]. c The frequency density distribution of synonymous substitutions (Ks) between orthologs; the Ks peak was detected at 0.08

**Fig. 2**
Nucleic acid alignment between ONT assemblies and CCS-hifiasm assembly. Two ONT-based assemblies were mapped to chromosome-level CCS-hifiasm assembly using minimap2, and the approximate per-base sequence divergence of each block was extracted from alignments. Blocks were grouped according to size

**Fig. 3**
The UCR ratio and divergence between ONT assembly and CCS-hifiasm assembly. We cut the blocks (> 200 kb) between ONT-nextdenovo assembly and CCS-hifiasm assembly into 100 kb bins and calculated the divergence between ONT-nextdenovo assembly and ONT-hifiasm assembly in each bin, finding that the divergence was positively associated with UCR ratio (cor = 0.942, p-value < 2.2e–16)
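
As an illustration of the per-bin association reported in this caption, the following minimal sketch computes a correlation coefficient between bin-level UCR ratio and divergence. The caption does not name the coefficient, so the Pearson correlation is an assumption here; the function name `pearson_r` and the five bin values are invented for the example and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Invented per-bin values (one entry per 100 kb bin), for illustration only.
ucr_ratio = [0.05, 0.10, 0.30, 0.55, 0.80]
divergence = [0.002, 0.004, 0.010, 0.021, 0.029]
print(pearson_r(ucr_ratio, divergence))
```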

**Fig. 4**
Distributions of discordance rate of ONT and NGS reads. The discordance between ONT-raw reads and ONT-nextdenovo assembly was calculated in 100 bp bins; bins in which UCR made up > 90% of the length were regarded as high UCR. The mismatch and gap error rates are both distributed at higher values in high-UCR bins than in the whole genome. The blue cumulative line represents the cumulative distribution of NGS reads by mismatch rate: 92.2% of mapped NGS reads have a mismatch rate of ≤ 2%, and the average genome-wide mismatch rate is 0.68%
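
The binning rule in this caption can be sketched as follows. This is a minimal illustration, assuming that a base counts as uncovered when its NGS depth is zero; the function name `high_ucr_bins` and the toy depth vector are hypothetical.

```python
from typing import List


def high_ucr_bins(depth: List[int], bin_size: int = 100,
                  min_frac: float = 0.9) -> List[bool]:
    """Flag each bin whose uncovered fraction exceeds min_frac."""
    flags = []
    for start in range(0, len(depth), bin_size):
        window = depth[start:start + bin_size]  # last bin may be shorter
        uncovered = sum(1 for d in window if d == 0)
        flags.append(uncovered / len(window) > min_frac)
    return flags


# Three toy 100 bp bins: fully uncovered, half uncovered, 95% uncovered.
toy_depth = [0] * 100 + [0] * 50 + [3] * 50 + [0] * 95 + [1] * 5
print(high_ucr_bins(toy_depth))
```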

**Fig. 5**
An example of sequence reads mapped to ONT-nextdenovo assembly. A 20 kb region of ctg001275 of the ONT-nextdenovo assembly was used to show the mapping of CCS, CCS2short, NGS, ONT, ONT-1-correct, and ONT-1-correct2short reads. The mismatch rate of long reads is higher in areas that are not covered by short reads

**Fig. 6**
Complete discordances covered by long reads. Two complete discordances were detected using CCS reads. Correspondingly, in ONT reads and corrected reads, the discordance rates (coverage tracks) were close to 100% and 50%, respectively. This suggests that the genotypes of ONT raw reads were consistent with those of CCS reads, but that the error-correction process introduced errors, leaving nearly half of the corrected reads with genotypes different from the CCS reads. Multiple reads in the ONT alignments are secondary mappings (blank strips) whose primary alignments lie in other homologous regions, which may interfere with the error-correction process. The blue, red, green, and orange blocks represent “C”, “T”, “A”, and “G” genotypes, respectively. Gray and blank strips represent primary alignment and secondary alignment, respectively

**Fig. 7**
A pattern of ONT read self-correction. a For a diploid, the homozygous base ‘A’ and the heterozygous base ‘C/T’ were corrected to ‘A’ and ‘C’, respectively. Colored blocks stand for conserved regions. b For a tetraploid, read r3 of subgenome-A carried a “C->G” sequencing error, and read r11 of subgenome-B carried a “G->T” error. Because of the conserved regions (green and purple), reads r1–4 and r9–12 were clustered together for error correction, so that the site appears as two homozygous SNPs for the subgenomes when using CCS reads but as one heterozygous SNP when using corrected ONT reads
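
The tetraploid case in panel b can be mimicked with a deliberately simplified toy. It assumes a pooled majority vote at a single site as a stand-in for self-correction; real correctors, including the one used in the paper, are more elaborate, and the function name `majority_correct` and the single-site setup are hypothetical. The read names and the two sequencing errors follow the caption.

```python
from collections import Counter
from typing import List


def majority_correct(bases: List[str]) -> List[str]:
    """Replace every base in a pooled pile with the most common base."""
    consensus = Counter(bases).most_common(1)[0][0]
    return [consensus] * len(bases)


# Assumed truth at one site: subgenome-A reads carry 'C', subgenome-B reads 'G'.
truth = {"r1": "C", "r2": "C", "r3": "C", "r4": "C",
         "r9": "G", "r10": "G", "r11": "G", "r12": "G"}

observed = dict(truth)
observed["r3"] = "G"   # sequencing error C->G in a subgenome-A read
observed["r11"] = "T"  # sequencing error G->T in a subgenome-B read

# Conserved flanks cause reads from both subgenomes to be pooled together.
names = list(observed)
corrected = dict(zip(names, majority_correct([observed[n] for n in names])))

# Reads whose corrected base no longer matches their subgenome's true base.
wrong = [n for n in names if corrected[n] != truth[n]]
print(corrected["r1"], len(wrong) / len(names))
```

In this toy, the pooled vote overwrites the subgenome-A allele in every subgenome-A read, which is the kind of artificial error the caption attributes to clustering reads across homologous regions.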

References

1. Zapata L, Ding J, Willing E-M, Hartwig B, Bezdan D, Jiao W-B, et al. Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. Proc Natl Acad Sci USA. 2016;113:E4052–60. doi: 10.1073/pnas.1607532113.
2. Redwan RM, Saidin A, Kumar SV. The draft genome of MD-2 pineapple using hybrid error correction of long reads. DNA Res. 2016;23:427–39. doi: 10.1093/dnares/dsw026.
3. Yang N, Liu J, Gao Q, Gui S, Chen L, Yang L, et al. Genome assembly of a tropical maize inbred line provides insights into structural variation and crop improvement. Nat Genet. 2019;51:1052–9. doi: 10.1038/s41588-019-0427-6.
4. Lv H, Wang Y, Han F, Ji J, Fang Z, Zhuang M, et al. A high-quality reference genome for cabbage obtained with SMRT reveals novel genomic features and evolutionary characteristics. Sci Rep. 2020;10:12394. doi: 10.1038/s41598-020-69389-x.
5. Deschamps S, Zhang Y, Llaca V, Ye L, Sanyal A, King M, et al. A chromosome-scale assembly of the sorghum genome using nanopore sequencing and optical mapping. Nat Commun. 2018;9:4844. doi: 10.1038/s41467-018-07271-1.
