Chin Med. 2022 Aug 9;17(1):94.
doi: 10.1186/s13020-022-00644-1.

Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction

Peng Zeng et al. Chin Med. 2022.

Abstract

Background: Many medicinal plants have complex genomes with high ploidy, heterozygosity, and repetitive content, which pose severe challenges for genome sequencing. Long reads from Oxford Nanopore sequencing technology (ONT) or Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing offer great advantages in de novo genome assembly, especially for complex genomes with high heterozygosity and repetitive content. The genomes of multiple allotetraploid species have now been sequenced with long reads. However, we found that a considerable proportion of these genomes (7.9% on average, maximum 23.7%) could not be covered by next-generation sequencing (NGS) reads. These regions uncovered by NGS reads (UCR) are either questionable, low-quality sequences or genomic regions that NGS cannot sequence because of sequencing bias. The underlying causes of UCR in genome assemblies, and solutions to the problem, have not previously been studied.

Methods: In this study, we sequenced the tetraploid genome of Veratrum dahuricum (Turcz.) O. Loes (VDL), a Chinese medicinal plant, on the ONT platform and assembled the genome with three strategies in parallel. To explore the cause of the UCR, we compared the quality, coverage, and heterozygosity of the three ONT assemblies with a previously released assembly of the same individual built from PacBio circular consensus sequencing (CCS) reads.

Results: By mapping the NGS reads against the three ONT assemblies and the CCS assembly, we found that the coverage of the ONT assemblies by NGS reads ranged from 49.15 to 76.31%, much lower than that of the CCS assembly (99.53%). Alignment between the ONT assemblies and the CCS assembly showed that most UCR can be aligned with the CCS assembly. We therefore conclude that the UCR in the ONT assemblies are low-quality sequences whose error rate is too high for short reads to align, rather than genomic regions that cannot be sequenced by NGS. Further comparison among the intermediate versions of the ONT assemblies showed that the most probable origin of those errors is a combination of artificial errors introduced by "self-correction" and initial sequencing errors in the long reads. We also found that polishing the ONT assembly with CCS reads corrects those errors efficiently.
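The coverage figures above can be illustrated with a minimal sketch that derives uncovered regions and a coverage percentage from per-base NGS read depth. It is not the authors' pipeline: the function names, the `min_depth` threshold, and the toy depth values are assumptions made for illustration, and in practice the depth vector would come from a depth-reporting tool run on the NGS alignments.

```python
from typing import Iterable, List, Tuple


def uncovered_regions(depths: Iterable[int], min_depth: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) intervals whose depth is below min_depth.

    `min_depth` is an assumed threshold; the abstract does not state one.
    """
    regions: List[Tuple[int, int]] = []
    start = None
    pos = -1
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # open a new uncovered interval
        elif start is not None:
            regions.append((start, pos))  # close the interval at a covered base
            start = None
    if start is not None:
        regions.append((start, pos + 1))  # interval runs to the end of the sequence
    return regions


def coverage_percent(depths: Iterable[int], min_depth: int = 1) -> float:
    """Percentage of positions covered by at least min_depth reads."""
    depth_list = list(depths)
    if not depth_list:
        return 0.0
    uncovered = sum(end - start for start, end in uncovered_regions(depth_list, min_depth))
    return 100.0 * (len(depth_list) - uncovered) / len(depth_list)


# Toy per-base depth for a 10 bp contig (made-up values, not data from the study).
toy_depth = [12, 9, 0, 0, 0, 7, 8, 0, 5, 6]
ucr = uncovered_regions(toy_depth)
cov = coverage_percent(toy_depth)
print(ucr, cov)
```

Reporting the uncovered intervals alongside the percentage makes it possible to check later whether each interval aligns to another assembly, which is the comparison the Results describe.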

Conclusions: By analyzing genome features and read alignments, we found that the high proportion of UCR in the ONT assembly of VDL is caused by sequencing errors and by additional errors introduced by self-correction. The high error rate of raw ONT reads makes them unsuitable for self-correction prior to allotetraploid genome assembly, as self-correction introduces artificial errors into > 5% of the UCR sequences. We suggest that high-precision CCS reads be used to polish the assembly, which corrects those errors effectively for polyploid genomes.

Keywords: Allotetraploid; Homozygous variants; Low-quality sequences; ONT-based assembly; Veratrum dahuricum.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Allotetraploid inference of VDL. a Phylogenetic tree of Veratrum based on the chloroplast trnL–trnF gene spacer sequence. The trnL–trnF data set of 15 Veratrum plants was used to build a representative family-level tree. Nucleotide sequences were aligned using MUSCLE, and the best tree was generated by the command “raxml-ng --msa Veratrum.trnL–trnF.fa.muscle --msa-format FASTA --data-type DNA --all --model GTR+G --threads 1 --bs-trees 100 --redo”. The phylogenetic tree is consistent with the tree constructed by Pellicer, et al. [25], and VDL is located in the “2n = 4x = 32” clade, suggesting that it is tetraploid. b Dot-plot of VDL orthologs; collinearity analysis of the CCS-hifiasm assembly was conducted using the WGDI pipeline [28]. c Frequency density distribution of synonymous substitutions (Ks) of orthologs; the Ks peak was detected at 0.08
Fig. 2
Nucleic acid alignment between ONT assemblies and the CCS-hifiasm assembly. Two ONT-based assemblies were mapped to the chromosome-level CCS-hifiasm assembly using minimap2, and the approximate per-base sequence divergence of each block was extracted from the alignments. Blocks were grouped according to size
Fig. 3
The UCR ratio and divergence between ONT assembly and CCS-hifiasm assembly. We cut the blocks (> 200 kb) between the ONT-nextdenovo assembly and the CCS-hifiasm assembly into 100 kb bins and calculated the divergence between ONT-nextdenovo assembly and ONT-hifiasm assembly in each bin, finding that the divergence was positively associated with the UCR ratio (cor = 0.942, p-value < 2.2e–16)
Fig. 4
Distributions of discordance rate of ONT and NGS reads. The discordance between ONT-raw reads and the ONT-nextdenovo assembly was calculated in 100 bp bins; bins with UCR length > 90% were regarded as high UCR. Both the mismatch and the gap error distributions are higher in high-UCR bins than in the whole genome. The blue line represents the cumulative distribution of NGS reads by mismatch rate: 92.2% of mapped NGS reads have a mismatch rate of ≤ 2%, and the average genome-wide mismatch rate is 0.68%
Fig. 5
An example of sequence reads mapped to the ONT-nextdenovo assembly. A 20 kb region of ctg001275 of the ONT-nextdenovo assembly was used to show the mapping of CCS, CCS2short, NGS, ONT, ONT-1-correct, and ONT-1-correct2short reads. The mismatch rate of long reads is higher in regions that are not covered by short reads
Fig. 6
Complete discordances covered by long reads. Two complete discordances were detected using CCS reads. Correspondingly, the discordance rates (coverage tracks) in ONT reads and corrected reads were close to 100% and 50%, respectively. This suggests that the genotypes of ONT raw reads were consistent with those of CCS reads, but that the error-correction process introduced errors, leaving nearly half of the corrected reads with genotypes that differ from the CCS reads. Multiple reads in the ONT read alignments are secondary alignments (blank strips) whose primary alignments were in other homologous regions, which may interfere with the error-correction process. The blue, red, green, and orange blocks represent “C”, “T”, “A”, and “G” genotypes, respectively. Gray and blank strips represent primary and secondary alignments, respectively
Fig. 7
A pattern of ONT reads self-correction. a For a diploid, the homozygous base ‘A’ and the heterozygous base ‘C/T’ were corrected to ‘A’ and ‘C’, respectively. Colored blocks stand for conserved regions. b For a tetraploid, read r3 of subgenome-A was sequenced “C->G” in error, and read r11 of subgenome-B was sequenced “G->T” in error. Because of the conserved regions (green and purple), reads r1–4 and r9–12 were clustered together for error correction, resulting in two homozygous SNPs for the subgenomes using CCS reads and one heterozygous SNP using corrected ONT reads


