Brief Bioinform. 2021 Nov 5;22(6):bbab170.
doi: 10.1093/bib/bbab170.

ARAMIS: From systematic errors of NGS long reads to accurate assemblies

E Sacristán-Horcajada et al. Brief Bioinform. 2021.

Abstract

NGS long-read (third-generation) sequencing technologies such as Pacific Biosciences (PacBio) have revolutionized the sequencing field over the last decade, improving multiple genomic applications such as de novo genome assembly. However, their error rate, which mostly involves insertions and deletions (indels), remains an important concern. Multiple algorithms are available to fix these sequencing errors using short reads (such as Illumina), although they require long processing times and some errors may persist. Here, we present Accurate long-Reads Assembly correction Method for Indel errorS (ARAMIS), the first NGS long-read indel correction pipeline that combines several correction tools in a single step using accurate short reads. As a proof of concept, six organisms were selected based on their different GC content, size and genome complexity, and their PacBio-assembled genomes were corrected thoroughly by this pipeline. We found that systematic sequencing errors in PacBio long reads affect homopolymeric regions, and that the type of indel error introduced during PacBio sequencing is related to the GC content of the organism. Because this is not widely known, numerous published studies contain such errors, which should be resolved since they may convey incorrect biological information. ARAMIS yields better results with fewer computational resources than other correction tools and makes it possible to determine the nature of the indel errors found and their distribution along the genome. The source code of ARAMIS is available at https://github.com/genomics-ngsCBMSO/ARAMIS.git.
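
The two sequence properties the abstract relates to indel errors, GC content and homopolymeric regions, can be illustrated with a minimal Python sketch. It is not taken from the ARAMIS source code: the function names `gc_content` and `homopolymer_runs` and the minimum run length of 3 are assumptions chosen for illustration only.

```python
import re

def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of seq (0.0 if none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def homopolymer_runs(seq: str, min_len: int = 3):
    """Return (start, base, length) for every single-base run of at least min_len.

    min_len=3 is an illustrative assumption, not a value from the paper.
    """
    runs = []
    for match in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        length = match.end() - match.start()
        if length >= min_len:
            runs.append((match.start(), match.group()[0], length))
    return runs

example = "ACGGGGTTTAAC"
print(gc_content(example))
print(homopolymer_runs(example))
```

In a real assembly the same functions would be applied per contig, so that indel positions can be compared against the homopolymer runs they fall in.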

Keywords: error correction; genome assembly; homopolymer; long read; next-generation sequencing.

Figures

Figure 1
Schematic overview of the workflow leading to the correction of a PacBio-only genome assembly. Input files (Raw Reads) are represented as yellow rhomboids. Software tools are shown in blue boxes. Tools implemented in the Correction_step of ARAMIS are shown with red borders. Dark-blue borders mark the tools used by the Statistical_step of ARAMIS. Common indels for which the pipeline cannot determine the correct sequence are flagged as warnings (orange boxes) for subsequent manual curation (−w option). This additional step, outside the main pipeline, is represented with a dashed line. Output files are represented in green boxes. Discarded data are shown in red boxes.
Figure 2
Indel distribution across the T. thermophilus chromosome based on the indel fraction calculated for both sequencing technologies. Blue and red dots show the indel fraction and position according to the PacBio and Illumina read alignments, respectively. The indel fraction threshold used is indicated by a horizontal line.
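
The thresholding shown in this figure can be sketched as follows. The caption does not define the indel fraction, so the sketch assumes it is the number of aligned reads supporting an indel at a position divided by the read depth there; the input layout, the function names and the threshold of 0.5 are likewise hypothetical and are not taken from ARAMIS.

```python
def indel_fraction(indel_reads: int, depth: int) -> float:
    """Assumed definition: reads supporting an indel divided by read depth."""
    return indel_reads / depth if depth > 0 else 0.0

def flag_positions(pileup: dict, threshold: float = 0.5) -> list:
    """Return sorted positions whose indel fraction reaches the threshold.

    pileup maps position -> (indel_reads, depth); threshold=0.5 is an
    illustrative value, not the one used in the paper.
    """
    return sorted(
        pos
        for pos, (indel_reads, depth) in pileup.items()
        if indel_fraction(indel_reads, depth) >= threshold
    )

# Hypothetical per-position counts from one alignment (e.g. Illumina reads).
pileup = {100: (8, 10), 250: (1, 20), 400: (0, 0), 900: (5, 10)}
print(flag_positions(pileup))
```

Running the same function on the counts from each technology would give the two point sets that the figure compares against a single horizontal threshold.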
Figure 3
KDE plot of indels detected based on the genome GC content. P. falciparum, E. coli, L. infantum, T. thermophilus, M. hassiacum and Tessaracoccus sp. are represented in red, yellow, green, light-blue, dark-blue and pink curves, respectively.
Figure 4
Venn diagram of indels detected by PacBio-utilities and Pilon software. PacBio-utilities indels flagged as Good (green), PacBio-utilities indels flagged as Bad (red), and indels detected by Pilon (blue) are shown. Panels A–F show the indels detected in the six organisms studied.
Figure 5
Variant distribution in homopolymers based on length and nucleotide. Red, green, blue and purple bars correspond to A, C, G and T affected homopolymers, respectively. Panels A–F show the type of homopolymers affected in each organism.
Figure 6
Frequency distribution of PacBio subread lengths in the six organisms studied. The length distribution of the datasets is shown as a frequency plot (Panels A–F) and as a violin plot (Panel G).
Figure 7
Processing time (minutes) of the correction process with each correction software in relation to genome size (normalized on a logarithmic scale). Red, orange and blue bars represent ARAMIS, RACON and LoRMA performance, respectively.
Figure 8
Maximum RAM usage (GB) of each correction software in relation to genome size (normalized on a logarithmic scale). Red, orange and blue bars represent ARAMIS, RACON and LoRMA performance, respectively.
