From contigs towards chromosomes: automatic improvement of long read assemblies (ILRA)

Affiliations

¹ Instituto de Parasitología y Biomedicina López-Neyra (IPBLN), Consejo Superior de Investigaciones Científicas, 18016, Granada, Spain.
² Department for Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.
³ Centro Internacional de Entrenamiento e Investigaciones Médicas (CIDEIM), Cali, Colombia.
⁴ School of Infection & Immunity, MVLS, University of Glasgow, Glasgow, UK.
⁵ Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, 4123 Allschwil, Switzerland.
⁶ University of Basel, 4001 Basel, Switzerland.
⁷ Departamento de Microbiología, Facultad de Salud, Universidad del Valle, Cali, Colombia.
⁸ KEMRI-Wellcome Trust Research Programme, CGMRC, Kilifi, Kenya.

PMID: 37406192
PMCID: PMC10359078
DOI: 10.1093/bib/bbad248

José Luis Ruiz et al. Brief Bioinform. 2023 Jul 20;24(4):bbad248.

Abstract

Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes and they may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are used subsequently to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available at GitHub: https://github.com/ThomasDOtto/ILRA.

Keywords: de novo assembly; automatic finishing; bioinformatics; genome polishing; next-generation sequencing; pipeline.

Figures

**Figure 1**
Example of a frameshift error in one of the gene models in our long read genome assemblies of *P. falciparum* due to the presence of a homopolymer tract. Artemis visualization of a PacBio genome assembly (bottom panel) and the aligned Illumina short reads (top panel, horizontal blue bars). Reads mapping to the forward strand are on top, and those mapping to the reverse strand are below. Sequencing errors in the Illumina short reads are marked with vertical light red lines. A homopolymer tract of 17 A’s is highlighted in yellow. The quality of the reads drops after the homopolymer: reads on the forward strand show only a few sequencing errors before the tract, but a high error rate after it. Because this tract is not sequenced correctly, it generates a frameshift and therefore causes a gene model to be wrongly annotated as a pseudogene. In the bottom panel, the two light blue boxes represent exons that are split in two because of the indel. *Ab initio* gene finders could try to build an intron here (losing exon sequence) or to generate a pseudogene. In the zoom-in visualization (right), the dark red vertical lines in the aligned Illumina short reads point to bases that are missing from the short repetition in the assembly, which shortens the homopolymer tract and causes the frameshift.
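
The effect described in this caption can be illustrated with a minimal Python sketch. It is not part of ILRA: the function names, the `min_len` threshold, and the toy sequences are all hypothetical, chosen only to show how one base missing from a homopolymer tract shifts the reading frame and introduces a premature stop codon.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_homopolymers(seq, min_len=8):
    """Return (start, end, base) for each run of a single base >= min_len long."""
    # One captured base followed by at least (min_len - 1) repeats of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]

def codons(seq):
    """Split seq into complete triplets from position 0, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def first_stop(seq):
    """Index of the first in-frame stop codon, or None if there is none."""
    for i, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return i
    return None

# Toy data (assumption): the true gene carries 18 A's, the assembly only 17.
true_gene = "ATGGCT" + "A" * 18 + "GTGACCGGTTAA"
assembly = "ATGGCT" + "A" * 17 + "GTGACCGGTTAA"

print(find_homopolymers(assembly))  # location of the poly-A tract
print(first_stop(true_gene))        # stop codon at the end of the gene
print(first_stop(assembly))         # earlier stop created by the frameshift
```

In this toy case the missing A turns the downstream codons `GTG ACC GGT TAA` into `AAG TGA ...`, so an annotation tool would see a truncated reading frame and might call a pseudogene.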

**Figure 2**
Differential frameshift correction by Pilon and iCORN2. ACT visualization of a section of the Pf3D7 reference genome sequence, the corresponding section of an uncorrected *P. falciparum* 3D7 PacBio genome assembly, and the Pilon-corrected and iCORN2-corrected sequences. Syntenic regions (BLAST) are indicated by gray bars between the reference and the uncorrected assembly. Annotated genes in the reference are colored. Black vertical lines mark the absence of open reading frames (ORFs). Red squares mark the frameshifts within ORFs in the uncorrected genome sequences. Pilon and iCORN2 process these differently, with multiple iterations of iCORN2 correcting more frameshifts than a single Pilon run. Green squares mark the correct and successively corrected ORFs, which, based on the reference, could be annotated as correct gene models rather than as an excess of incorrectly annotated pseudogenes.

**Figure 3**
Whole Genome Amplification (WGA) errors in the PfKE07 assembly. (A) Schematic of a WGA error. DNA is amplified (i), but the polymerase then switches strands and synthesizes the reverse strand (ii). This produces a chimeric read that leads to mis-assemblies. (B) The resulting assembly errors, as seen in an ACT view. The top part of the reference genome (gray arrow) is duplicated in the WGA-amplified genome. The assembly errors generally occur at contig ends, so gaps are generated. Syntenic regions relative to the reference genome (BLAST similarity hits) are indicated in gray. Mis-assemblies (inverted similarity hits) are indicated in black.
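
As a rough illustration of panel (A), the Python sketch below builds a toy chimeric read under a simplifying assumption: after the strand switch, the polymerase copies back exactly the stretch it has just synthesized. The function names and the template sequence are hypothetical, and real WGA chimeras need not be this symmetric.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def make_chimera(template, switch_at):
    """Copy template[:switch_at], then continue on the opposite strand.

    Simplifying assumption: the polymerase copies back over the whole
    stretch it has just synthesized.
    """
    forward = template[:switch_at]
    return forward + reverse_complement(forward)

def is_inverted_duplication(read):
    """True if the read equals its own reverse complement."""
    return read == reverse_complement(read)

template = "ACGGTCATTG"              # toy template (assumption)
chimera = make_chimera(template, 6)  # strand switch after 6 bases

print(chimera)
print(is_inverted_duplication(chimera))
print(is_inverted_duplication(template))
```

A read built this way aligns to the reference once in each orientation, which matches the inverted similarity hits shown in black in panel (B).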
