Gigascience. 2021 Jan 9;10(1):giaa153.
doi: 10.1093/gigascience/giaa153.

Significantly improving the quality of genome assemblies through curation

Kerstin Howe et al. Gigascience.

Abstract

Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved, goal of a multitude of research projects. Despite continual improvements in data generation, assembly algorithms, and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. While work continues towards improved datasets and fully automated pipelines, assembly evaluation and curation are actively used to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also give our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.

Keywords: assembly; curation; gEVAL; genome.

Figures

Figure 1:
Recommended workflow for curation activities during assembly generation.
Figure 2:
Examples of assembly error signatures in different data types. (A) Assembly issue identified in gEVAL in a bird genome (Taeniopygia guttata, VGP). Feature tracks (named on the right) are shown in the context of the assembly. A misjoin is visible in the middle of the example, indicated by the drop in Pacific Biosciences (PacBio) read coverage, discordance with the aligned (yellow indicates aligned, and beige, not aligned) Bionano maps, and the break in synteny. The alignments with intermediate assembly stages show that this error was introduced by the scaffolding step involving scaff10x. (B–E) Assembly issues identified in HiGlass Hi-C 2D maps of a human assembly (HG002, varying assembly approaches). Scaffold boundaries are delineated in gray. (B) The first of the 2 scaffolds depicted here shows a misjoin (black arrow) that needs to be broken. The second scaffold reveals no structural issues. (C) The first and third of the 3 scaffolds shown here need to be joined as indicated by the green arrows. (D) The single scaffold depicted here has a misjoin (black arrow) that needs to be broken and rejoined as indicated by the green arrows. (E) This single scaffold contains a duplication, half of which needs to be excised (e.g., black arrows) and the scaffold rejoined (green arrows). The choice of the excised half can be based on phasing.
Figure 3:
Comparison of the fbn2b region in the Danio rerio (zebrafish) reference assemblies Zv9 (top), GRCz10 (middle), and GRCz11 (bottom) in gEVAL. The fragmented fbn2b locus (colour coded in orange and red) was adjusted for GRCz10 (colour coded in orange) and further improved by removing whole-genome shotgun contigs in favour of finished clone sequence for GRCz11. The final correct gene locus is indicated in green.
Figure 4:
Changes to 111 assemblies from different clades through manual assembly curation by the Genome Reference Informatics Team at the Wellcome Sanger Institute. (A) Manual interventions (breaks, joins, removal of false duplications) as events per gigabase of assembly sequence. (B) Changes in scaffold N50 after curation. (C) Changes in scaffold counts after curation. The depicted assemblies were created with PacBio CLR, Chromium 10X, Bionano, and Hi-C data.
Figure 5:
Hi-C maps (pretext) showing the Asterias rubens (starfish) genome assembly (sequenced as part of the Sanger Institute's 25 Genomes for 25 Years project) before (A) and after (B) curation. The curation corrected the initial assembly by making 75 breaks and 216 joins and removed 1 stretch of erroneously duplicated sequence. A total of 97% of the assembly sequence could be assigned to 22 chromosomes. The curated assembly (B) contains 1 scaffold that is known to be associated with a second one (off-diagonal signal at bottom right), but its order and orientation are ambiguous. This scaffold has been submitted as “unlocalized” for the relevant chromosome.
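
The Figure 4 caption reports curation effects as manual interventions per gigabase of assembly sequence and as changes in scaffold N50. As a minimal sketch of how such summary metrics could be computed, the following Python uses the standard definition of N50; the function names and the toy scaffold lengths are assumptions for illustration and are not taken from the paper or from gEVAL.

```python
def scaffold_n50(lengths):
    """Return the N50 of a list of scaffold lengths (in bp).

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def events_per_gigabase(n_events, assembly_length_bp):
    """Normalise a count of curation events (breaks, joins, removals)
    by assembly length expressed in gigabases."""
    if assembly_length_bp <= 0:
        raise ValueError("assembly length must be positive")
    return n_events / (assembly_length_bp / 1e9)


# Toy, hypothetical scaffold lengths (not real data).
before = [50, 40, 30, 20, 10]
# Hypothetical curation step: join the 30 and 20 scaffolds into one.
after = [50, 50, 40, 10]

n50_before = scaffold_n50(before)
n50_after = scaffold_n50(after)
rate = events_per_gigabase(10, 2_000_000_000)  # 10 events in a 2 Gb assembly
```

In this toy case a single join leaves the total length unchanged while raising the N50 and lowering the scaffold count, which is the kind of shift panels B and C of Figure 4 summarise across assemblies.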
