Genome Res. 2025 Jul 1;35(7):1595-1608.
doi: 10.1101/gr.280149.124.

Highly accurate assembly polishing with DeepPolisher

Mira Mastoras et al. Genome Res.

Abstract

Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
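The relationship between the reported QV gain and the reported error reduction can be checked with a short calculation. The sketch below assumes the conventional Phred-scaled definition, QV = -10 * log10(per-base error rate); the abstract does not state the formula, and the function names are illustrative, not part of the DeepPolisher pipeline.

```python
import math


def qv_from_error_rate(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate (assumed definition)."""
    return -10.0 * math.log10(error_rate)


def error_reduction_from_qv_gain(qv_gain: float) -> float:
    """Fraction of errors removed when QV rises by qv_gain.

    A gain of d QV units multiplies the error rate by 10 ** (-d / 10),
    so the fraction removed is 1 minus that ratio.
    """
    return 1.0 - 10.0 ** (-qv_gain / 10.0)


if __name__ == "__main__":
    # QV improvement reported in the abstract
    print(f"{error_reduction_from_qv_gain(3.4):.1%}")  # prints 54.3%
```

Under this assumption, a QV gain of 3.4 corresponds to removing about 54% of errors, consistent with the figure in the abstract, and halving the error rate corresponds to a gain of roughly 3.0 QV.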
