Highly accurate assembly polishing with DeepPolisher
- PMID: 40389286
- PMCID: PMC12212083
- DOI: 10.1101/gr.280149.124
Abstract
Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
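The link between the reported QV gain and the quoted error reduction can be checked with a short calculation. The sketch below assumes the standard Phred-scale definition, QV = -10 * log10(error rate), which the abstract does not spell out; the function names are illustrative and are not part of DeepPolisher. It also treats the average QV gain as if it applied to a single assembly, which is only an approximation.

```python
def error_rate_from_qv(qv: float) -> float:
    """Per-base error rate implied by a Phred-scaled quality value (assumed definition)."""
    return 10 ** (-qv / 10)


def error_reduction_from_qv_gain(delta_qv: float) -> float:
    """Fraction of errors removed when QV rises by delta_qv.

    The ratio of new to old error rate is 10 ** (-delta_qv / 10),
    so the starting QV cancels out.
    """
    return 1 - 10 ** (-delta_qv / 10)


if __name__ == "__main__":
    gain = 3.4  # average predicted QV improvement quoted in the abstract
    print(f"{error_reduction_from_qv_gain(gain):.1%}")  # prints 54.3%
```

Under that assumption, a gain of 3.4 QV corresponds to removing about 54% of errors, which agrees with the figure in the abstract and with the statement that the pipeline reduces errors by approximately half.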
© 2025 Mastoras et al.; Published by Cold Spring Harbor Laboratory Press.
Update of
- Highly accurate assembly polishing with DeepPolisher. bioRxiv [Preprint]. 2024 Sep 19:2024.09.17.613505. doi: 10.1101/2024.09.17.613505. Update in: Genome Res. 2025 Jul 1;35(7):1595-1608. doi: 10.1101/gr.280149.124. PMID: 39345401.
Grants and funding
- U01 HG010961/HG/NHGRI NIH HHS/United States
- T32 HG012344/HG/NHGRI NIH HHS/United States
- U41 HG010972/HG/NHGRI NIH HHS/United States
- U01 HG013744/HG/NHGRI NIH HHS/United States
- OT2 OD026682/OD/NIH HHS/United States
- R01 HG011274/HG/NHGRI NIH HHS/United States
- U01 HG013755/HG/NHGRI NIH HHS/United States
- U24 HG010262/HG/NHGRI NIH HHS/United States
- U01 HG010971/HG/NHGRI NIH HHS/United States
- R01 HG010485/HG/NHGRI NIH HHS/United States
- U01 HG013760/HG/NHGRI NIH HHS/United States
- U24 HG011853/HG/NHGRI NIH HHS/United States
- U01 HG013748/HG/NHGRI NIH HHS/United States