Genome Res. 2025 Jul 1;35(7):1595-1608.

doi: 10.1101/gr.280149.124.

Highly accurate assembly polishing with DeepPolisher

Collaborators

and the Human Pangenome Reference Consortium:
Ahmad Abou Tayoun, Derek Albracht, Jamie Allen, Alawi A Alsheikh-Ali, Casey Andrews, Dmitry Antipov, Lucinda Antonacci-Fulton, Mobin Asri, Marcelo Ayllon, Jennifer R Balacco, Edward A Belter Jr, Halle D Bender, Andrew P Blair, Silvia Buonaiuto, Davide Bolognini, Katherine E Bonini, Christina Boucher, Guillaume Bourque, Shuo Cao, Andrew Carroll, Ann M Mc Cartney, Monika Cechova, Pi-Chuan Chang, Xian Chang, Jitender Cheema, Haoyu Cheng, Claudio Ciofi, Sarah Cody, Vincenza Colonna, Holland C Conwell, Robert Cook-Deegan, Mark Diekhans, Maria Angela Diroma, Daniel Doerr, Zheng Dong, Richard Durbin, Jana Ebler, Evan E Eichler, Jordan M Eizenga, Parsa Eskandar, Eddie Ferro, Anna-Sophie Fiston-Lavier, Sarah M Ford, Willard W Ford, Giulio Formenti, Adam Frankish, Mallory A Freeberg, Qichen Fu, Stephanie M Fullerton, Robert S Fulton, Yan Gao, Gage H Garcia, Obed A Garcia, Joshua M V Gardner, Shilpa Garg, Erik Garrison, Nanibaa' A Garrison, John Garza, Mohammadmersad Ghorbani, Tina Graves-Lindsay, Richard E Green, Cristian Groza, Andrea Guarracino, Melissa Gymrek, Leanne Haggerty, Ira M Hall, Nancy F Hansen, Mohammad Amiruddin Hashmi, Maximilian Haeussler, David Haussler, Prajna Hebbar, Peter Heringer, Glenn Hickey, Todd L Hillaker, S Nakib Hossain, Neng Huang, Sarah E Hunt, Toby Hunt, Nafiseh Jafarzadeh, Nivesh Jain, Erich D Jarvis, Juan Jiang, Jonathan LoTempio Jr, Eimear E Kenny, Juhyun Kim, Bonhwang Koo, Sergey Koren, Milinn Kremitzki, Ben Langmead, Xiaoyu Zhuo, Heather A Lawson, Daofeng Li, Heng Li, Wen-Wei Liao, Jiadong Lin, Tianjie Liu, Glennis A Logsdon, Ryan Lorig-Roach, Hailey Loucks, Jane E Loveland, Jianguo Lu, Shuangjia Lu, Julian K Lucas, Juan F Macias-Velasco, Maximillian G Marin, Franco L Marsico, Kateryna D Makova, Christopher Markovic, Tobias Marschall, Fergal J Martin, Mira Mastoras, Capucine Mayoud, Brandy McNulty, Jack A Medico, Julian M Menendez, Karen H Miga, Anna Minkina, Matthew W Mitchell, Saswat K Mohanty, Younes Mokrab, Jean Monlong, Shabir Moosa, 
Avelina Moreno-Ochando, Shinichi Morishita, Jonathan M Mudge, Katherine M Munson, Njagi Mwaniki, Nasna Nassir, Chiara Natali, Shloka Negi, Lingbin Ni, Adam M Novak, Chie Owa, Sadye Paez, Benedict Paten, Hiram Clawson, Clelia Peano, Adam M Phillippy, Brandon D Pickett, Laura Pignata, Nadia Pisanti, David Porubsky, Pjotr Prins, Anandi Radhakrishnan, Brian J Raney, Mikko Rautiainen, Alessandro Raveane, Luyao Ren, Arang Rhie, Farnaz Salehi, Samuel Sacco, Michael C Schatz, Laura B Scheinfeldt, Aarushi Sehgal, William E Seligmann, Mahsa Shabani, Kishwar Shafin, Shadi Shahatit, Ruhollah Shemirani, Vikram S Shivakumar, Swati Sinha, Jouni Sirén, Linnéa Smeds, Steven J Solar, Marco Sollitto, Nicole Soranzo, Andrew B Stergachis, Marie-Marthe Suner, Yoshihiko Suzuki, Arda Söylev, Jack A S Tierney, Chad Tomlinson, Francesca Floriana Tricomi, Mohammed Uddin, Matteo Tommaso Ungaro, Rahul Varki, Flavia Villani, Mitchell R Vollger, Brian P Walenz, Charles Wang, Lisa E Wang, Ting Wang, Aaron M Wenger, Conor V Whelan, Zilan Xin, Zheng Xu, Kai Ye, DongAhn Yoo, Wenjin Zhang, Ying Zhou, Ivo Violich, Giulia Zunino

Affiliations

¹ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95060, USA.
² Google Incorporated, Mountain View, California 94043, USA.
³ Google Incorporated, Mountain View, California 94043, USA; awcarroll@google.com bpaten@ucsc.edu shafin@google.com.
⁴ UC Santa Cruz Genomics Institute, University of California, Santa Cruz, California 95060, USA; awcarroll@google.com bpaten@ucsc.edu shafin@google.com.

PMID: 40389286
PMCID: PMC12212083
DOI: 10.1101/gr.280149.124

Mira Mastoras et al. Genome Res. 2025.

Abstract

Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
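The pairing of "QV improvement of 3.4" with "54% error reduction" follows from the standard Phred-scale definition, QV = -10 · log10(error rate). The short sketch below works through that arithmetic. The function names are illustrative and are not taken from the DeepPolisher code base.

```python
import math


def qv_from_error_rate(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate (0 < rate <= 1)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Inverse of qv_from_error_rate."""
    return 10.0 ** (-qv / 10.0)


def error_reduction_from_qv_gain(delta_qv: float) -> float:
    """Fraction of errors removed when QV rises by delta_qv.

    The ratio of new to old error rate is 10 ** (-delta_qv / 10),
    so the fraction removed is one minus that ratio.
    """
    return 1.0 - 10.0 ** (-delta_qv / 10.0)


if __name__ == "__main__":
    # A gain of 3.4 QV units removes a little over half of the errors.
    print(f"{error_reduction_from_qv_gain(3.4):.1%}")  # prints 54.3%
```

Because the scale is logarithmic, the fraction of errors removed depends only on the QV gain and not on the starting QV; a gain of about 3 QV units always corresponds to roughly halving the error rate.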

Figures

**Figure 1.**
DeepPolisher pipeline overview. The PHARAOH pipeline leverages phase block information from ONT UL reads to correct the haplotype assignment of PacBio HiFi reads. The corrected alignment is passed to DeepPolisher, which is an encoder-only transformer model that predicts the underlying assembly sequence and proposes corrections in VCF format.

**Figure 2.**
Comparison of DeepPolisher and alternate polishing methods against GIAB v4.2.1 benchmark for HG005. (A) For each polishing method, GIAB v4.2.1 variant-calling (assembly) errors are separated by indels (darker shade) and single-nucleotide variants (SNVs; lighter shade), with the number of errors per megabase to the *right* of each bar. (B) Total GIAB variant-calling (assembly) errors for different HiFi read coverages, with indel errors represented in pink circles and SNV errors in yellow triangles. (C) Total GIAB variant-calling (assembly) errors stratified by presence in tandem repeats (*left*), homopolymers >7 bp (*middle*), and segmental duplications (segdups; *right*), with SNV errors in lighter shades and indel errors in darker shades.

**Figure 3.**
k-mer-based comparison of DeepPolisher and alternate polishing approaches for HG005. (A) *Top* panels display QV scores for each polishing method. *Bottom* panels depict total error k-mers, divided by error k-mers induced by polishing (dark blue) and error k-mers unchanged after polishing (green). *Left* panels show results for the GIAB high-confidence regions; *right* panels, whole genome. (B) Switch (x-axis) and hamming (y-axis) error rates for each polishing method. (C) Comparison of DeepVariant and DeepPolisher for eight HPRC samples. *Left* and *middle* panels show Hap1 (x-axis) and Hap2 (y-axis) QV for eight HPRC samples, with an arrow connecting the unpolished QV (pink) to the QV after polishing with DeepVariant (blue) and DeepPolisher (yellow). The *left* panel is within the GIAB high-confidence regions; *middle* panel, whole genome. The *right* panel shows the number of polishing edits from DeepPolisher (yellow) and DeepVariant (blue). Lighter shades indicate edits not inducing error (FP) k-mers; darker shades show edits that induce error k-mers. (D) Number of error k-mers unchanged by polishing with DeepPolisher falling into sequence annotation categories.

**Figure 4.**
Polishing results for 180 HPRC assemblies. (A) Hap1 QV (x-axis) and Hap2 QV (y-axis) in the high-confidence regions for 180 HPRC samples from the second release. For each sample, unpolished QV is in blue with an arrow pointing to the polished QV. (B) The same as A but for whole-genome QV. (C) Switch (x-axis) and hamming (y-axis) error rate for the 107 samples with trio data. Unpolished values are in pink, with an arrow pointing to the polished values in yellow.
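
The switch and hamming error rates in Figures 3B and 4C measure phasing accuracy in two different ways. The sketch below uses a common simplified definition and is not the evaluation tooling used in the paper, which this page does not describe. It assumes each heterozygous site is encoded as `0` or `1` according to which true haplotype the assembled haplotype carries there; the encoding and function names are illustrative.

```python
def _check(truth: str, pred: str) -> None:
    # Both phasings must cover the same sites, and at least two of them.
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    if len(truth) < 2:
        raise ValueError("need at least two heterozygous sites")


def switch_error_rate(truth: str, pred: str) -> float:
    """Fraction of adjacent site pairs where the phase relationship flips."""
    _check(truth, pred)
    mismatch = [t != p for t, p in zip(truth, pred)]
    # A switch occurs wherever agreement with the truth changes between neighbours.
    switches = sum(a != b for a, b in zip(mismatch, mismatch[1:]))
    return switches / (len(mismatch) - 1)


def hamming_error_rate(truth: str, pred: str) -> float:
    """Fraction of sites on the wrong haplotype, under the better global orientation."""
    _check(truth, pred)
    wrong = sum(t != p for t, p in zip(truth, pred))
    n = len(truth)
    # Haplotype labels are arbitrary, so take the orientation with fewer mismatches.
    return min(wrong, n - wrong) / n


if __name__ == "__main__":
    truth = "0000000000"
    one_long_switch = "0000011111"   # a single switch that misphases half the block
    one_point_error = "0000100000"   # an isolated flip at one site
    print(switch_error_rate(truth, one_long_switch), hamming_error_rate(truth, one_long_switch))
    print(switch_error_rate(truth, one_point_error), hamming_error_rate(truth, one_point_error))
```

The two toy cases show why both metrics are reported: a single long-range switch gives a low switch error but a high hamming error, whereas an isolated flip counts as two switches but barely changes the hamming error.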

Update of

Highly accurate assembly polishing with DeepPolisher.
Mastoras M, Asri M, Brambrink L, Hebbar P, Kolesnikov A, Cook DE, Nattestad M, Lucas J, Won TS, Chang PC, Carroll A, Paten B, Shafin K. bioRxiv [Preprint]. 2024 Sep 19:2024.09.17.613505. doi: 10.1101/2024.09.17.613505. Update in: Genome Res. 2025 Jul 1;35(7):1595-1608. doi: 10.1101/gr.280149.124. PMID: 39345401.
