[Preprint]. 2024 Sep 19:2024.09.17.613505.
doi: 10.1101/2024.09.17.613505.

Highly accurate assembly polishing with DeepPolisher

Mira Mastoras et al. bioRxiv.

Update in

  • Highly accurate assembly polishing with DeepPolisher.
    Mastoras M, Asri M, Brambrink L, Hebbar P, Kolesnikov A, Cook DE, Nattestad M, Lucas J, Won TS, Chang PC, Carroll A, Paten B, Shafin K; and the Human Pangenome Reference Consortium. Genome Res. 2025 Jul 1;35(7):1595-1608. doi: 10.1101/gr.280149.124. PMID: 40389286

Abstract

Accurate genome assemblies are essential for biological research, but even the highest quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and under-polishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using PacBio HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ultra-long ONT data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by half, with a greater than 70% reduction in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted Quality Value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
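
The link between a QV improvement and a percentage error reduction can be checked with a short calculation. The sketch below assumes the standard Phred-scale definition, QV = -10 · log10(error rate); the function name `error_reduction` is illustrative and does not come from the paper.

```python
def error_reduction(qv_gain: float) -> float:
    """Fraction of errors removed by a given QV gain.

    Assumes the Phred scale, QV = -10 * log10(error_rate), so a gain of
    qv_gain multiplies the error rate by 10 ** (-qv_gain / 10).
    """
    remaining_fraction = 10 ** (-qv_gain / 10)
    return 1.0 - remaining_fraction


# The abstract's average QV gain of 3.4 corresponds to roughly a 54% reduction.
print(f"{error_reduction(3.4):.1%}")
```

Under the same assumption, a QV gain of about 3 corresponds to halving the error rate, which matches the abstract's statement that the pipeline can reduce assembly errors by half.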

Conflict of interest statement

A.C., D.E.C., P.C., A.K., L.B., M.N., and K.S. are employees of Google LLC and own Alphabet stock as part of the standard compensation package.

Figures

Figure 1. DeepPolisher pipeline overview.
The PHARAOH pipeline leverages phase block information from ONT UL reads to correct the haplotype assignment of PacBio HiFi reads. The corrected alignment is passed to DeepPolisher, which is an encoder-only transformer model that predicts the underlying assembly sequence and proposes corrections in VCF format.
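
To illustrate what applying VCF-style corrections to an assembly sequence involves, here is a minimal sketch. It is not the DeepPolisher implementation or any existing tool: `apply_edits` is a hypothetical helper, edits are simplified to (position, REF, ALT) tuples with 1-based positions as in VCF, and edits are assumed not to overlap.

```python
def apply_edits(sequence: str, edits: list[tuple[int, str, str]]) -> str:
    """Apply non-overlapping (pos, ref, alt) edits to a sequence.

    pos is 1-based, as in VCF. Edits are applied from right to left so
    that an indel never shifts the coordinates of edits still pending.
    """
    polished = sequence
    for pos, ref, alt in sorted(edits, key=lambda e: e[0], reverse=True):
        start = pos - 1
        end = start + len(ref)
        # Refuse to edit if the draft does not contain the expected REF allele.
        if polished[start:end] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        polished = polished[:start] + alt + polished[end:]
    return polished


# One insertion (C -> CT at position 2) and one deletion (AC -> A at position 5).
print(apply_edits("ACGTACGT", [(2, "C", "CT"), (5, "AC", "A")]))
```

In practice an established consensus tool would be used on a real VCF and FASTA; the sketch only shows why the REF check and the right-to-left ordering matter.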
Figure 2. Comparison of DeepPolisher and alternate polishing methods against GIAB v4.2.1 benchmark for HG005.
A) For each polishing method, GIAB v4.2.1 variant calling (assembly) errors are separated by indels (darker shade) and single nucleotide variants (SNVs) (lighter shade), with the number of errors per megabase to the right of each bar. B) Total GIAB variant calling (assembly) errors for different HiFi read coverages, with indel errors represented in pink circles and SNV errors in yellow triangles. C) Total GIAB variant calling (assembly) errors stratified by presence in tandem repeats (left), homopolymers > 7 bp (middle) and segmental duplications (segdups) (right), with SNV errors in lighter shades and indel errors in darker shades.
Figure 3. K-mer based comparison of DeepPolisher and alternate polishing approaches for HG005.
A) Top panels display QV scores for each polishing method. Bottom panels depict total error k-mers, divided by error k-mers induced by polishing (dark blue) and error k-mers unchanged after polishing (green). Left panels show results for the GIAB confidence regions, right panels whole genome. B) Switch (x axis) and hamming (y axis) error rates for each polishing method. C) Comparison of DeepVariant and DeepPolisher for 8 HPRC samples. Left and middle panels show Hap1 (x axis) and Hap2 (y axis) QV for 8 HPRC samples, with an arrow connecting the unpolished QV (pink) to the QV after polishing with DeepVariant (blue) and DeepPolisher (yellow). Left panel is within the GIAB confidence regions, middle panel whole genome. Right panel shows number of polishing edits from DeepPolisher (yellow) and DeepVariant (blue). Lighter shades indicate edits not inducing error (FP) k-mers, darker shades show edits that induce error k-mers. D) Number of error k-mers unchanged by polishing with DeepPolisher falling into sequence annotation categories.
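
The caption relates QV scores to counts of error k-mers but does not state the formula. One common way to make that conversion is the Merqury-style estimate, sketched below as an assumption rather than as the authors' confirmed method; the function name `kmer_qv` and the example counts are illustrative.

```python
import math


def kmer_qv(error_kmers: int, total_kmers: int, k: int) -> float:
    """Estimate assembly QV from k-mer counts (Merqury-style estimate).

    error_kmers: assembly k-mers not supported by the read set.
    total_kmers: all k-mers in the assembly.
    A base is treated as correct when the k-mers covering it are all
    correct, giving per-base error = 1 - (1 - E / T) ** (1 / k).
    """
    if error_kmers == 0:
        return math.inf
    per_base_error = 1 - (1 - error_kmers / total_kmers) ** (1 / k)
    return -10 * math.log10(per_base_error)


# Illustrative counts only: fewer error k-mers give a higher QV.
print(round(kmer_qv(200, 10**9, 21), 1), round(kmer_qv(100, 10**9, 21), 1))
```

With this estimate, removing error k-mers raises QV, while edits that induce new error k-mers lower it, which is why the bottom panels separate the two.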
Figure 4. Polishing results for 180 HPRC assemblies.
A) Hap1 QV (x axis) and Hap2 QV (y axis) in the confidence regions for 180 HPRC samples from the second release. For each sample, unpolished QV is in blue with an arrow pointing to the polished QV. B) The same as A) but for whole genome QV. C) Switch (x axis) and hamming (y axis) error rate for the 107 samples with trio data. Unpolished in pink with an arrow pointing to polished in yellow.
