Nat Methods. 2022 Jun;19(6):687-695.

doi: 10.1038/s41592-022-01440-3. Epub 2022 Mar 31.

Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies

Ann M Mc Cartney^#¹, Kishwar Shafin^#², Michael Alonge^#³, Andrey V Bzikadze⁴, Giulio Formenti⁵, Arkarachai Fungtammasan⁶, Kerstin Howe⁷, Chirag Jain¹,⁸, Sergey Koren¹, Glennis A Logsdon⁹, Karen H Miga²,¹⁰, Alla Mikheenko¹¹, Benedict Paten², Alaina Shumate¹², Daniela C Soto¹³, Ivan Sović¹⁴,¹⁵, Jonathan M D Wood⁷, Justin M Zook¹⁶, Adam M Phillippy¹⁷, Arang Rhie¹⁸

Affiliations

¹ Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA.
² UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.
³ Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
⁴ Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA.
⁵ Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA.
⁶ DNAnexus, Mountain View, CA, USA.
⁷ Wellcome Sanger Institute, Cambridge, UK.
⁸ Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India.
⁹ Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.
¹⁰ Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA.
¹¹ Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia.
¹² Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.
¹³ Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA.
¹⁴ Pacific Biosciences, Menlo Park, CA, USA.
¹⁵ Digital BioLogic d.o.o., Ivanić-Grad, Croatia.
¹⁶ Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA.
¹⁷ Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA. adam.phillippy@nih.gov.
¹⁸ Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA. arang.rhie@nih.gov.

^# Contributed equally.

PMID: 35361931
PMCID: PMC9812399
DOI: 10.1038/s41592-022-01440-3

Ann M Mc Cartney et al. Nat Methods. 2022 Jun.

Abstract

Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
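The quality values quoted in the abstract are Phred-scaled, so each value corresponds to a per-base error rate via QV = -10 log10(error rate). The sketch below is a minimal illustration rather than the authors' pipeline: it converts between the two scales and estimates a QV from k-mer counts under the common assumption that errors are independent, so that a k-mer is error-free with probability (1 - e)^k. The function names and the notion of "missing" k-mers as those present in the assembly but absent from the reads are assumptions made for this example.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def kmer_qv(missing_kmers: int, total_kmers: int, k: int) -> float:
    """Estimate a consensus QV from k-mer counts (hypothetical helper).

    missing_kmers: assembly k-mers never seen in the read set.
    total_kmers:   all k-mers in the assembly.
    Assumes independent errors, so P(k-mer error-free) = (1 - e)^k.
    """
    if not 0 <= missing_kmers <= total_kmers or total_kmers == 0:
        raise ValueError("need 0 <= missing_kmers <= total_kmers and total_kmers > 0")
    if missing_kmers == 0:
        return math.inf  # no evidence of error at this k
    if missing_kmers == total_kmers:
        return 0.0  # every k-mer is unsupported
    # e = 1 - (1 - f)^(1/k), with f the missing fraction.
    # log1p/expm1 avoid losing precision when f is very small.
    missing_fraction = missing_kmers / total_kmers
    error_rate = -math.expm1(math.log1p(-missing_fraction) / k)
    return error_rate_to_qv(error_rate)


# Translate the two QVs reported in the abstract into error rates.
for qv in (70.2, 73.9):
    print(f"QV {qv} -> about {qv_to_error_rate(qv):.2e} errors per base")
```

Because the scale is logarithmic, a gain of a few QV points at this level corresponds to a large relative drop in the estimated error rate.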

Figures

**Extended Data Fig. 1 |. Sequencing biases observed in missing k-mers.**
**a**, Missing k-mers with their GA composition. **b-d**, v0.9 assembly and k-mer copy number spectrum from HiFi, Illumina, and hybrid k-mer sets (left) and per-chromosome missing (likely error) k-mer counts from the HiFi-derived consensus or patches (right). Most missing k-mers in HiFi overlapped sequences from patched regions. No missing k-mers were found on chromosomes indicated with red arrows.

**Extended Data Fig. 2 |. Error detection and polishing pipeline.**
A detailed overview of the polishing pipeline along with the number of errors identified and polished at each step. The data types and polishing tools used are also highlighted. Illumina, 100x PCR-free library Illumina reads; HiFi, 35x PacBio HiFi reads; ONT, 120x Oxford Nanopore reads.

**Extended Data Fig. 3 |. Number of SV-like errors and globally unique single copy k-mers used for marker assisted alignment.**
a. Number of SV-like errors called from long-read platforms. b. Range of k-mer counts defined as ‘single-copy’ markers from Illumina reads and in the assembly. The cutoffs were chosen to minimize inclusion of low-frequency erroneous k-mers and 2-copy k-mers. c. Number of markers in every 10 kb window. d. Cumulative number of bases covered by the number of markers in each 10 kb window.
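The marker idea in this caption can be sketched in a few lines. The example below is an illustrative simplification, not the authors' implementation: it treats a k-mer as a single-copy marker when it occurs exactly once in the assembly and its read count falls inside a chosen range, then counts marker start positions per window. The function names, the "exactly once in the assembly" rule, the toy cutoffs, and the omission of reverse-complement (canonical) k-mers are all assumptions made for this sketch.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer in a sequence (forward strand only, for simplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def single_copy_markers(assembly_counts, read_counts, lo: int, hi: int) -> set:
    """Pick k-mers seen once in the assembly with read counts in [lo, hi].

    lo filters low-frequency (likely erroneous) k-mers; hi filters k-mers
    whose read depth suggests two or more copies.
    """
    return {
        kmer
        for kmer, n in assembly_counts.items()
        if n == 1 and lo <= read_counts.get(kmer, 0) <= hi
    }


def markers_per_window(seq: str, k: int, markers: set, window: int = 10_000) -> list:
    """Count marker start positions in consecutive non-overlapping windows."""
    n_windows = max(1, -(-len(seq) // window))  # ceiling division
    counts = [0] * n_windows
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in markers:
            counts[i // window] += 1
    return counts


# Toy data: a 13 bp "assembly" and made-up read k-mer counts.
assembly = "ACGTACGTTTGCA"
reads = {"CGTA": 30, "GTAC": 31, "TACG": 2, "CGTT": 65, "ACGT": 60, "TTGC": 29}
markers = single_copy_markers(count_kmers(assembly, 4), reads, lo=20, hi=40)
print(sorted(markers))
print(markers_per_window(assembly, 4, markers, window=5))
```

Windows with few or no markers are the ones where read placement is least constrained, which is why the per-window marker count is informative.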

**Extended Data Fig. 4 |. Post-polishing evaluation.**
a. Left, genotype quality and number of reads supporting the reference and alternate alleles from the combined Illumina-HiFi hybrid and ONT homozygous variant calls, with AF > 0.5. Right, balanced insertion (red) and deletion (blue) length distribution from the Illumina-HiFi hybrid DeepVariant heterozygous calls in CHM13v1.0. b. Number of errors detected in each chromosome, before and after polishing. c. Polishing inside and outside of repeats: the distribution of CHM13v0.9 polishing rates inside and outside repeats.

**Extended Data Fig. 5 |. Three SV-like errors corrected.**
HiFi and ONT marker-assisted alignments after correction of the 3 large SV-like edits, visualized with IGV. The HiFi coverage track is shown with a data range up to 60, ONT up to 150. Clipped reads are flagged for clips >100 bp. INDELs smaller than 10 bp are not shown. Reads are colored by strand: positive in red and negative in blue.

**Extended Data Fig. 6 |. Telomere polishing.**
a. An illustration of Chr. 2 telomere sequence reads from the HiFi, ONT and CLR platforms. b. Distribution of the maximum perfect match to the canonical k-mer observed at each position in the telomere before (CHM13v1.0) and after (CHM13v1.1) polishing the telomeres.

**Extended Data Fig. 7 |. Mapping biases found and corrected.**
On simulated HiFi reads, we found excessive clipping in highly identical satellite repeats in Minimap and Winnowmap at the time of evaluation. We have addressed this issue in Winnowmap 2.01+. Clipped (%) indicates the percentage of reads clipped in every 1024 bp window, shown in a 0–40% range with a midline of 10%.

**Extended Data Fig. 8 |. HiFi, CLR, ONT read coverage, alignment identity, and read length from Winnowmap2 v2.01 alignments and Bionano DLE-1 molecule coverage from Bionano Solve.**
One panel shows a zoomed-in region of Chromosome 9, while the other shows the whole-genome alignment view. HiFi, CLR, ONT, and Bionano coverage are shown up to 70x, 70x, 200x, and 250x, respectively. Median read identity in every 1024 bp is shown in an 80–100% range. Median read length in every 1024 bp is shown in a 0–100 kb range. Read identity was the worst in CLR, and between HiFi and ONT. Bionano molecules were lacking coverage in most of the centromeric repeats.

**Extended Data Fig. 9 |. Collapsed simple tandem repeat.**
The collapse in the intronic sequence of gene *FAM227A* went undetected because of the variable insertion breakpoints and insertion lengths in the HiFi and ONT alignments. The panels above the alignments show marker density and percent microsatellites (GA / AT / TC / GC) in each 64 bp window, indicating that this region is highly repetitive, with GA-enriched sequences that later alternate with AT-enriched sequences.

**Extended Data Fig. 10 |. Chimeric junction of two haplotypes.**
In the regions shown above, both HiFi and ONT reads indicate that the consensus has a chimeric junction of the two haplotypes.

**Figure 1 |. An overview of the evaluation and polishing strategy developed to achieve a complete human genome assembly.**
a, The evaluation strategies used to assess genome assembly accuracy before (CHM13v0.9) and after (CHM13v1.0 and CHM13v1.1) polishing. b, The “do no harm” polishing strategy developed and implemented to generate CHM13v1.0 and CHM13v1.1.

**Figure 2 |. Sequencing biases in PacBio HiFi and Illumina reads.**
a, Venn diagram of the “missing” k-mers found in the assembly but not in the HiFi reads (green) or Illumina reads (blue). Except for the 1,094 k-mers that were absent from both HiFi and Illumina reads, error k-mers were found in the other sequencing platform with the expected frequency, matching the average sequencing coverage (lower panels). b, Missing k-mers from a with their GC content, colored by the frequency observed. Low-frequency erroneous k-mers did not have a clear GC bias. k-mers found only in HiFi had a higher GC percentage, while higher-frequency k-mers in Illumina tended to be more AT-rich. c, Homopolymer length distribution observed in the assembly and in HiFi reads (upper) or Illumina reads (lower) aligned to that position. Longer homopolymers in the consensus are associated with length variability in HiFi reads, especially in GC homopolymers. The majority of the Illumina reads were concordant with the consensus.

**Figure 3 |. Errors corrected after polishing.**
a, Three corrected SV-like errors. b, Bionano optical maps indicating the missing telomeric sequence on the Chr. 18 p-arm (left), with higher-than-average mapping coverage. This excess coverage was removed after adding the missing telomeric sequence (right), and most of the Bionano molecules end at the end of the sequence. c, Variant allele frequency (VAF) of each variant called by DeepVariant in hybrid (HiFi + Illumina) mode, before and after polishing. Most of the high-frequency variants (errors), which were called as ‘homozygous’ variants, are removed after polishing. d, Total number of reads at each observed length difference (bp) between the assembly and the aligned reads at each edit position. Positive numbers indicate more bases in the reads, while negative numbers indicate fewer bases in the reads. Both the homopolymer and microsatellite (2-mers in homopolymer-compressed space) length differences became 0 after polishing.
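For readers unfamiliar with homopolymer-compressed space, the sketch below shows the idea behind panel d on toy strings. It is only an illustration under simplifying assumptions (a read that differs from the assembly solely in run lengths, with no real alignment step), and the function names are hypothetical. The sign convention follows the caption: positive means more bases in the read.

```python
from itertools import groupby


def homopolymer_runs(seq: str) -> list:
    """Collapse a sequence into (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def compress(seq: str) -> str:
    """Return the homopolymer-compressed sequence (each run reduced to one base)."""
    return "".join(base for base, _ in homopolymer_runs(seq))


def run_length_differences(assembly: str, read: str) -> list:
    """Per-run length difference, read minus assembly.

    Assumes both sequences are identical in compressed space, i.e. they
    differ only in how long each homopolymer run is.
    """
    a_runs, r_runs = homopolymer_runs(assembly), homopolymer_runs(read)
    if [b for b, _ in a_runs] != [b for b, _ in r_runs]:
        raise ValueError("sequences differ in homopolymer-compressed space")
    return [r_len - a_len for (_, a_len), (_, r_len) in zip(a_runs, r_runs)]


print(compress("AAACCGTTTT"))
print(run_length_differences("AAACCGT", "AAAACGT"))
```

Tallying these differences across all reads at an edit position gives the kind of histogram the panel describes, where a peak at 0 means the reads agree with the assembly.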

**Figure 4 |. Examples of the largest CHM13 regions with a copy number in the reference that differs from GRCh38 and most individuals.**
a, One of the two largest examples of rare collapses in CHM13, where one copy of a common 72 kb tandem duplication is absent in CHM13. b, The largest rare duplication in CHM13, a 142 kb tandem duplication of sequence in GRCh38 that is rare in the population. CHM13 and HG002 PacBio HiFi coverage tracks are displayed for both references, GRCh38 (top) and CHM13v1.0 (bottom), to demonstrate that CHM13 reads support the CHM13 copy-number but HG002 reads are consistent with the GRCh38 copy-number. Read-depth copy-number estimates in CHM13 are shown at the bottom for ‘k-merized’ versions of GRCh38 and CHM13v1.0 references, CHM13 Illumina reads, and Illumina reads from a diverse subset (n=34) of SGDP individuals.

**Figure 5 |. Errors made by automated polishing.**
a, The distribution of the number of polishing edits made in non-overlapping 1 Mb windows of the CHM13v0.9 assembly. b, Two Racon polishing edits causing false frameshift errors in the *FAM156B* gene. Light blue indicates UTR and dark blue indicates the single coding sequence exon. Highlighted sequence indicates GC-rich homopolymers.

Comment in

Polishing high-quality genome assemblies.
Fang L, Wang K. Nat Methods. 2022 Jun;19(6):649-650. doi: 10.1038/s41592-022-01515-1. PMID: 35610477. No abstract available.
