Genome Biol. 2024 Dec 18;25(1):312.
doi: 10.1186/s13059-024-03452-y.

Evaluating data requirements for high-quality haplotype-resolved genomes for creating robust pangenome references

Prasad Sarashetti et al. Genome Biol.

Abstract

Background: Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT. Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies face challenges related to data types, volumes, and cost considerations for each assembled genome, while striving to maintain sensitivity. The absence of comprehensive guidance on optimal data selection exacerbates these challenges.

Results: Our study evaluates recommended data types and volumes required to establish a robust de novo genome assembly pipeline for population-level pangenome projects, extensively examining performance between ONT's Duplex and PacBio HiFi datasets in the context of achieving high-quality phased genomes with enhanced contiguity and completeness. The results show that achieving chromosome-level haplotype-resolved assembly requires 20 × high-quality long reads such as PacBio HiFi or ONT Duplex, combined with 15-20 × of ultra-long ONT per haplotype and 10 × of long-range data such as Omni-C or Hi-C. High-quality long reads from both platforms yield assemblies with comparable contiguity, with HiFi excelling in phasing accuracies, while Duplex generates more T2T contigs.
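As a rough illustration of what these coverage targets mean in sequencing yield, the sketch below converts fold coverage into gigabases. The 3.1 Gbp genome size is an assumed round figure for a human genome and is not taken from the study; the function name is hypothetical, and the same genome size is applied to every data type for simplicity.

```python
# Assumed human genome size in base pairs (illustrative; not a value from the study).
ASSUMED_GENOME_SIZE_BP = 3_100_000_000


def required_yield_gbp(coverage, genome_size_bp=ASSUMED_GENOME_SIZE_BP):
    """Return the sequencing yield in Gbp needed to reach a given fold coverage."""
    return coverage * genome_size_bp / 1e9


# Coverage targets quoted in the Results paragraph above.
targets = {
    "HQLR (HiFi or Duplex)": 20,
    "ULONT (lower bound)": 15,
    "ULONT (upper bound)": 20,
    "Omni-C/Hi-C": 10,
}

yields = {name: required_yield_gbp(cov) for name, cov in targets.items()}
for name, gbp in yields.items():
    print(f"{name}: {gbp:.1f} Gbp")
```

Changing `genome_size_bp` adapts the estimate to other species or to a different size assumption.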

Conclusion: Our study provides insights into optimal data types and volumes for robust de novo genome assembly in population-level pangenome projects. Reassessing the recommended data types and volumes in this study and aligning them with practical economic limitations are vital to the pangenome research community, contributing to their efforts and pushing genomic studies with broader impacts.

Keywords: De novo assembly; LRS special issue; Pangenome; Population-level studies; Sequencing platforms.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: All participants provided informed consent for sample collection and usage, including making data publicly available via databases. Sample collection and usage were approved by the SingHealth Centralised Institutional Review Board (IRB Reference: 2024–069). All experimental methods comply with the Helsinki Declaration. Consent for publication: Not applicable. Competing interests: M.Š. has been jointly funded by Oxford Nanopore Technologies and AI Singapore for the project AI-driven De Novo Diploid Assembler. The remaining authors declare no competing interests.

Figures

Fig. 1
Comparison of read length and quality (Phred scale) between PacBio HiFi and ONT Duplex reads. A Distribution of read length vs quality of HiFi and Duplex reads with vertical dotted lines indicating the average lengths: 17 kbp for HiFi and 29.5 kbp for Duplex reads. On average, more than 50% of both Duplex and HiFi reads have quality scores ≥ Q30, a general cutoff for high-quality reads, indicated by the horizontal dotted line. B Comparison of read quality among ONT Simplex, ONT Duplex, and PacBio HiFi, with vertical dotted lines representing average quality scores: Q16 for Simplex, Q29 for Duplex, and Q32 for HiFi reads. C Percentage of reads with a quality score of Q30 and higher (dotted line). On average, 63% of HiFi reads and 57% of Duplex reads have a quality score of Q30 and higher
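The Phred values in this caption map to per-base error probabilities through the standard relation Q = -10 log10(P). The short sketch below applies that relation to the quality scores quoted above; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability back to a Phred quality score."""
    return -10 * math.log10(p)


# Average quality scores and the Q30 cutoff quoted in the caption above.
for label, q in [("Simplex", 16), ("Duplex", 29), ("Q30 cutoff", 30), ("HiFi", 32)]:
    p = phred_to_error_prob(q)
    print(f"{label}: Q{q} -> error probability {p:.5f} (accuracy {100 * (1 - p):.3f}%)")
```

Under this relation, Q30 corresponds to one expected error per 1,000 bases.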
Fig. 2
Comparison of assembly performance versus data coverage. “HQLR_Only” denotes assemblies generated solely with HiFi or Duplex data across various coverages. “HQLR + ULONT” signifies assemblies generated with a saturation coverage (35 ×) of HiFi or Duplex data combined with various ULONT coverages
Fig. 3
Comparison of phasing accuracies of different assemblies (Duplex assemblies: top row; HiFi assemblies: bottom row). a Phasing accuracy of dual assembly generated from HQLR-only (HiFi/Duplex). b Phasing accuracy of dual assembly in conjunction with ULONT. c Haplotype-separated assemblies with Omni-C/Hi-C data. Each circle denotes a contig, with its size reflecting the contig length. Circle positions are determined by the number of maternal and paternal k-mers, derived from high-quality short reads, found on the respective contigs. Contigs positioned along an axis carry k-mers from predominantly one parent, indicating higher phasing accuracy
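The per-contig coordinates described in this caption can be sketched as a count of parent-specific k-mers on each contig. This is a minimal toy version, not the tool used in the study: the k-mer sets, contig sequence, and value of k below are made up for illustration, and real analyses use longer k-mers and also account for reverse complements.

```python
def iter_kmers(seq, k):
    """Yield every k-mer of the sequence, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_parental_kmers(contig, maternal_kmers, paternal_kmers, k):
    """Return (maternal_count, paternal_count) for one contig.

    maternal_kmers and paternal_kmers are assumed to be disjoint sets of
    k-mers specific to each parent.
    """
    maternal = paternal = 0
    for kmer in iter_kmers(contig, k):
        if kmer in maternal_kmers:
            maternal += 1
        elif kmer in paternal_kmers:
            paternal += 1
    return maternal, paternal


# Toy inputs (hypothetical): k = 3 and one short contig.
maternal_only = {"ACG"}
paternal_only = {"TTT"}
counts = count_parental_kmers("ACGACGTTT", maternal_only, paternal_only, k=3)
print(counts)  # (2, 1)
```

A well-phased contig would yield a count near zero for one parent, which places its circle close to an axis in the plot.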
Fig. 4
Assessment of gene completeness analysis. a Output from the assemblies generated from HQLR-only. b Output from assemblies generated from HQLR + ULONT
Fig. 5
Assessment of k-mer-based genome completeness analysis
Fig. 6
K-mer-based genome quality scores
Fig. 7
Computation resources consumed by hifiasm across different data types and coverages
Fig. 8
The relative performance of HiFi vs Duplex assemblies from hifiasm and Verkko using I002C and HG002 dataset with plateau coverage, i.e., (35 × HiFi vs 35 × Duplex) + (30 × ULONT and 10 × Omni-C/Hi-C). For each assembly feature, we compared 24 assemblies obtained from 2 samples, 2 assemblers, 3 replicates, and 2 haplotypes, recording instances where the HiFi assembly outperformed the Duplex assembly and vice versa. “hifiasm + Verkko” shows the overall performance of HiFi vs Duplex across 24 assemblies. For example, 23 out of 24 HiFi assemblies had a lower switch error compared to Duplex assemblies, while 19 Duplex assemblies contained more T2T contigs than HiFi assemblies. Both platforms, however, produced comparable results in terms of NG50 and the length of the longest contigs
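For readers unfamiliar with NG50, the contiguity metric named in this caption, the sketch below computes it from a list of contig lengths using the usual definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the estimated genome size. The contig lengths and genome size are toy values, not data from the study.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50: the contig length at which the cumulative length of
    contigs, sorted longest first, reaches half of the genome size.

    Returns 0 if the contigs together cover less than half of the genome size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Toy example (hypothetical values): five contigs, estimated genome size 200.
toy_lengths = [50, 40, 30, 20, 10]
print(ng50(toy_lengths, genome_size=200))  # 30
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size, so assemblies of different total lengths can be compared on the same scale.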
