Evaluating data requirements for high-quality haplotype-resolved genomes for creating robust pangenome references
- PMID: 39696427
- PMCID: PMC11658127
- DOI: 10.1186/s13059-024-03452-y
Abstract
Background: Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT. Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies face challenges related to data types, volumes, and cost considerations for each assembled genome, while striving to maintain sensitivity. The absence of comprehensive guidance on optimal data selection exacerbates these challenges.
Results: Our study evaluates recommended data types and volumes required to establish a robust de novo genome assembly pipeline for population-level pangenome projects, extensively comparing the performance of ONT's Duplex and PacBio HiFi datasets in achieving high-quality phased genomes with enhanced contiguity and completeness. The results show that achieving chromosome-level haplotype-resolved assembly requires 20× high-quality long reads such as PacBio HiFi or ONT Duplex, combined with 15-20× of ultra-long ONT per haplotype and 10× of long-range data such as Omni-C or Hi-C. High-quality long reads from both platforms yield assemblies with comparable contiguity, with HiFi excelling in phasing accuracy, while Duplex generates more T2T contigs.
Conclusion: Our study provides insights into optimal data types and volumes for robust de novo genome assembly in population-level pangenome projects. Reassessing the recommended data types and volumes in this study and aligning them with practical economic limitations are vital to the pangenome research community, contributing to their efforts and pushing genomic studies with broader impacts.
Keywords: De novo assembly; LRS special issue; Pangenome; Population-level studies; Sequencing platforms.
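To make the coverage figures in the Results concrete, the following minimal sketch converts fold-coverage into approximate sequencing yield. It assumes the standard relation yield = coverage × genome size and a haploid genome of about 3.1 Gb; that genome size, and the names `HAPLOID_GENOME_BP`, `RECOMMENDED_COVERAGE`, `required_gigabases`, and `yield_table`, are illustrative assumptions and do not come from the paper. Whether a given figure applies per haplotype or per sample should be read from the abstract, not from this sketch.

```python
# Assumption: human-scale haploid genome size in base pairs (not stated in the abstract).
HAPLOID_GENOME_BP = 3_100_000_000

# Fold-coverage ranges (low, high) as quoted in the abstract's Results.
RECOMMENDED_COVERAGE = {
    "PacBio HiFi or ONT Duplex": (20, 20),
    "ultra-long ONT": (15, 20),
    "Omni-C or Hi-C": (10, 10),
}


def required_gigabases(coverage, genome_bp=HAPLOID_GENOME_BP):
    """Return the sequencing yield in gigabases needed for a given fold-coverage."""
    return coverage * genome_bp / 1e9


def yield_table(coverage_by_type=RECOMMENDED_COVERAGE, genome_bp=HAPLOID_GENOME_BP):
    """Map each data type to its (low, high) yield in gigabases."""
    return {
        name: (required_gigabases(low, genome_bp), required_gigabases(high, genome_bp))
        for name, (low, high) in coverage_by_type.items()
    }


if __name__ == "__main__":
    for name, (low_gb, high_gb) in yield_table().items():
        print(f"{name}: {low_gb:.1f} to {high_gb:.1f} Gb")
```

Passing a different `genome_bp` adapts the estimate to other species or to a diploid total; the coverage values themselves are the only inputs taken from the text.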
© 2024. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: All the participants were provided with informed consent for sample collection and usage, including making data publicly available via databases. Sample collection and usage were approved by SingHealth Centralised Institutional Review Board (IRB Reference: 2024–069). All experimental methods comply with the Helsinki Declaration. Consent for publication: Not applicable. Competing interests: M.Š. has been jointly funded by Oxford Nanopore Technologies and AI Singapore for the project AI-driven De Novo Diploid Assembler. The remaining authors declare no competing interests.