Evaluating data requirements for high-quality haplotype-resolved genomes for creating robust pangenome references

Prasad Sarashetti¹, Josipa Lipovac², Filip Tomas², Mile Šikić³,⁴, Jianjun Liu⁵,⁶

Affiliations

¹ Laboratory of Human Genomics, Genome Institute of Singapore, A*STAR, Singapore, Singapore.
² Laboratory for Bioinformatics and Computational Biology, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia.
³ Laboratory for Bioinformatics and Computational Biology, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia. miles@gis.a-star.edu.sg.
⁴ Laboratory of AI in Genomics, Genome Institute of Singapore, A*STAR, Singapore, Singapore. miles@gis.a-star.edu.sg.
⁵ Laboratory of Human Genomics, Genome Institute of Singapore, A*STAR, Singapore, Singapore. liuj3@gis.a-star.edu.sg.
⁶ Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore. liuj3@gis.a-star.edu.sg.

PMID: 39696427
PMCID: PMC11658127
DOI: 10.1186/s13059-024-03452-y

Genome Biol. 2024 Dec 18;25(1):312.

Abstract

Background: Long-read technologies from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have transformed genomics research by providing diverse data types like HiFi, Duplex, and ultra-long ONT. Despite recent strides in achieving haplotype-phased gapless genome assemblies using long-read technologies, concerns persist regarding the representation of genetic diversity, prompting the development of pangenome references. However, pangenome studies face challenges related to data types, volumes, and cost considerations for each assembled genome, while striving to maintain sensitivity. The absence of comprehensive guidance on optimal data selection exacerbates these challenges.

Results: Our study evaluates the data types and volumes required to establish a robust de novo genome assembly pipeline for population-level pangenome projects. We extensively compare ONT Duplex and PacBio HiFi datasets in terms of achieving high-quality phased genomes with enhanced contiguity and completeness. The results show that achieving chromosome-level haplotype-resolved assembly requires 20× high-quality long reads such as PacBio HiFi or ONT Duplex, combined with 15-20× of ultra-long ONT per haplotype and 10× of long-range data such as Omni-C or Hi-C. High-quality long reads from both platforms yield assemblies with comparable contiguity, with HiFi excelling in phasing accuracy, while Duplex generates more T2T contigs.
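To give a rough sense of what these coverage figures mean in raw sequencing yield, the following Python sketch converts fold-coverage into gigabases. The 3.1 Gbp haploid human genome size and all function names are assumptions introduced for illustration and do not come from the study. The sketch also does not interpret the "per haplotype" qualifier; it simply multiplies coverage by the assumed haploid genome size.

```python
# Hypothetical helper: convert fold-coverage into sequencing yield (Gbp).
# ASSUMPTION: ~3.1 Gbp haploid human genome; not a value from the study.
HAPLOID_GENOME_GBP = 3.1


def required_yield_gbp(coverage, genome_gbp=HAPLOID_GENOME_GBP):
    """Return the yield in Gbp needed to reach `coverage`-fold depth."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return coverage * genome_gbp


# (low, high) coverage ranges quoted in the Results paragraph.
RECOMMENDED = {
    "HQLR (HiFi or Duplex)": (20, 20),
    "ULONT": (15, 20),
    "Omni-C/Hi-C": (10, 10),
}


def yield_table(recipe=RECOMMENDED, genome_gbp=HAPLOID_GENOME_GBP):
    """Map each data type to its (low, high) yield in Gbp."""
    return {
        name: (required_yield_gbp(low, genome_gbp),
               required_yield_gbp(high, genome_gbp))
        for name, (low, high) in recipe.items()
    }


if __name__ == "__main__":
    for name, (low, high) in yield_table().items():
        print(f"{name}: {low:.1f} to {high:.1f} Gbp")
```

Passing a different `genome_gbp` adapts the same arithmetic to other species or to a diploid genome size.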

Conclusion: Our study provides insights into optimal data types and volumes for robust de novo genome assembly in population-level pangenome projects. Reassessing the recommended data types and volumes in this study and aligning them with practical economic limitations are vital to the pangenome research community, contributing to their efforts and pushing genomic studies with broader impacts.

Keywords: De novo assembly; LRS special issue; Pangenome; Population-level studies; Sequencing platforms.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: All participants provided informed consent for sample collection and usage, including making data publicly available via databases. Sample collection and usage were approved by the SingHealth Centralised Institutional Review Board (IRB Reference: 2024–069). All experimental methods comply with the Helsinki Declaration. Consent for publication: Not applicable. Competing interests: M.Š. has been jointly funded by Oxford Nanopore Technologies and AI Singapore for the project AI-driven De Novo Diploid Assembler. The remaining authors declare no competing interests.

Figures

**Fig. 1**
Comparison of read length and quality (Phred scale) between PacBio HiFi and ONT Duplex reads. a Distribution of read length vs quality of HiFi and Duplex reads with vertical dotted lines indicating the average lengths: 17 kbp for HiFi and 29.5 kbp for Duplex reads. On average, more than 50% of both Duplex and HiFi reads have quality scores ≥ Q30, a general cutoff for high-quality reads, indicated by the horizontal dotted line. b Comparison of read quality among ONT Simplex, ONT Duplex, and PacBio HiFi, with vertical dotted lines representing average quality scores: Q16 for Simplex, Q29 for Duplex, and Q32 for HiFi reads. c Percentage of reads with a quality score of Q30 and higher (dotted line). On average, 63% of HiFi reads and 57% of Duplex reads have a quality score of Q30 and higher
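The Phred scale used in this caption maps a quality score Q to an error probability of 10^(-Q/10), so Q30 corresponds to one expected error per 1,000 bases. A minimal Python sketch of that conversion follows; the function names are illustrative and are not taken from the study.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability back to a Phred score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Average qualities quoted in the caption for each read type.
    for label, q in [("Simplex", 16), ("Duplex", 29), ("HiFi", 32)]:
        print(f"{label}: Q{q} -> error probability {phred_to_error_prob(q):.5f}")
```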

**Fig. 2**
Comparison of assembly performance versus data coverage. “HQLR_Only” denotes assemblies generated solely with HiFi or Duplex data across various coverages. “HQLR + ULONT” signifies assemblies generated with a saturation coverage (35 ×) of HiFi or Duplex data combined with various ULONT coverages

**Fig. 3**
Comparison of phasing accuracies of different assemblies (Duplex assemblies—top row, HiFi assemblies—bottom row). a Phasing accuracy of dual assembly generated from HQLR-only (HiFi/Duplex). b Phasing accuracy of dual assembly in conjunction with ULONT. c Haplotype separated assemblies with Omni-C/Hi-C data. Each circle denotes a contig, size reflecting its length. Circle’s positions are determined by the number of maternal and paternal k-mers derived from high-quality short reads on respective contigs. Contigs positioned along the axis indicate higher phasing accuracy
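The idea behind this plot, placing each contig by its counts of maternal and paternal k-mers, can be sketched in Python as below. This is a toy illustration under stated assumptions, not the study's pipeline: parental k-mers are held in plain Python sets, only the forward strand is considered, and every function name is hypothetical. Real analyses derive parent-specific k-mers from short reads using dedicated tools.

```python
def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def parent_specific(maternal_kmers, paternal_kmers):
    """Keep only k-mers unique to one parent; shared k-mers are dropped."""
    return maternal_kmers - paternal_kmers, paternal_kmers - maternal_kmers


def hapmer_counts(contig, maternal_only, paternal_only, k):
    """Count maternal-specific and paternal-specific k-mers in a contig."""
    mat = pat = 0
    for km in kmers(contig, k):
        if km in maternal_only:
            mat += 1
        elif km in paternal_only:
            pat += 1
    return mat, pat


if __name__ == "__main__":
    k = 3
    # Toy "parental" sequences standing in for short-read k-mer databases.
    mat_only, pat_only = parent_specific(
        set(kmers("ACGTAC", k)), set(kmers("ACGGAC", k))
    )
    # A contig with counts on one axis only is consistently phased;
    # counts on both axes suggest a haplotype switch within the contig.
    print(hapmer_counts("ACGTAC", mat_only, pat_only, k))
    print(hapmer_counts("CGTGGA", mat_only, pat_only, k))
```

A contig whose counts fall on a single axis carries k-mers from only one parent, which is what the caption means by contigs positioned along the axis.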

**Fig. 4**
Assessment of gene completeness analysis. a Output from the assemblies generated from HQLR-only. b Output from assemblies generated from HQLR + ULONT

**Fig. 5**
Assessment of k-mer-based genome completeness analysis

**Fig. 6**
K-mer-based genome quality scores

**Fig. 7**
Computation resources consumed by hifiasm across different data types and coverages

**Fig. 8**
The relative performance of HiFi vs Duplex assemblies from hifiasm and Verkko using I002C and HG002 dataset with plateau coverage, i.e., (35 × HiFi vs 35 × Duplex) + (30 × ULONT and 10 × Omni-C/Hi-C). For each assembly feature, we compared 24 assemblies obtained from 2 samples, 2 assemblers, 3 replicates, and 2 haplotypes, recording instances where the HiFi assembly outperformed the Duplex assembly and vice versa. “hifiasm + Verkko” shows the overall performance of HiFi vs Duplex across 24 assemblies. For example, 23 out of 24 HiFi assemblies had a lower switch error compared to Duplex assemblies, while 19 Duplex assemblies contained more T2T contigs than HiFi assemblies. Both platforms, however, produced comparable results in terms of NG50 and the length of the longest contigs
