Cluster-efficient pangenome graph construction with nf-core/pangenome

Simon Heumos^{1,2,3,4}, Michael L Heuer⁵, Friederike Hanssen^{1,2,3,4}, Lukas Heumos^{6,7,8}, Andrea Guarracino^{9,10}, Peter Heringer^{1,2,3,4}, Philipp Ehmele⁶, Pjotr Prins⁹, Erik Garrison⁹, Sven Nahnsen^{1,2,3,4}

Affiliations

¹ Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, 72076, Germany.
² Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, 72076, Germany.
³ M3 Research Center, University Hospital Tübingen, Tübingen, 72076, Germany.
⁴ Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, 72076, Germany.
⁵ University of California, Berkeley, Berkeley, CA 94720, United States.
⁶ Department of Computational Health, Institute of Computational Biology, Helmholtz Munich, Munich, 85764, Germany.
⁷ Comprehensive Pneumology Center with the CPC-M bioArchive, Helmholtz Zentrum Munich, Member of the German Center for Lung Research (DZL), Munich, 81377, Germany.
⁸ TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, 81377, Germany.
⁹ Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States.
¹⁰ Human Technopole, Milan 20157, Italy.

PMID: 39400346
PMCID: PMC11568064
DOI: 10.1093/bioinformatics/btae609

Simon Heumos et al. Bioinformatics. 2024 Nov 1;40(11):btae609. doi: 10.1093/bioinformatics/btae609.

Abstract

Motivation: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time.

Results: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core's best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two- to threefold speedup compared to PGGB without increasing greenhouse gas emissions.
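The idea of distributing all-to-all alignments across cluster nodes can be illustrated with a minimal Python sketch. It is not the pipeline's implementation: the function name `split_pairs`, the round-robin scheme, the toy haplotype names, and the treatment of pairs as unordered are all assumptions made for illustration. It only shows how a quadratic set of pairwise alignment tasks can be divided into chunks of near-equal size, each of which could then be scheduled as an independent job.

```python
from itertools import combinations


def split_pairs(names, n_chunks):
    """Distribute all unordered sequence pairs round-robin into n_chunks lists.

    Hypothetical helper: chunk sizes differ by at most one pair.
    """
    chunks = [[] for _ in range(n_chunks)]
    for i, pair in enumerate(combinations(names, 2)):
        chunks[i % n_chunks].append(pair)
    return chunks


# Toy input: four hypothetical haplotype names give six pairwise tasks.
names = ["hapA", "hapB", "hapC", "hapD"]
chunks = split_pairs(names, 3)
for idx, chunk in enumerate(chunks):
    print(idx, chunk)
```

Because each chunk is independent of the others, a workflow manager can run the chunks on different nodes and merge the resulting alignments afterwards.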

Availability and implementation: nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.

Conflict of interest statement

Author L.H. is employed by LaminLabs.

Figures

**Figure 1.**
(a) Schematic representation of the nf-core/pangenome workflow processes and detailed analysis steps. The input consists of one FASTA file containing all sequences. The pipeline has three major entry points: (1) community detection, which identifies clusters of related sequences or regions in the pangenome graph to reveal biologically significant patterns such as conserved or divergent areas across genomes (Supplementary Material 5.2), (2) alignment distribution, and (3) the core workflow. Optional community detection (1) is performed on the input sequences. If selected, the heavy all-to-all base-pair level alignments (2) can be split into problems of equal size. The core workflow (3) of nf-core/pangenome directly mirrors PGGB. When running in community mode, all community graphs are combined into one (4) and the subsequent quality control subworkflow is executed. The output is a pangenome graph in GFA format. (b, c) Pangenome growth curves of the built pangenome graphs. The growth type is defined as the minimum fraction of haplotypes that must share a graph feature each time a haplotype is added to the growth histogram. $\mathrm{quorum} \geq 0$: all sequences are considered, without any filtering. $\mathrm{quorum} \geq 10$: sequences traversed by at least 10% of the haplotypes. $\mathrm{quorum} \geq 50$: sequences traversed by at least 50% of the haplotypes. $\mathrm{quorum} \geq 95$: sequences traversed by at least 95% of the haplotypes. (b) Pangenome growth curve of the chromosome 19 pangenome graph of 1000 haplotypes. (c) Pangenome growth curve of the *Escherichia coli* pangenome graph of 2146 haplotypes.
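The quorum thresholds in panels (b) and (c) can be made concrete with a small Python sketch. It is a simplified illustration, not the tool used to produce the figure: the function `quorum_length`, the toy node lengths, and the haplotype sets are hypothetical, and the sketch computes only the final value for a fixed set of haplotypes rather than the full curve obtained by adding haplotypes one at a time.

```python
def quorum_length(node_len, node_haps, n_haplotypes, quorum):
    """Sum the lengths of nodes traversed by at least `quorum` (a fraction
    between 0 and 1) of all haplotypes. Hypothetical helper for illustration.
    """
    total = 0
    for node, length in node_len.items():
        if len(node_haps[node]) / n_haplotypes >= quorum:
            total += length
    return total


# Toy graph (assumed data): node lengths in bp and the haplotypes on each node.
node_len = {"n1": 10, "n2": 5, "n3": 3}
node_haps = {"n1": {"a", "b", "c", "d"}, "n2": {"a", "b"}, "n3": {"a"}}

for q in (0.0, 0.10, 0.50, 0.95):
    print(q, quorum_length(node_len, node_haps, n_haplotypes=4, quorum=q))
```

In this toy graph, raising the quorum progressively excludes nodes that only a few haplotypes traverse, so the remaining length shrinks toward the sequence shared by nearly all haplotypes.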

Grants and funding

031A532B/German Network for Bioinformatics Infrastructure
