Bioinformatics. 2024 Nov 1;40(11):btae609.
doi: 10.1093/bioinformatics/btae609.

Cluster-efficient pangenome graph construction with nf-core/pangenome


Simon Heumos et al. Bioinformatics.

Abstract

Motivation: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time.

Results: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core's best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions.
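To make the idea of distributing alignments concrete, the following is a toy sketch, not the pipeline's own code. It assumes the all-to-all work can be described as a list of unordered sequence pairs, and it deals those pairs round-robin into a chosen number of chunks so that each chunk could be handed to a different cluster node. The function name `split_all_to_all` and the haplotype labels are hypothetical.

```python
from itertools import combinations


def split_all_to_all(names, n_chunks):
    """Enumerate every unordered pair of sequences and deal the pairs
    round-robin into n_chunks lists whose sizes differ by at most one."""
    pairs = list(combinations(names, 2))
    chunks = [[] for _ in range(n_chunks)]
    for i, pair in enumerate(pairs):
        chunks[i % n_chunks].append(pair)
    return chunks


# Hypothetical haplotype labels, for illustration only.
haplotypes = ["hapA", "hapB", "hapC", "hapD"]
chunks = split_all_to_all(haplotypes, 3)
for idx, chunk in enumerate(chunks):
    print(idx, chunk)
```

Each chunk here is an independent unit of work, which is the property a workflow manager needs in order to schedule the alignments on separate nodes.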

Availability and implementation: nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.

Conflict of interest statement

Author L.H. is employed by LaminLabs.

Figures

Figure 1.
(a) Schematic representation of the nf-core/pangenome workflow processes and detailed analysis steps. The input consists of one FASTA file containing all sequences. The pipeline comes with three major entry points: (1) community detection, which identifies clusters of related sequences or regions in the pangenome graph to reveal biologically significant patterns like conserved or divergent areas across genomes (Supplementary Material 5.2), (2) alignment distribution, and (3) core workflow. Optional community detection (1) is performed on the input sequences. If selected, the heavy all-to-all base-pair level alignments (2) can be split into problems of equal size. nf-core/pangenome’s core workflow (3) is a direct mirror of PGGB. If running in community mode, all communal graphs are combined into one (4) and the subsequent quality control subworkflow is executed. The output is a pangenome graph in GFA format.

(b, c) Pangenome growth curves of the built pangenome graphs. Growth type is defined as the minimum fraction of haplotypes that must share a graph feature each time a haplotype is added to the growth histograph. quorum>=0: all sequences are considered, without any filtering. quorum>=10: sequences traversed by at least 10% of the haplotypes. quorum>=50: sequences traversed by at least 50% of the haplotypes. quorum>=95: sequences traversed by at least 95% of the haplotypes. (b) Pangenome growth curve of the chromosome 19 pangenome graph of 1000 haplotypes. (c) Pangenome growth curve of the Escherichia coli pangenome graph of 2013 haplotypes.
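The quorum thresholds in the growth curves can be illustrated with a small sketch. It is a simplification under stated assumptions and not the tool used to produce the figure: each haplotype is represented as a set of node IDs, haplotypes are added in one fixed order, and the curve counts nodes rather than base pairs. The function name `growth_curve` and the toy paths are hypothetical.

```python
import math


def growth_curve(paths, quorum=0.0):
    """Return, for each k, the number of nodes traversed by at least
    max(1, ceil(quorum * k)) of the first k haplotypes.

    paths: list of sets of node IDs, one set per haplotype, in the
    order in which the haplotypes are added (an assumed, fixed order).
    """
    counts = {}
    curve = []
    for k, nodes in enumerate(paths, start=1):
        for node in nodes:
            counts[node] = counts.get(node, 0) + 1
        needed = max(1, math.ceil(quorum * k))
        curve.append(sum(1 for c in counts.values() if c >= needed))
    return curve


# Toy data: four haplotypes over five nodes.
toy_paths = [{1, 2, 3}, {1, 2, 4}, {1, 3, 5}, {1, 2, 3}]
print(growth_curve(toy_paths, quorum=0.0))  # [3, 4, 5, 5]
print(growth_curve(toy_paths, quorum=0.5))  # [3, 4, 3, 3]
print(growth_curve(toy_paths, quorum=1.0))  # [3, 2, 1, 1]
```

With a quorum of 0 the curve can only grow as haplotypes are added, while higher quorums discard rarely shared nodes, so the curve flattens or shrinks toward the shared core.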
