Nat Commun. 2022 Aug 4;13(1):4384.

doi: 10.1038/s41467-022-31724-3.

Pan-African genome demonstrates how population-specific genome graphs improve high-throughput sequencing data analysis

Affiliations

¹ Seven Bridges Genomics, Charlestown, MA, USA. serhat.tetikol@sevenbridges.com.
² Seven Bridges Genomics, Charlestown, MA, USA.

^# Contributed equally.

PMID: 35927245
PMCID: PMC9352875
DOI: 10.1038/s41467-022-31724-3

H Serhat Tetikol et al. Nat Commun. 2022.

Abstract

Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference to represent the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based toolkits for NGS read alignment and variant calling, methods to curate genomic variants and subsequently construct genome graphs remain an understudied problem that inevitably determines the effectiveness of the overall bioinformatics pipeline. In this study, we discuss obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and demonstrate this approach on the whole-genome samples of African ancestry. Our results show that population-specific graphs, as more representative alternatives to linear or generic graph references, can achieve significantly lower read mapping errors and enhanced variant calling sensitivity, in addition to providing the improvements of joint variant calling without the need for computationally intensive post-processing steps.

Conflict of interest statement

All authors have been employed by Seven Bridges Genomics Inc. during this study.

Figures

**Fig. 1. Steps involved in a multi-phase sequencing project.**
A Large-scale sequencing projects are commonly executed in multiple phases, each comprising the sequencing and bioinformatics analysis of only a subset of the samples that are planned to be sequenced throughout the project (Large-scale Project Cycle). This iterative nature provides the opportunity to produce genomic information in each cycle that can be used to improve the bioinformatics processes (Perpetual Improvement of Graph Genomes). Graph-based secondary analysis approaches can utilize this information to improve the variant detection power for subsequent cycles. B Iterative population-specific graph construction workflow. The initial population-specific graph reference (Pan-African 0) is constructed using publicly available variant databases. At each iteration, a subset of the population (construction set) is processed with the current graph, and the variant calls are used to construct the next graph. This process is repeated until the entire construction set is exhausted. All graph references are tested on the same benchmarking set and their performance is evaluated. The population-specific graphs (Pan-African 0-5) are also compared to a generic graph (Pan-Genome) containing genetic information from many populations and to a linear approach using only GRCh38 reference.
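The iterative workflow in panel B can be sketched in a few lines of Python. This is a toy illustration only, not the authors' pipeline: a "graph" is modeled as a plain set of variant identifiers, and `iterate_graphs`, `call_variants`, and the batch structure are hypothetical names introduced here for illustration.

```python
def iterate_graphs(initial_variants, batches, call_variants):
    """Build a sequence of graphs, one per iteration of the construction set.

    initial_variants: variants from public databases (the "iteration 0" graph).
    batches: list of sample batches, processed one batch per iteration.
    call_variants: hypothetical callable (graph, sample) -> iterable of variants.
    """
    graph = frozenset(initial_variants)
    graphs = [graph]
    for batch in batches:
        new_calls = set()
        for sample in batch:
            # Each sample is processed with the *current* graph.
            new_calls |= set(call_variants(graph, sample))
        # The calls from this batch are added to form the next graph.
        graph = frozenset(graph | new_calls)
        graphs.append(graph)
    return graphs
```

Each returned graph could then be evaluated on the same held-out benchmarking set, which is how the caption describes comparing the iterations.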

**Fig. 2. Population-specific graph construction summary.**
A Nucleotide diversity and divergence with respect to GRCh38 linear reference for each super-population in the 1000 Genomes dataset: African ancestry (AFR), American ancestry (AMR), South-Asian ancestry (SAS), East-Asian ancestry (EAS), European ancestry (EUR). B True positive (TPR) and false positive (FPR) rates in the constructed graph references as a function of number of samples used in construction for homogeneous (solid lines) and expected (dashed lines) sampling for super-populations; AFR (blue), AMR (orange), EAS (green), EUR (red), SAS (purple). C Overview of the graph construction method. D Summary statistics for Pan-African graphs constructed at each iteration of the workflow shown in Fig. 1B. Source data are provided as a Source Data file.

**Fig. 3. Alignment metrics for BWA (red), Pan-Genome (blue), and Pan-African Iterations (green).**
Rate of unmapped (A), improper (B), multi-mapped (MAPQ = 0) (C), uninformative (MAPQ < 20) (D), and informative reads (MAPQ ≥ 20) (E). F Alignment error rate. Error rate is the ratio of mismatches to aligned bases in read alignments with respect to the reference. Two-sided Wilcoxon tests between consecutive distributions are performed. In all cases, except for one (uninformative reads between iterations 2 and 3), the difference is significant (p < 10⁻³). G Total number of variants in graph (solid bars) and per sample mean of number of used variants/edges in alignment (dashed bars). Magenta line shows the ratio of used variants to the graph size. H Categorization of variant utilization in alignment with respect to the number of samples: 0% (pink), below 50% (purple), above 50% (green), 100% (yellow). Source data are provided as a Source Data file.
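The MAPQ thresholds and the error-rate definition in this caption can be made concrete with a minimal Python sketch. The `Read` record and `alignment_summary` function are hypothetical; the sketch assumes each rate is taken over the total number of reads and that the MAPQ < 20 category includes MAPQ = 0 reads, neither of which the caption states explicitly.

```python
from dataclasses import dataclass


@dataclass
class Read:
    mapped: bool
    mapq: int = 0           # mapping quality reported by the aligner
    mismatches: int = 0     # mismatching bases relative to the reference
    aligned_bases: int = 0  # bases aligned to the reference


def alignment_summary(reads):
    """Per-sample read rates and alignment error rate (toy definitions)."""
    total = len(reads)
    if total == 0:
        raise ValueError("no reads given")
    mapped = [r for r in reads if r.mapped]
    aligned = sum(r.aligned_bases for r in mapped)
    mismatches = sum(r.mismatches for r in mapped)
    return {
        "unmapped": sum(not r.mapped for r in reads) / total,
        "multi_mapped": sum(r.mapq == 0 for r in mapped) / total,
        "uninformative": sum(r.mapq < 20 for r in mapped) / total,
        "informative": sum(r.mapq >= 20 for r in mapped) / total,
        # Error rate: ratio of mismatches to aligned bases.
        "error_rate": mismatches / aligned if aligned else 0.0,
    }
```

With real data the fields would come from alignment records (for example the MAPQ column of a SAM/BAM file); here they are supplied directly to keep the example self-contained.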

**Fig. 4. Variant calling results for BWA+GATK (red), Pan-Genome (blue), and Pan-African Iterations (green).**
A Sample distribution of SNP counts, B cumulative AF distribution of SNPs separated into shared variants (solid lines), unique variants (dashed lines), and common variants with allele frequency difference (dotted lines), C INDEL counts, D cumulative AF distribution of INDELs separated into shared variants (solid lines), unique variants (dashed lines) and common variants with allele frequency difference (dotted lines), E structural variant (SV) counts, F size distribution of detected SVs, and G percentage of loci called by the graph pipeline for the variants rescued in the traditional joint calling (results are split based on the filtration output of VQSR). Two-sided Wilcoxon tests between consecutive distributions are performed for A, C, and E. In all cases, the difference is significant (p < 10⁻²¹). Source data are provided as a Source Data file.

Cited by

A Pangenomic Approach to Improve Population Genetics Analysis and Reference Bias in Underrepresented Middle Eastern and Horn of Africa Populations.
Oliva A, Foare R, Campbell P, Twine NA, Bauer DC, Johar AS. Biomolecules. 2025 Apr 15;15(4):582. doi: 10.3390/biom15040582. PMID: 40305331 Free PMC article.
Pangenome graphs and their applications in biodiversity genomics.
Secomandi S, Gallo GR, Rossi R, Rodríguez Fernandes C, Jarvis ED, Bonisoli-Alquati A, Gianfranceschi L, Formenti G. Nat Genet. 2025 Jan;57(1):13-26. doi: 10.1038/s41588-024-02029-6. Epub 2025 Jan 8. PMID: 39779953 Review.
Accurate human genome analysis with element avidity sequencing.
Carroll A, Kolesnikov A, Cook DE, Brambrink L, Wiseman KN, Billings SM, Kruglyak S, Lajoie BR, Zhao J, Levy SE, McLean CY, Shafin K, Nattestad M, Chang PC. BMC Bioinformatics. 2025 Jul 25;26(1):194. doi: 10.1186/s12859-025-06191-4. PMID: 40713517 Free PMC article.
Phased genome assemblies and pangenome graphs of human populations of Japan and Saudi Arabia.
Kulmanov M, Ashouri S, Liu Y, Abdelhakim M, Alsolme E, Nagasaki M, Ohkawa Y, Suzuki Y, Tawfiq R, Tokunaga K, Katayama T, Abedalthagafi MS, Hoehndorf R, Kawai Y. Sci Data. 2025 Aug 12;12(1):1316. doi: 10.1038/s41597-025-05652-y. PMID: 40796583 Free PMC article.
Personalizing medicine in Africa: current state, progress and challenges.
Owolabi P, Adam Y, Adebiyi E. Front Genet. 2023 Sep 19;14:1233338. doi: 10.3389/fgene.2023.1233338. eCollection 2023. PMID: 37795248 Free PMC article. Review.
