Nat Commun. 2022 Aug 4;13(1):4384.
doi: 10.1038/s41467-022-31724-3.

Pan-African genome demonstrates how population-specific genome graphs improve high-throughput sequencing data analysis

H Serhat Tetikol et al. Nat Commun. 2022.

Abstract

Graph-based genome reference representations have seen significant development, motivated by the inadequacy of the current human genome reference to represent the diverse genetic information from different human populations and its inability to maintain the same level of accuracy for non-European ancestries. While there have been many efforts to develop computationally efficient graph-based toolkits for NGS read alignment and variant calling, methods to curate genomic variants and subsequently construct genome graphs remain an understudied problem that inevitably determines the effectiveness of the overall bioinformatics pipeline. In this study, we discuss obstacles encountered during graph construction and propose methods for sample selection based on population diversity, graph augmentation with structural variants and resolution of graph reference ambiguity caused by information overload. Moreover, we present the case for iteratively augmenting tailored genome graphs for targeted populations and demonstrate this approach on the whole-genome samples of African ancestry. Our results show that population-specific graphs, as more representative alternatives to linear or generic graph references, can achieve significantly lower read mapping errors and enhanced variant calling sensitivity, in addition to providing the improvements of joint variant calling without the need for computationally intensive post-processing steps.

Conflict of interest statement

All authors have been employed by Seven Bridges Genomics Inc. during this study.

Figures

Fig. 1. Steps involved in a multi-phase sequencing project.
A Large-scale sequencing projects are commonly executed in multiple phases, each comprising the sequencing and bioinformatics analysis of only a subset of the samples that are planned to be sequenced throughout the project (Large-scale Project Cycle). This iterative nature provides the opportunity to produce genomic information in each cycle that can be used to improve the bioinformatics processes (Perpetual Improvement of Graph Genomes). Graph-based secondary analysis approaches can utilize this information to improve the variant detection power for subsequent cycles. B Iterative population-specific graph construction workflow. The initial population-specific graph reference (Pan-African 0) is constructed using publicly available variant databases. At each iteration, a subset of the population (construction set) is processed with the current graph, and the variant calls are used to construct the next graph. This process is repeated until the entire construction set is exhausted. All graph references are tested on the same benchmarking set and their performance is evaluated. The population-specific graphs (Pan-African 0-5) are also compared to a generic graph (Pan-Genome) containing genetic information from many populations and to a linear approach using only the GRCh38 reference.
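The iterative loop described for Fig. 1B can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: a "graph" is reduced to a set of variants, `call_variants` is a hypothetical stand-in for alignment plus variant calling against the current graph, `split_batches` and `iterate_graphs` are illustrative names, and the assumption that each new graph keeps the previous graph's variants is ours, not stated in the caption.

```python
from typing import Callable, FrozenSet, Iterable, List, Sequence, Set, Tuple

# Illustrative variant encoding: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]
Graph = FrozenSet[Variant]


def split_batches(samples: Sequence[str], n: int) -> List[List[str]]:
    """Split the construction set into n nearly equal, ordered batches."""
    size, extra = divmod(len(samples), n)
    batches, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        batches.append(list(samples[start:end]))
        start = end
    return batches


def iterate_graphs(
    initial_variants: Iterable[Variant],
    construction_set: Sequence[str],
    n_iterations: int,
    call_variants: Callable[[Graph, str], Set[Variant]],
) -> List[Graph]:
    """Return the list of graphs: iteration 0 plus one graph per batch.

    Assumption: each new graph is the previous graph augmented with the
    variants called on the current batch (cumulative augmentation).
    """
    graphs: List[Graph] = [frozenset(initial_variants)]  # seeded from public databases
    for batch in split_batches(construction_set, n_iterations):
        current = graphs[-1]
        new_calls: Set[Variant] = set()
        for sample in batch:
            # Hypothetical caller: process the sample with the current graph.
            new_calls |= call_variants(current, sample)
        graphs.append(frozenset(current | new_calls))
    return graphs
```

Each graph in the returned list would then be evaluated on the same held-out benchmarking set, so that differences between iterations reflect the graph and not the test samples.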
Fig. 2. Population-specific graph construction summary.
A Nucleotide diversity and divergence with respect to GRCh38 linear reference for each super-population in the 1000 Genomes dataset: African ancestry (AFR), American ancestry (AMR), South-Asian ancestry (SAS), East-Asian ancestry (EAS), European ancestry (EUR). B True positive (TPR) and false positive (FPR) rates in the constructed graph references as a function of number of samples used in construction for homogeneous (solid lines) and expected (dashed lines) sampling for super-populations; AFR (blue), AMR (orange), EAS (green), EUR (red), SAS (purple). C Overview of the graph construction method. D Summary statistics for Pan-African graphs constructed at each iteration of the workflow shown in Fig. 1B. Source data are provided as a Source Data file.
Fig. 3. Alignment metrics for BWA (red), Pan-Genome (blue), and Pan-African Iterations (green).
Rate of unmapped (A), improper (B), multi-mapped (MAPQ = 0) (C), uninformative (MAPQ < 20) (D), and informative reads (MAPQ ≥ 20) (E). F Alignment error rate. Error rate is the ratio of mismatches to aligned bases in read alignments with respect to the reference. Two-sided Wilcoxon tests between consecutive distributions are performed. In all cases, except for one (uninformative reads between iterations 2 and 3), the difference is significant (p < 10⁻³). G Total number of variants in graph (solid bars) and per sample mean of number of used variants/edges in alignment (dashed bars). Magenta line shows the ratio of used variants to the graph size. H Categorization of variant utilization in alignment with respect to the number of samples: 0% (pink), below 50% (purple), above 50% (green), 100% (yellow). Source data are provided as a Source Data file.
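The read categories and error rate in the Fig. 3 caption can be computed along these lines. This is a minimal sketch over plain records, not the authors' code: the `Read` fields and the `alignment_metrics` function are illustrative, and the choice of total reads as the denominator for every rate is an assumption, since the caption does not state it. With the caption's thresholds, multi-mapped reads (MAPQ = 0) are also counted among the uninformative reads (MAPQ < 20).

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Read:
    """Illustrative per-read summary; field names are assumptions."""
    mapped: bool
    proper_pair: bool
    mapq: int
    mismatches: int
    aligned_bases: int


def alignment_metrics(reads: Iterable[Read]) -> Dict[str, float]:
    """Per-sample rates over all reads, plus mismatches per aligned base."""
    reads = list(reads)
    total = len(reads)
    if total == 0:
        raise ValueError("no reads given")
    mapped = [r for r in reads if r.mapped]
    aligned = sum(r.aligned_bases for r in mapped)
    mismatches = sum(r.mismatches for r in mapped)
    return {
        "unmapped": (total - len(mapped)) / total,
        # Assumption: improper pairs are counted among mapped reads only.
        "improper": sum(1 for r in mapped if not r.proper_pair) / total,
        "multi_mapped": sum(1 for r in mapped if r.mapq == 0) / total,
        "uninformative": sum(1 for r in mapped if r.mapq < 20) / total,
        "informative": sum(1 for r in mapped if r.mapq >= 20) / total,
        # Error rate as defined in the caption: mismatches / aligned bases.
        "error_rate": mismatches / aligned if aligned else 0.0,
    }
```

In practice the per-read fields would come from a BAM file (flags, MAPQ, and a mismatch count such as the NM tag), but that extraction step is left out here to keep the sketch dependency-free.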
Fig. 4. Variant calling results for BWA+GATK (red), Pan-Genome (blue), and Pan-African Iterations (green).
A Sample distribution of SNP counts, B cumulative AF distribution of SNPs separated into shared variants (solid lines), unique variants (dashed lines), and common variants with allele frequency difference (dotted lines), C INDEL counts, D cumulative AF distribution of INDELs separated into shared variants (solid lines), unique variants (dashed lines) and common variants with allele frequency difference (dotted lines), E structural variant (SV) counts, F size distribution of detected SVs, and G percentage of loci called by the graph pipeline for the variants rescued in the traditional joint calling (results are split based on the filtration output of VQSR). Two-sided Wilcoxon tests between consecutive distributions are performed for A, C, and E. In all cases, the difference is significant (p < 10⁻²¹). Source data are provided as a Source Data file.
