Bioinformatics. 2023 Jan 1;39(1):btac729.
doi: 10.1093/bioinformatics/btac729.

CAPG: comprehensive allopolyploid genotyper

Roshan Kulkarni et al. Bioinformatics. 2023.

Abstract

Motivation: Genotyping by sequencing is a powerful tool for investigating genetic variation in plants, but many economically important plants are allopolyploids, where homoeologous similarity obscures the subgenomic origin of reads and confounds allelic and homoeologous SNPs. Recent polyploid genotyping methods use allelic frequencies, rate of heterozygosity, parental cross or other information to resolve read assignment, but good subgenomic references offer the most direct information. The typical strategy aligns reads to the joint reference, performs diploid genotyping within each subgenome, and filters the results, but persistent read misassignment results in an excess of false heterozygous calls.

Results: We introduce the Comprehensive Allopolyploid Genotyper (CAPG), which formulates an explicit likelihood to weight read alignments against both subgenomic references and genotype individual allopolyploids from whole-genome resequencing data. We demonstrate CAPG in allotetraploids, where it performs better than Genome Analysis Toolkit's HaplotypeCaller applied to reads aligned to the combined subgenomic references.
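To make the idea of weighting reads against both subgenomes concrete, here is a minimal toy sketch. It is not CAPG's likelihood: it assumes each read contributes a single base at the site, that a per-read weight `w_a` (the probability the read originates from subgenome A) is already available from the alignments, that the base error rate is constant, and that the prior over genotypes is uniform. All function and parameter names are hypothetical.

```python
import math
from itertools import combinations_with_replacement, product


def allele_prob(base, allele, err):
    """P(observed base | true allele) under a uniform substitution error model."""
    return 1.0 - err if base == allele else err / 3.0


def diploid_prob(base, genotype, err):
    """P(observed base | diploid genotype), sampling either allele with probability 1/2."""
    return 0.5 * sum(allele_prob(base, a, err) for a in genotype)


def genotype_posterior(reads, alleles="AG", err=0.01):
    """Posterior over allotetraploid genotypes (g_A, g_B) at one site.

    reads: list of (base, w_a) pairs, where w_a is the ASSUMED probability
    that the read came from subgenome A (1 - w_a for subgenome B).
    A uniform prior over genotype pairs is assumed.
    """
    diploid = list(combinations_with_replacement(alleles, 2))
    log_like = {}
    for g_a, g_b in product(diploid, repeat=2):
        ll = 0.0
        for base, w_a in reads:
            # Mixture over the two possible subgenomic origins of the read.
            p = (w_a * diploid_prob(base, g_a, err)
                 + (1.0 - w_a) * diploid_prob(base, g_b, err))
            ll += math.log(p)
        log_like[("".join(g_a), "".join(g_b))] = ll
    # Normalize in log space for numerical stability.
    m = max(log_like.values())
    unnorm = {g: math.exp(ll - m) for g, ll in log_like.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


# Reads that mostly align to A show both alleles; reads that mostly align to B show only G.
reads = [("G", 0.9)] * 5 + [("A", 0.9)] * 5 + [("G", 0.1)] * 10
posterior = genotype_posterior(reads)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

Because each read is treated as a mixture over its two possible origins, a read that aligns ambiguously contributes evidence to both subgenomes instead of being forced into one, which is the kind of misassignment the abstract identifies as a source of false heterozygous calls.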

Availability and implementation: Code and tutorials are available at https://github.com/Kkulkarni1/CAPG.git.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Distinguishing homoeologous and allelic SNPs. Allotetraploid genomes for two individuals, subgenome A (left), subgenome B (right). Pink sites are homoeologous SNPs, different between and constant within subgenomes. Other colored sites are allelic SNPs, green in subgenome A, red in subgenome B. The dark green and brown sites are homozygous. The light green and red sites are heterozygous in one of the individuals (A color version of this figure appears in the online version of this article)
Fig. 2.
Genotyping an allelic SNP with CAPG. Blue: subgenome A reference sequence; Orange: subgenome B reference sequence; Pink: homoeologous SNPs at sites 3, 8 and 12; Green: segregating allele in subgenome A at site 6; Brown: segregating allele in subgenome B at site 11; Green ellipse identifies the site to genotype in this example, with genotype GA/GG yielding the highest posterior probability of 0.83 (A color version of this figure appears in the online version of this article)
Fig. 3.
PR curves for CAPG and GATK on simulated data. Performance of CAPG and GATK metrics in identifying (a) heterozygous sites in subgenome A, (b) allelic SNPs in subgenome A and (c) homoeologous SNPs when coverage c = 10 and homoeologous rate r_h = 0.007. Heterozygous data are subsampled with all true positive sites and 100 000 randomly sampled true negative sites. Circles/triangles mark the threshold value (0), which is a liberal choice (high recall, low precision; see also Supplementary Fig. S4) for genotyping heterozygotes and becomes progressively more conservative with sample size for SNP calling
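As a reminder of how a precision-recall (PR) curve like the one in Fig. 3 is traced from a per-site metric, the following is a minimal sketch. It assumes a site is called positive when its metric is at or above the threshold; the scores and truth labels are made-up illustrative values, not data from the paper, and the function name is hypothetical.

```python
def precision_recall_points(scores, labels):
    """Return (threshold, precision, recall) triples, one per distinct score.

    A site is called positive when score >= threshold (an assumed convention).
    labels are 1 for true positives in the ground truth and 0 otherwise.
    """
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        precision = tp / (tp + fp)  # tp + fp >= 1 because t is one of the scores
        recall = tp / total_pos if total_pos else 0.0
        points.append((t, precision, recall))
    return points


# Illustrative metric values and truth labels (hypothetical).
scores = [3.0, 1.5, 0.2, -0.5, -2.0]
labels = [1, 1, 0, 1, 0]
for t, p, r in precision_recall_points(scores, labels):
    print(f"threshold={t:+.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold can only add calls, so recall never decreases while precision may fall; a threshold that is liberal for one task therefore sits toward the high-recall end of the curve.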
Fig. 4.
Comparing CAPG and GATK metrics in simulation. Scatter plot of heterozygosity, allelic SNP and homoeologous SNP metrics on simulated data with coverage c = 40 and homoeologous rate r_h = 0.007. Homoeologous metrics (Eq. (6) and Supplementary Eq. (S5)) are transformed via the Box–Cox transformation $\frac{(x+0.1)^{\lambda}-1}{\lambda}$, $\lambda = 0.2$, to avoid overplotting at upper right, and the stack of points on the left represents a transformation of CAPG metric value (see Section 3). There remain overplotted true homoeologous SNPs, but all true negatives (green) are visible after transformation. In addition to subsampling done for heterozygosity (see Fig. 3), we further subsample to avoid excess overplotting, keeping all points with CAPG metric > 0 for heterozygosity or finite for homoeologous SNPs and subsampling 10% of all other points
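The plotting transformation named in the Fig. 4 caption can be sketched as follows, assuming it is the standard Box–Cox form applied to the metric after a shift of 0.1, with lambda = 0.2. The function name and the handling of out-of-range inputs are illustrative choices, not taken from the paper.

```python
import math


def shifted_box_cox(x, lam=0.2, shift=0.1):
    """Box-Cox transform of (x + shift): ((x + shift)**lam - 1) / lam.

    Requires x + shift > 0; lam == 0 falls back to the log limit.
    """
    y = x + shift
    if y <= 0:
        raise ValueError("x + shift must be positive for the Box-Cox transform")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


# The transform is monotone and compresses large values, spreading out points
# that would otherwise pile up at the top of the scale.
for x in (0.0, 0.9, 10.0, 1000.0):
    print(x, round(shifted_box_cox(x), 3))
```

Because the transform is strictly increasing on its domain, it changes only the spacing of points on the axis, not their ordering.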
Fig. 5.
Comparing CAPG and GATK metrics in real peanut data. Scatter plot of metrics for heterozygosity, allelic and homoeologous SNPs. We examine alignments to confirm (True: pink boxed ‘x’), reject (False: red plus) or fail to resolve (Unclear: gray star) a small selection of sites. Otherwise, the heterozygosity status is unknown (gray circle), but we indicate if there is a subgenomic reference nucleotide match (green triangle) or mismatch (purple square) in the allelic and homoeologous facets. After including all hand-verified sites, a stratified sample was taken to over-sample likely heterozygous calls by either method, so 50% of sampled sites have CAPG or GATK metrics above the 99.5th percentile. For allelic SNP metrics, we sampled 25% of sites with subgenomic mismatch, 25% of sites with either CAPG or GATK metric above the 99.5th percentile, 25% of sites with subgenomic match and 25% with both metrics below the 99.5th percentile. For homoeologous SNP metrics, we sampled 50% of sites with subgenomic mismatch and 50% with subgenomic match and low metrics by CAPG and GATK. For an unbiased view of the metrics, see Supplementary Figures S7–S9
