Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2012;7(2):e30906.
doi: 10.1371/journal.pone.0030906. Epub 2012 Feb 17.

Efficient exact maximum a posteriori computation for bayesian SNP genotyping in polyploids

Affiliations

Efficient exact maximum a posteriori computation for bayesian SNP genotyping in polyploids

Oliver Serang et al. PLoS One. 2012.

Abstract

The problem of genotyping polyploids is extremely important for the creation of genetic maps and assembly of complex plant genomes. Despite its significance, polyploid genotyping still remains largely unsolved and suffers from a lack of statistical formality. In this paper a graphical bayesian model for SNP genotyping data is introduced. This model can infer genotypes even when the ploidy of the population is unknown. We also introduce an algorithm for finding the exact maximum a posteriori genotype configuration with this model. This algorithm is implemented in a freely available web-based software package SuperMASSA. We demonstrate the utility, efficiency, and flexibility of the model and algorithm by applying them to two different platforms, each of which is applied to a polyploid data set: Illumina GoldenGate data from potato and Sequenom MassARRAY data from sugarcane. Our method achieves state-of-the-art performance on both data sets and can be trivially adapted to use models that utilize prior information about any platform or species.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1
Figure 1. Raw data.
The scatter plot of allele intensities for PotSNP034 (A), tetraploid, and SugSNP225 (B), which has an unknown ploidy.
Figure 2
Figure 2. A Graphical View of SNP Genotyping.
Two models for SNP genotyping are presented. Variables are shown as nodes and solid arrows depict dependencies between variables. The observed data formula image depend on the genotypes of all individuals formula image. In both models the distribution of genotypes formula image is determined by the genotype configuration formula image. Also, in both models the probability of a genotype distribution depends on formula image, the distribution of genotypes in the population. Furthermore, both models use the same method to compute the probability of the data given the genotype configuration. Lastly, the possible genotypes depend on the ploidy formula image. (A) In the Hardy-Weinberg model, the distribution of genotypes in the population is determined by one of the allele frequencies formula image. (B) In the formula image model, the distribution of genotypes in the population depends on the parent genotypes formula image and formula image. The dashed arrows and nodes (formula image and formula image) depict optional dependencies and variables; these variables and dependencies exist only when data from the parents is included.
Figure 3
Figure 3. Illustration of Exact Inference.
Exact MAP computation can be performed by enumerating all possible genotype configurations. Because each individual's genotype is among formula image, searching through genotype configurations can be viewed as a tree in which each individual genotype assignment branches into formula image separate outcomes. (A) A naive search progresses downward through the tree and chooses the series of genotype assignments that lead to the highest posterior probability. A naive branch and bound method derived from this tree bounds genotype configurations for which the prefix determines that all subsequent paths are poor. (B) A multinomial graph ( i.e. the subset graph of the power-set of G) merges outcomes that result in the same genotype counts formula image. Multiple paths (from the top) can lead to any given set of genotype counts; therefore, dynamic programming is used. Given the layer above, each node can compute the most likely path from the top that leads to it. Once the most likely path and score are computed for each node in a layer, the next layer can progress. At the bottom layer, the node with the highest combined likelihood formula image (computed via dynamic programming) and formula image (the same for any path terminating at the node) maximizes the posterior probability. As in the naive tree, once all lower adjacent nodes in a subtree are provably suboptimal, then the subtree can be bounded. The dynamic programming approach is substantially more time space-efficient than the naive approach.
Figure 4
Figure 4. Illustration of a Suboptimal Genotype Configuration.
The essential motivation behind the geometric branch and bound is demonstrated. The top figure shows the original data and the bottom figure shows the data after being normalized to formula image and formula image within the likelihood function formula image. The two figures on the left correspond to a suboptimal genotype configuration. In the figures on the right, a pair of “blue” and “red” points (highlighted) are switched to the opposite class. After swapping the categories, the numbers of individuals with each genotype formula image do not change, but the total distance between these two points and their classes decreases. Decreasing this distance increases the likelihood while holding formula image constant. Thus the joint probability formula image. Because the MAP configuration cannot be improved by any such swaps, it must correspond to contiguous groups of class assignments along the normalized axis. Searching only the configurations that result in contiguous class assignments dramatically narrows the search space and makes inference computationally feasible where it wouldn't be with the dynamic programming branch and bound method.
Figure 5
Figure 5. Runtime of Exact MAP Computation with Dynamic Programming and Geometric Branch and Bound.
Both methods were run and timed while solving the same MAP inference problem. The y-axis plots the log of the runtime in seconds and the x-axis plots either the ploidy or the parameter formula image. Both methods were implemented in Python and run on sugarcane locus SugSNP225 and timed using user time. Parental data was not used. For this data set, the optimal ploidy and formula image values are formula image. In the figure on the left, formula image is held constant at formula image and the ploidy is varied from 2 to 10. In the figure on the right, the ploidy is fixed at 10 and formula image is doubled successively from formula image to formula image. In both figures, the dynamic programming branch and bound series is incomplete because the method was terminated after using more than 3 GB of RAM. In comparison, the geometric branch and bound method never used more than 50 MB. The growing gap between the methods indicates a superpolynomial speedup, especially when larger ploidies and larger values of formula image are used. For very low formula image values, the dynamic programming method is sometimes slightly faster due to decreased overhead.
Figure 6
Figure 6. SuperMASSA Output on Potato Loci.
For each potato locus (and the ploidy of the species analyzed), we show the output of SuperMASSA (using the –save_figures option). The first column shows the annotated scatter plot and the second column shows the theoretical distribution of genotypes in the population and the distribution of individuals assigned to each genotype. For both loci (in both the diploids and tetraploids), the genotype annotations are extremely close to the predicted angles for each assigned genotype and the genotype distributions are nearly identical to the theoretical distribution in a Hardy-Weinberg population using the MAP estimate for the parameter formula image.
Figure 7
Figure 7. SuperMASSA Output on Sugarcane Loci.
For each sugarcane locus, we show the output of SuperMASSA (using the –save_figures option). The first column shows the annotated scatter plot and the second column shows the theoretical distribution of genotypes in the population and the distribution of individuals assigned to each genotype. In all three loci, the MAP configuration simultaneously finds the ploidy, a set of parents with that ploidy, and genotype assignments with tight clusters that produce nearly identical theoretical genotype distributions and genotype distributions.

Similar articles

Cited by

References

    1. Hieter P, Griffiths T. Polyploidy–More Is More or Less. Science. 1999;285:210–211. - PubMed
    1. Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA. 1987;84:2363–2367. - PMC - PubMed
    1. Lander ES, Botstein D. Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics. 1989;121:185–199. - PMC - PubMed
    1. Zeng ZB, Kao C, Basten CJ. Estimating the genetic architecture of quantitative traits. Genetical Research. 1999;74:279–289. - PubMed
    1. Lewin HA, Larkin DM, Pontius J. Every genome sequence needs a good map. Genome Research. 2009;19:1925–1928. - PMC - PubMed

Publication types