Stat Appl Genet Mol Biol. 2012 Jan 6;11(2):10.2202/1544-6115.1713.
doi: 10.2202/1544-6115.1713.

A family-based probabilistic method for capturing de novo mutations from high-throughput short-read sequencing data

Reed A Cartwright et al. Stat Appl Genet Mol Biol. 2012.

Abstract

Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
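The Mendelian-violation signal mentioned in the abstract can be illustrated with a minimal sketch. It assumes genotypes are already known with certainty and are written as two-letter strings; the function name `is_mendelian` is hypothetical. The method described in the abstract works probabilistically on read data rather than on hard genotype calls, so this only shows the underlying inheritance rule.

```python
from itertools import product


def is_mendelian(mother: str, father: str, child: str) -> bool:
    """Return True if the child's unordered genotype can be formed by
    taking one allele from the mother and one from the father."""
    target = sorted(child)
    return any(sorted((m, f)) == target for m, f in product(mother, father))


# A child carrying a C that neither AA parent has violates Mendelian
# inheritance, which makes the site a de novo candidate.
print(is_mendelian("AA", "AA", "AC"))  # False
# If the mother is actually AC, the same child genotype is consistent.
print(is_mendelian("AC", "AA", "AC"))  # True
```

A hard check like this is sensitive to genotyping mistakes in any family member, which is the motivation for inferring genotypes jointly across the pedigree.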

Figures

Figure 1
Under-calling heterozygous genotypes affects de novo detection at a given site. In the top panel, the sample-independent approach calls the mother’s genotype AA, because the binomial probability of sampling the C allele only once among 20 reads is very small if the mother is heterozygous. (Nielsen et al., 2011, suggest calling a site homozygous if the minor allele accounts for less than 20% of the reads, a rule which we adopt for these examples.) When the family data are considered jointly, identifying a C in the child increases the probability of the AC genotype for the mother, leading to a low probability of de novo mutation at this site. (It is much more likely that the mother’s chromosomes were sampled unevenly, ≈10⁻⁵, than that there is an actual mutation at the site, ≈10⁻⁸.) In the bottom panel, the child’s genotype is called AA. However, given an error rate and the parental coverage, the probability of a de novo mutation at this site is high. The de novo mutation probabilities were computed using the method described here with the following parameters: θ = 0.001, ε = 0.005, μ = μ_s = 2 × 10⁻⁸. (See section 2.1 for a description of these parameters.)
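The order-of-magnitude comparison in the Figure 1 caption can be reproduced with a short calculation. This is a simplified sketch, not the paper's full model: it ignores sequencing error and population priors, assumes each read from a heterozygous mother shows either allele with probability 0.5, and uses the hypothetical helper names `binomial_pmf` and `naive_call`.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def naive_call(minor_reads: int, depth: int, cutoff: float = 0.20) -> str:
    """Sample-independent rule from the caption: call the site homozygous
    when the minor-allele read fraction is below the cutoff."""
    return "homozygous" if minor_reads / depth < cutoff else "heterozygous"


# Mother: 1 C read out of 20. The 20% rule calls her homozygous.
print(naive_call(1, 20))

# Chance of seeing the C allele exactly once in 20 reads if she is truly AC.
p_uneven = binomial_pmf(1, 20, 0.5)
print(f"{p_uneven:.2e}")  # 1.91e-05

# Mutation rate from the caption (mu = 2e-8), for comparison.
mu = 2e-8
print(p_uneven > mu)  # True
```

Under these assumptions uneven sampling is roughly three orders of magnitude more probable than a new mutation, which is why the joint analysis favors an AC mother.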
Figure 2
Trio model for a single site. Nucleotide bases from the sequencing reads aligning to the site of interest (R = {R_M, R_F, R_O}) are the observed data. Neither the individual genotypes nor their transmission pattern is observed; they are the hidden data. Double lines denote the transmission/sampling of diploid genotypes, while single lines denote that of haploid genotypes. The parental zygotic genotypes (m = m_a m_b and f = f_a f_b) are sampled from a common population and are the founding alleles of the pedigree. Wavy lines denote where sequencing takes place. The parameters in the model have been placed in proximity to the branches that they affect.
Figure 3
Transition points in model inferences. (A) transition from error to de novo mutation, (B) effect of depth on error-mutation transition, (C) effect of low parent depth, (D) effect of low offspring depth, (E) transition from de novo mutation to inherited allele, (F) transition from de novo mutation to inferred inheritance. The parameter values used are θ = 0.001, ε = 0.01, μ = 2 × 10⁻⁷, μ_s = 0.0. The read structures are given in the tables; each row represents the read data taken for a family member, with the offspring data (O) as the bottom row, and the columns represent the number of reads observed for each allele. ‘G’ and ‘T’ read counts are always 0.
Figure 4
Rate estimates from repeated trio simulations. Ten simulation replicates of trios with 20× coverage were generated. Parameters were estimated for each replicate. ‘Simulated’ rates were calculated from the full data, and ‘estimated’ rates from the observable data. Simulated rates differ from the simulation parameters only because of the stochastic nature of the simulations. Both sets are plotted relative to the simulation parameters: μ = 10⁻⁶, ε = 0.01007, and θ = 0.001.
Figure 5
Trio simulation ROC curves. For each simulated set, the corresponding ROC curve and AUC value are presented based on all calls with δ ≥ 0.01. The parameter values used in simulations are θ = 0.001, ε = 0.01, μ = 1 × 10⁻⁶, and μ_s = 0.0. The dashed line along the diagonal represents the expected ROC curve for a random, uninformative classifier. The curve for a perfect classifier goes straight up and then straight across. The “-Ctrl” results are for simulations in which all reads are perfectly aligned back to the reference genome.
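How a ROC curve and its AUC follow from scored calls can be sketched as below. This is a generic illustration, not the authors' evaluation code: the function names `roc_points` and `auc` are hypothetical, the scores stand in for per-site values such as δ, and the toy inputs are invented for the example.

```python
def roc_points(scores, labels):
    """Sweep a threshold from high to low and return (FPR, TPR) points.
    labels are 1 for true mutations and 0 otherwise; tied scores are
    processed together so they produce a single point."""
    pairs = sorted(zip(scores, labels), key=lambda x: -x[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == s:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented toy data: a perfect ranking, then an imperfect one.
print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])))  # 1.0
print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])))  # 0.75
```

A classifier that gives every site the same score yields the single diagonal segment from (0, 0) to (1, 1) and an AUC of 0.5, which corresponds to the dashed line in the figure.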
Figure 6
De novo mutation predictions from trio simulations. In the simulations, 259 sites contained de novo mutations (gray line). For each simulation, the triangles give the total number of mutation calls at three different levels of δ, and the circles indicate the number of true positives at each δ threshold. The distances between the triangles and circles represent the number of false positives, and the distances between the circles and the gray line give the number of false negatives. The “-Ctrl” columns contain results for simulations in which all reads are perfectly aligned back to the reference genome.
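The quantities plotted in Figure 6 can be computed from per-site scores and a set of known mutation sites, as in this minimal sketch. It assumes δ is the per-site score reported by the method (the abstract describes a posterior probability of de novo mutation); the function name `confusion_counts`, the site identifiers, and the scores are all hypothetical and are not taken from the paper.

```python
def confusion_counts(scores, true_sites, threshold):
    """Count calls, true positives, false positives and false negatives
    when every site with score >= threshold is called a mutation."""
    called = {site for site, delta in scores.items() if delta >= threshold}
    return {
        "calls": len(called),               # triangles in the figure
        "tp": len(called & true_sites),     # circles in the figure
        "fp": len(called - true_sites),     # gap between triangle and circle
        "fn": len(true_sites - called),     # gap between circle and truth line
    }


# Invented toy data: four scored sites, three of which are real mutations.
scores = {"s1": 0.99, "s2": 0.60, "s3": 0.05, "s4": 0.95}
true_sites = {"s1", "s2", "s3"}

for threshold in (0.01, 0.5, 0.9):
    print(threshold, confusion_counts(scores, true_sites, threshold))
```

Raising the threshold reduces the number of calls, which typically trades false positives for false negatives.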
Figure 7
Validation results stratified by de novo calling algorithm. Three methods were used to produce candidate sites of de novo mutation in two families (Conrad et al., 2011; Durbin et al., 2010). These candidates were then experimentally validated and classified as either “true mutations” (germline, somatic, or cell-line), “false mutations” (inherited genetic variation or no variation at all), or “inconclusive”. Sites not reported by a method but reported by another method appear in the No Call bar. FPIR corresponds to the method described in this paper. Modified from Conrad et al. (2011).
Figure A.1
Possible genealogical trees for a sample of four alleles from a single population. The times of coalescent events, t_i, are expressed in units of 2N_e generations, where N_e is the effective population size. Type I occurs twice as often as Type II.

References

    1. Awadalla P, et al. Direct measure of the de novo mutation rate in autism and schizophrenia cohorts. The American Journal of Human Genetics. 2010;87:316–324.
    2. Conrad D, et al. Variation in genome-wide mutation rates within and between human families. Nature Genetics. 2011;43:712–714.
    3. Dempster A, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological). 1977;39:1–38.
    4. Durbin R, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
    5. Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981;17:368–376.
