J Evol Biol. 2012 May;25(5):949-60.
doi: 10.1111/j.1420-9101.2012.02488.x. Epub 2012 Mar 16.

How to infer reliable diploid genotypes from NGS or traditional sequence data: from basic probability to experimental optimization

Free article

A Chenuil. J Evol Biol. 2012 May.

Abstract

The use of diploid sequence markers is still challenging despite the good quality of the information they provide. All sequencing approaches [traditional cloning and sequencing of PCR amplicons as well as next-generation sequencing (NGS)] share a common problem: when no variation is found among the sequences from a given individual, homozygosity can never be asserted with certainty. As a consequence, sequence data from diploid markers are mostly analysed at the population level rather than the individual level, particularly in animal studies. This study aims to help solve this problem. Using Bayes' theorem and the binomial law, useful results are derived, among which: (i) the number of sequence reads per individual (or sequencing depth) required to ensure, at a given probability threshold, that heterozygotes are not erroneously scored as homozygotes, as a function of the observed heterozygosity (H(o)) of the locus in the population; (ii) a way of estimating H(o) from low-coverage NGS data; (iii) a way of testing the null hypothesis that a genetic marker corresponds to a single diploid locus, in the absence of data from controlled crosses; (iv) strategies for characterizing sequence genotypes in populations that minimize the average number of sequence reads per individual; (v) a rationale for deciding which variations along the sequence need to be considered, as a function of the affordable sequencing depth, the desired level of polymorphism and the risk of sequencing error. For traditional sequencing technology, optimal strategies appear surprisingly different from the usual empirical ones. The average number of sequence reads required to obtain 99% of fully determined genotypes never exceeds six; this maximum corresponds to the worst case, when H(o) equals 0.6. This threshold value of H(o) is strikingly stable when the tolerated proportion of incompletely resolved genotypes varies within a reasonable range. These results do not rely on the Hardy-Weinberg equilibrium assumption or on diallelism of nucleotide sites.
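
The kind of reasoning behind results (i) and (ii) can be illustrated with a minimal Python sketch. It is not taken from the article. It assumes that each read samples either allele of a heterozygote with probability 1/2, that reads are independent, and that there is no sequencing error. The function names (`prob_het_looks_homozygous`, `posterior_het_given_identical`, `min_reads`, `corrected_ho`) are hypothetical, and the article's own formulas and optimized strategies may differ.

```python
def prob_het_looks_homozygous(n_reads: int) -> float:
    """Binomial step: chance that n reads from a heterozygote all show the
    same allele (assumes equal, independent sampling and no read errors)."""
    if n_reads < 1:
        raise ValueError("n_reads must be at least 1")
    # Either allele can be the one seen n times: 2 * (1/2)**n.
    return 0.5 ** (n_reads - 1)


def posterior_het_given_identical(ho: float, n_reads: int) -> float:
    """Bayes step: probability that an individual is heterozygous given that
    all n reads are identical, with prior P(heterozygote) = ho."""
    if not 0.0 <= ho <= 1.0:
        raise ValueError("ho must lie in [0, 1]")
    missed = ho * prob_het_looks_homozygous(n_reads)
    # A true homozygote always yields identical reads under these assumptions.
    return missed / (missed + (1.0 - ho))


def min_reads(ho: float, threshold: float, max_reads: int = 1000) -> int:
    """Smallest depth at which an apparent homozygote is a hidden
    heterozygote with probability at most `threshold`."""
    for n in range(1, max_reads + 1):
        if posterior_het_given_identical(ho, n) <= threshold:
            return n
    raise ValueError("threshold not reachable within max_reads")


def corrected_ho(apparent_het_fraction: float, n_reads: int) -> float:
    """Rescale the fraction of individuals that look heterozygous at depth n
    by the probability of detecting a heterozygote at that depth."""
    detect = 1.0 - prob_het_looks_homozygous(n_reads)
    if detect == 0.0:
        raise ValueError("a single read cannot reveal heterozygosity")
    return min(1.0, apparent_het_fraction / detect)


if __name__ == "__main__":
    for ho in (0.1, 0.3, 0.6):
        print(ho, min_reads(ho, threshold=0.01))
```

Note that `min_reads` gives a fixed depth applied to every individual. The averages quoted in the abstract refer to strategies that minimize the mean number of reads per individual, so the two quantities are not directly comparable.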
