G3 (Bethesda). 2019 May 7;9(5):1571-1579.
doi: 10.1534/g3.119.400165.

Cleaning Genotype Data from Diversity Outbred Mice

Karl W Broman et al. G3 (Bethesda). 2019.

Abstract

Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.

Keywords: MPP; Multiparent Advanced Generation Inter-Cross (MAGIC); QTL; data cleaning; data diagnostics; multiparental populations; quantitative trait loci.
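The two sample-level diagnostics named in the abstract (the proportion of missing genotypes per mouse, and the proportion of matching SNP genotypes between pairs of mice) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the genotype coding (0/1/2 for called genotypes, -1 for missing), the function names, the toy data, and the 0.9 matching cutoff are all assumptions made for the example. The 20% missing-data cutoff echoes the threshold used to label samples in the figures.

```python
import numpy as np

MISSING = -1  # assumed code for a missing genotype call


def missing_proportion(geno):
    """Proportion of missing calls per sample (rows = mice, columns = SNPs)."""
    return np.mean(geno == MISSING, axis=1)


def matching_proportion(geno):
    """Proportion of matching genotypes for each pair of samples,
    computed only over markers that are called in both samples."""
    n = geno.shape[0]
    called = geno != MISSING
    prop = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i + 1, n):
            both = called[i] & called[j]
            if both.any():
                prop[i, j] = prop[j, i] = np.mean(geno[i, both] == geno[j, both])
    return prop


def flag_low_quality(miss, threshold):
    """Indices of samples whose missing proportion exceeds the threshold."""
    return np.flatnonzero(miss > threshold)


def find_duplicates(match, threshold):
    """Pairs (i, j), i < j, whose matching proportion exceeds the threshold."""
    n = match.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if match[i, j] > threshold]


# Toy data: 4 hypothetical mice, 6 hypothetical SNPs.
geno = np.array([
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],      # identical to sample 0: a candidate duplicate
    [2, 1, 0, -1, -1, -1],   # half of the calls are missing
    [1, 0, 2, 2, 0, 1],
])

miss = missing_proportion(geno)
match = matching_proportion(geno)
low_quality = flag_low_quality(miss, 0.20)
duplicates = find_duplicates(match, 0.90)
```

On real array data the matching proportion between unrelated mice is well above zero, so a duplicate cutoff is best chosen by inspecting the distribution of pairwise values rather than fixed in advance.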

Figures

Figure 1
Percent missing genotypes by mouse. The nine mice with >20% missing genotypes are labeled with their sample identifiers.
Figure 2
Average SNP microarray intensity for markers on the Y chromosome vs. that for markers on the X chromosome, for each mouse. Mice that were nominally male are in purple, while females are in green. Samples with >20% missing genotypes are labeled in orange. Two other samples of interest are labeled in black: F386, which appears to be XO, and M377, which was nominally male but appears to be XX.
Figure 3
Distribution of array intensities after a log10(x+1) transformation. A: Kernel density estimates of the array intensity distribution. Samples with >20% missing genotype data are in orange; samples with 9–20% missing genotype data are in pink; samples with 2–5% missing genotype data are in blue; the remaining samples are in gray. B: Scatterplot of the 1st percentile vs. the 99th percentile.
Figure 4
SNP genotype frequencies by mouse, for SNPs split by their minor allele frequency (MAF) in the eight founder strains. Trinomial probabilities are represented by points in an equilateral triangle using the distances to the three sides. Pink points indicate the expected distributions.
Figure 5
Estimated number of crossovers in each mouse. Colors of the points indicate the two groups of DO mice (generations 8 and 11). Mice with >20% missing genotypes are excluded.
Figure 6
Estimated percent genotyping errors for each mouse. The rates are very small; the median is just 7.8 in 10,000.
Figure 7
Estimated percent genotyping errors vs. percent missing genotypes by marker. Errors are defined by a genotyping error LOD score > 2. The vast majority of markers showed no apparent errors.
Figure 8
SNP genotype frequencies by marker, with SNPs split by their minor allele frequency (MAF) in the eight founder strains. Trinomial probabilities are represented by points in an equilateral triangle using the distances to the three sides. Pink points indicate the expected distributions.
Figure 9
Allele intensity plots for four SNPs. In the left panels, points are colored according to the genotype calls, with yellow and blue being the two homozygotes and green being the heterozygote; gray points were not called. In the right panels, points are colored by the inferred SNP genotypes, given the multipoint marker data and the founders’ genotypes; gray points could not be inferred.
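
The captions for Figures 4 and 8 describe trinomial genotype frequencies drawn as points in an equilateral triangle, using the distances to the three sides. A minimal sketch of that mapping follows, assuming a triangle of height 1 so that the three distances sum to 1 for any interior point. The vertex placement (which genotype sits at which corner) and the function names are assumptions made for illustration and may not match the published figures.

```python
import math

SIDE = 2 / math.sqrt(3)  # side length of an equilateral triangle with height 1

# Assumed layout: homozygotes at the base corners, heterozygote at the top.
VERTS = {"AA": (0.0, 0.0), "AB": (SIDE / 2, 1.0), "BB": (SIDE, 0.0)}


def ternary_xy(p_aa, p_ab, p_bb):
    """Map trinomial probabilities to (x, y) inside the triangle.

    The point is the probability-weighted average of the three vertices,
    so its distance to the side opposite each vertex equals that
    genotype's probability.
    """
    if not math.isclose(p_aa + p_ab + p_bb, 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    x = p_aa * VERTS["AA"][0] + p_ab * VERTS["AB"][0] + p_bb * VERTS["BB"][0]
    y = p_aa * VERTS["AA"][1] + p_ab * VERTS["AB"][1] + p_bb * VERTS["BB"][1]
    return x, y


def distance_to_side(point, a, b):
    """Perpendicular distance from a point to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
    return abs(cross) / math.hypot(bx - ax, by - ay)


# Example: genotype frequencies 1/4, 1/2, 1/4.
pt = ternary_xy(0.25, 0.5, 0.25)
d_aa = distance_to_side(pt, VERTS["AB"], VERTS["BB"])  # side opposite AA
d_ab = distance_to_side(pt, VERTS["AA"], VERTS["BB"])  # side opposite AB
d_bb = distance_to_side(pt, VERTS["AA"], VERTS["AB"])  # side opposite BB
```

Here `d_aa`, `d_ab`, and `d_bb` recover the three input probabilities, which is the property the captions rely on.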

