G3 (Bethesda). 2019 May 7;9(5):1571-1579.

doi: 10.1534/g3.119.400165.

Cleaning Genotype Data from Diversity Outbred Mice

Karl W Broman¹, Daniel M Gatti², Karen L Svenson², Śaunak Sen³, Gary A Churchill²

Affiliations

¹ Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin 53706 broman@wisc.edu.
² The Jackson Laboratory, Bar Harbor, Maine 04609.
³ Department of Preventive Medicine, University of Tennessee Health Sciences Center, Memphis, Tennessee 38163.

PMID: 30877082
PMCID: PMC6505173
DOI: 10.1534/g3.119.400165

Karl W Broman et al. G3 (Bethesda). 2019.

Abstract

Data cleaning is an important first step in most statistical analyses, including efforts to map the genetic loci that contribute to variation in quantitative traits. Here we illustrate approaches to quality control and cleaning of array-based genotyping data for multiparent populations (experimental crosses derived from more than two founder strains), using MegaMUGA array data from a set of 291 Diversity Outbred (DO) mice. Our approach employs data visualizations that can reveal problems at the level of individual mice or with individual SNP markers. We find that the proportion of missing genotypes for each mouse is an effective indicator of sample quality. We use microarray probe intensities for SNPs on the X and Y chromosomes to confirm the sex of each mouse, and we use the proportion of matching SNP genotypes between pairs of mice to detect sample duplicates. We use a hidden Markov model (HMM) reconstruction of the founder haplotype mosaic across each mouse genome to estimate the number of crossovers and to identify potential genotyping errors. To evaluate marker quality, we find that missing data and genotyping error rates are the most effective diagnostics. We also examine the SNP genotype frequencies with markers grouped according to their minor allele frequency in the founder strains. For markers with high apparent error rates, a scatterplot of the allele-specific probe intensities can reveal the underlying cause of incorrect genotype calls. The decision to include or exclude low-quality samples can have a significant impact on the mapping results for a given study. We find that the impact of low-quality markers on a given study is often minimal, but reporting problematic markers can improve the utility of the genotyping array across many studies.

Keywords: MPP; Multiparent Advanced Generation Inter-Cross (MAGIC); QTL; data cleaning; data diagnostics; multiparental populations; quantitative trait loci.
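Two of the sample-level diagnostics named in the abstract, the proportion of missing genotypes per mouse and the proportion of matching SNP genotypes between pairs of mice, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: genotype calls are assumed to be strings with `None` for a no-call, the function names are hypothetical, the 20% missing-data cutoff simply mirrors the labeling used in the figure captions, and the 0.95 matching cutoff is an arbitrary placeholder.

```python
from itertools import combinations

MISSING = None  # assumed encoding of a no-call


def prop_missing(genotypes):
    """Proportion of missing calls per sample.

    genotypes: dict mapping sample id -> list of calls ("AA", "AB", "BB" or None).
    """
    return {
        sample: sum(call is MISSING for call in calls) / len(calls)
        for sample, calls in genotypes.items()
    }


def prop_matching(calls_a, calls_b):
    """Proportion of identical calls among markers called in both samples."""
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a is not MISSING and b is not MISSING]
    if not both:
        return float("nan")  # no shared calls, so nothing to compare
    return sum(a == b for a, b in both) / len(both)


def flag_samples(genotypes, missing_cutoff=0.20, match_cutoff=0.95):
    """Return (low-quality samples, candidate duplicate pairs).

    Both cutoffs are illustrative assumptions, not values prescribed by the source.
    """
    missing = prop_missing(genotypes)
    low_quality = sorted(s for s, p in missing.items() if p >= missing_cutoff)
    duplicates = [
        (s, t)
        for s, t in combinations(sorted(genotypes), 2)
        if prop_matching(genotypes[s], genotypes[t]) >= match_cutoff
    ]
    return low_quality, duplicates


# Toy data: m1 and m2 are identical; m3 has 2 of 5 calls missing.
toy = {
    "m1": ["AA", "AB", "BB", "AA", "AB"],
    "m2": ["AA", "AB", "BB", "AA", "AB"],
    "m3": ["BB", None, None, "AB", "AA"],
}
low_quality, duplicates = flag_samples(toy)
```

On the toy data, `low_quality` contains only `m3` and `duplicates` contains only the pair `("m1", "m2")`. Matching is computed only over markers called in both samples, so a sample with many no-calls is compared on fewer markers and its matching proportions are correspondingly less reliable.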

Figures

**Figure 1**
Percent missing genotypes by mouse. The nine mice with $\geq$ 20% missing genotypes are labeled with their sample identifiers.

**Figure 2**
Average SNP microarray intensity for markers on the Y chromosome *vs.* that for markers on the X chromosome, for each mouse. Mice that were nominally male are in purple, while females are in green. Samples with $\geq$ 20% missing genotypes are labeled in orange. Two other samples of interest are labeled in black: F386 which appears to be XO, and M377 which was nominally male but appears to be XX.

**Figure 3**
Distribution of array intensities after a $\log_{10}(x + 1)$ transformation. A: Kernel density estimates of the array intensity distribution. Samples with $>$ 20% missing genotype data are in orange; samples with 9–20% missing genotype data are in pink; samples with 2–5% missing genotype data are in blue; the remaining samples are in gray. B: Scatterplot of the 1st percentile *vs.* the 99th percentile.

**Figure 4**
SNP genotype frequencies by mouse, for SNPs split by their minor allele frequency (MAF) in the eight founder strains. Trinomial probabilities are represented by points in an equilateral triangle using the distances to the three sides. Pink points indicate the expected distributions.

**Figure 5**
Estimated number of crossovers in each mouse. Colors of the points indicate the two groups of DO mice (generations 8 and 11). Mice with $\geq$ 20% missing genotypes are excluded.

**Figure 6**
Estimated percent genotyping errors for each mouse. The rates are very small; the median is just 7.8 in 10,000.

**Figure 7**
Estimated percent genotyping errors *vs.* percent missing genotypes by marker. Errors defined by genotyping error LOD score $>$ 2. The vast majority of markers showed no apparent errors.

**Figure 8**
SNP genotype frequencies by marker, with SNPs split by their minor allele frequency (MAF) in the eight founder strains. Trinomial probabilities are represented by points in an equilateral triangle using the distances to the three sides. Pink points indicate the expected distributions.

**Figure 9**
Allele intensity plots for four SNPs. In the left panels, points are colored according to the genotype calls, with yellow and blue being the two homozygotes and green being the heterozygote; gray points were not called. In the right panels, points are colored by the inferred SNP genotypes, given the multipoint marker data and the founders’ genotypes; gray points could not be inferred.
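The triangle representation described in the captions for Figures 4 and 8 relies on a standard geometric fact: in an equilateral triangle of height 1, the distances from any interior point to the three sides sum to 1, so a triple of genotype frequencies maps to exactly one point. The sketch below shows one way to compute plotting coordinates. The vertex placement and the function name `ternary_xy` are assumptions made for illustration; the source does not specify the orientation used in the figures.

```python
import math

# Side length of an equilateral triangle whose height is 1.
SIDE = 2 / math.sqrt(3)


def ternary_xy(p_aa, p_ab, p_bb):
    """Map trinomial frequencies to (x, y) inside an equilateral triangle.

    Assumed layout: AA vertex at (0, 0), BB vertex at (SIDE, 0),
    AB vertex at (SIDE / 2, 1). The distance from the returned point to the
    side opposite each vertex equals that genotype's frequency.
    """
    total = p_aa + p_ab + p_bb
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    # Normalize so that raw counts can be passed as well as proportions.
    p_aa, p_ab, p_bb = p_aa / total, p_ab / total, p_bb / total
    x = p_bb * SIDE + p_ab * SIDE / 2
    y = p_ab  # height above the base equals the heterozygote frequency
    return x, y


# A sample that is all heterozygous sits at the top vertex.
top = ternary_xy(0.0, 1.0, 0.0)
# Equal frequencies of the three genotypes sit at the centroid.
center = ternary_xy(1, 1, 1)
```

With this layout, the vertical coordinate is the heterozygote frequency, so samples or markers with excess heterozygosity appear higher in the triangle than the expected points.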
