Genome Res. 2022 Mar;32(3):569-582.
doi: 10.1101/gr.276013.121. Epub 2022 Jan 24.

Mitochondrial DNA variation across 56,434 individuals in gnomAD

Kristen M Laricchia et al. Genome Res. 2022 Mar.

Abstract

Genomic databases of allele frequency are extremely helpful for evaluating clinical variants of unknown significance; however, until now, databases such as the Genome Aggregation Database (gnomAD) have focused on nuclear DNA and have ignored the mitochondrial genome (mtDNA). Here, we present a pipeline to call mtDNA variants that addresses three technical challenges: (1) detecting homoplasmic and heteroplasmic variants, present, respectively, in all or a fraction of mtDNA molecules; (2) handling the circular mtDNA genome; and (3) avoiding misalignment of nuclear sequences of mitochondrial origin (NUMTs). We observed that mtDNA copy number per cell varied across gnomAD cohorts and influenced the fraction of NUMT-derived false-positive variant calls, which can account for the majority of putative heteroplasmies. To avoid false positives, we excluded contaminated samples, cell lines, and samples prone to NUMT misalignment due to few mtDNA copies. Furthermore, we report only variants with heteroplasmy ≥10%. We applied this pipeline to 56,434 whole-genome sequences in the gnomAD v3.1 database, which includes individuals of European (58%), African (25%), Latino (10%), and Asian (5%) ancestry. Our gnomAD v3.1 release contains population frequencies for 10,850 unique mtDNA variants at more than half of all mtDNA bases. Importantly, we report frequencies within each nuclear ancestral population and mitochondrial haplogroup. Homoplasmic variants account for most variant calls (98%) and unique variants (85%). We observed that 1/250 individuals carry a pathogenic mtDNA variant with heteroplasmy above 10%. These mtDNA population allele frequencies are freely accessible and will aid in diagnostic interpretation and research studies.
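The heteroplasmy thresholds mentioned in the abstract and figure captions (variants reported at ≥10% heteroplasmy, with 95% used as the cutoff for homoplasmic calls) can be illustrated with a minimal Python sketch. The function name `classify_vaf`, the constant names, and the handling of values exactly at a boundary are assumptions made for illustration; this is not the gnomAD pipeline code.

```python
# Thresholds taken from the text: >=10% heteroplasmy is reported,
# and 95% is the level used to define homoplasmic calls.
HET_MIN_VAF = 0.10  # assumed inclusive lower bound
HOM_MIN_VAF = 0.95  # assumed inclusive lower bound


def classify_vaf(vaf: float) -> str:
    """Label a variant allele fraction (hypothetical helper)."""
    if not 0.0 <= vaf <= 1.0:
        raise ValueError(f"VAF must be within [0, 1], got {vaf}")
    if vaf >= HOM_MIN_VAF:
        return "homoplasmic"
    if vaf >= HET_MIN_VAF:
        return "heteroplasmic"
    # Below the reporting threshold; such calls are not released.
    return "below_threshold"


if __name__ == "__main__":
    for vaf in (0.03, 0.10, 0.50, 0.97):
        print(vaf, classify_vaf(vaf))
```

Keeping the two thresholds as named constants makes it easy to see how the released categories (VAF 0.10–0.95 and 0.95–1.00) partition the calls.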

Figures

Figure 1.
Coverage statistics for 70,375 gnomAD WGS samples. (A) Per-base mean depth of coverage across mtDNA, with coverage dips at positions 303–315 (Tan et al. 2016) and 3107 (Bandelt et al. 2014) due to homopolymeric tract and Chr M reference deletion, respectively. (B) For each cohort within gnomAD, a scatterplot shows the mean nuclear (nDNA) and mtDNA coverage ± standard deviation. Three example cohorts are shown in color: 1000 Genomes and Human Genome Diversity Project cell lines (1KG/HGDP), NHLBI, and TOPMed Chronic Obstructive Pulmonary Disease (TOPMED COPD). (C) Histogram shows mean mtDNA coverage for all samples, and overlaid histograms show three selected cohorts (806 outliers with coverage 15,000–97,000 excluded). We note that mean and median mtDNA coverage statistics are extremely similar (Pearson's r = 0.99997). (D) Histogram shows median nDNA coverage for all samples, and overlaid histograms show three selected cohorts (84 outliers with coverage 60–94 excluded). (E) Histogram shows mtDNA copy number per cell (2 × mean mtDNA coverage / median nDNA coverage) for all samples, and overlaid histograms show three selected cohorts (223 outliers with mtCN 1250–7000 excluded). Only samples with mtCN 50–500 (dashed lines) were included in the released mtDNA call set (56,434/70,375).
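The copy-number formula in panel E and the mtCN 50–500 inclusion window can be written out as a short Python sketch. The function names and the choice to treat both bounds as inclusive are assumptions for illustration only; the coverage values in the usage example are made up.

```python
def mt_copy_number(mean_mt_coverage: float, median_nuc_coverage: float) -> float:
    """mtCN = 2 x mean mtDNA coverage / median nDNA coverage (per the caption)."""
    if median_nuc_coverage <= 0:
        raise ValueError("median nuclear coverage must be positive")
    return 2.0 * mean_mt_coverage / median_nuc_coverage


def passes_mtcn_filter(mtcn: float, low: float = 50.0, high: float = 500.0) -> bool:
    """True if mtCN falls in the released range (bounds assumed inclusive)."""
    return low <= mtcn <= high


if __name__ == "__main__":
    # Hypothetical sample: 3000x mean mtDNA coverage, 30x median nDNA coverage.
    mtcn = mt_copy_number(3000.0, 30.0)
    print(mtcn, passes_mtcn_filter(mtcn))
```

The factor of 2 reflects that nuclear coverage is measured over a diploid genome, so dividing by half the nuclear coverage gives mtDNA copies per cell.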
Figure 2.
mtDNA call set is designed to exclude NUMT-derived false positives (NUMT-FPs), cell line artifacts, and contaminants. (A) Schematic shows GATK pipeline for calling mtDNA variants in single WGS samples. The control region spans the artificial break in Chromosome M sequence. (B) Reproducibility of GATK pipeline on 91 WGS replicate samples shows 99.3% concordance of calls (2533/2551), and density plot at top shows 87% of variants are homoplasmic. (C) Accuracy of single-sample pipeline in samples with mtCN > 500 based on “in silico” mixing data. Note that these are valid only for samples with high mtCN. (D) Bar chart shows that the mean number of putative heteroplasmies per sample depends on mtDNA copy number (mtCN), as does the subset occurring at 25 validated NUMT-FP sites (red). (E) Scatterplot shows the observed VAF for a single NUMT-FP (m.16293A > C) across 6844 samples versus the theoretical VAF if the NUMTs were heterozygous and all reads misaligned to the mtDNA. (F) Plot shows VAF levels for NUMT-FP sites decrease with mtCN (colored lines). Y-axis indicates the percent of detected variants that occur at 25 NUMT-FP sites. (G) Density plot shows mtCN for known cell lines and all other samples. (H) Bar plot shows that known cell lines have increased number of heteroplasmic variants in all categories compared to samples with mtCN 50–500 (enrichment shown with *** indicates P-value < 1 × 10⁻⁵ based on Fisher's exact test); pLOF indicates predicted loss-of-function. (I) Schematic shows steps for combining and filtering single-sample variant calls into the gnomAD mtDNA call set, designed to exclude NUMT-derived false positives, cell line artifacts, and contaminants. (J) Number of unique variants that pass filters (bold black) versus those filtered out based on VAF (black) or not released (gray). The 19,137 variants are partitioned into mutually exclusive categories; for example, VAF 0.10–0.95 excludes variants also detected at VAF 0.95–1.00. (K) For each VAF level, bar chart shows the fraction of variants at 25 NUMT-FP sites before sample filtering (red) or after filtering (orange, shown overlaid). (L) Histogram of VAF (after sample filtering) shows that below 10% VAF, there are a large number of variants and a substantial fraction present at 25 validated NUMT-FP sites (red). X-axis label indicates upper bound of VAF bin.
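Panel E refers to a theoretical VAF for a heterozygous NUMT whose reads all misalign to the mtDNA, but the caption does not state the formula. One plausible derivation, offered here as an assumption, follows from the mtCN definition in Figure 1: a heterozygous NUMT contributes one nuclear copy per cell against mtCN mtDNA copies, so the expected VAF is roughly 1 / (mtCN + 1). The Python sketch below uses a hypothetical function name and is not taken from the paper.

```python
def theoretical_numt_vaf(mtcn: float, numt_copies_per_cell: int = 1) -> float:
    """Expected VAF if every NUMT read misaligns to the mtDNA.

    Assumption (not stated in the caption): reads are proportional to
    copies per cell, so VAF = numt_copies / (mtCN + numt_copies).
    Use numt_copies_per_cell=1 for a heterozygous NUMT, 2 for homozygous.
    """
    if mtcn < 0:
        raise ValueError("mtCN must be non-negative")
    if numt_copies_per_cell <= 0:
        raise ValueError("NUMT copies per cell must be positive")
    return numt_copies_per_cell / (mtcn + numt_copies_per_cell)


if __name__ == "__main__":
    for mtcn in (9, 50, 99, 500):
        print(mtcn, round(theoretical_numt_vaf(mtcn), 4))
```

Under this assumption the expected VAF falls as mtCN rises, which matches the trend described in panel F.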
Figure 3.
gnomAD mtDNA variant statistics. (A) Pie charts summarize statistics on mtDNA bases with variants, unique variants, and total variant calls. (B) Bar plot shows the proportion of unique mtDNA variants detected at different population allele frequencies in gnomAD v3.1. (C) Bar chart shows the proportion of variants that are observed only at 10%–95% heteroplasmy (gray) or observed at homoplasmy (blue) including those that are known haplogroup markers in Phylotree (dark blue). (D) Histogram shows number of heteroplasmies per sample (VAF 0.10–0.95). (E) Stacked bar charts show the distribution of variant annotations in the entire mtDNA and for unique variants that are homoplasmic or only observed at heteroplasmy.
Figure 4.
gnomAD v3.1 samples by mtDNA haplogroup and nuclear ancestry. (A) The number of samples is shown by mtDNA top-level haplogroup. Color indicates mtDNA haplogroups phylogenetically associated with African (purple), Asian (green), or European (blue) origin (Lott et al. 2013). (B) For each haplogroup, box plots show the number of homoplasmic SNVs per sample compared to the GRCh38 reference genome (haplogroup H) with the median shown in color. (C) For each haplogroup, stacked bar charts show nuclear ancestry from nuclear genome analysis, with colors as in panel E. (D) For each haplogroup, the percentage of samples from each inferred nuclear ancestry is shown in a heat map. Dash indicates 0 samples, and 0 indicates a percentage between 0–1. (E) The number of samples is shown by inferred nuclear ancestry. (F) For each inferred nuclear ancestry shown in panel D, stacked bar chart shows mtDNA haplogroups phylogenetically associated with African (purple), Asian (green), or European (blue) origin (Lott et al. 2013).
Figure 5.
Patterns of variation in the mtDNA in gnomAD. (A) The bar chart shows the proportion of possible SNVs observed, partitioned into those observed at homoplasmy (black), only at 10%–95% heteroplasmy (gray), or not observed (white). (B) The box plot shows the maximum heteroplasmy of variants observed only at heteroplasmy. Protein indels include frameshift and in-frame variants. “Control reg.” represents the noncoding control region m.16024-576 in A and B. (C) The bar chart shows the proportion of possible synonymous variants observed in gnomAD for transversions (Tv) and all possible transitions (A > G, C > T, G > A, T > C) on the reference strand. (D) The bar chart shows the proportion of codons in protein-coding genes with nonsynonymous SNVs observed. (E,F) The proportion of bases in tRNA and rRNA genes with SNVs. Panels C–F follow the color legend in A.
Figure 6.
Known pathogenic variants in gnomAD. Shown are the 26 pathogenic variants observed in gnomAD along with their heteroplasmy levels, haplogroup distribution, carrier frequency, MITOMAP-curated disease phenotypes, and indicator showing whether disease occurs at homoplasmy (Hom. reported; note this includes variants only associated with disease at homoplasmy, or at both homoplasmy and heteroplasmy). The carrier frequency is calculated as the high-quality allele count divided by the number of individuals with high-quality sequence at the position. The dark gray line at the 95% heteroplasmy level represents the threshold used to define homoplasmic variant calls. Haplogroups are ordered by their position in the phylogenetic tree and colored by their association with African (purple), Asian (green), or European (blue) ancestry. (AMDF) Ataxia, myoclonus, and deafness, (COX) cytochrome c oxidase, (DEAF) maternally inherited deafness or aminoglycoside-induced deafness, (EXIT) exercise intolerance, (LHON) Leber Hereditary Optic Neuropathy, (LS) Leigh syndrome, (MELAS) mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes, (MERRF) myoclonic epilepsy and ragged red muscle fibers, (MLASA) mitochondrial myopathy, lactic acidosis, and sideroblastic anemia, (MM) mitochondrial myopathy, (NARP) neurogenic muscle weakness, ataxia, and retinitis pigmentosa, (SNHL) sensorineural hearing loss, (other) other phenotypes listed for this variant in MITOMAP.
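The carrier-frequency definition in this caption (high-quality allele count divided by the number of individuals with high-quality sequence at the position) reduces to a simple ratio. The Python sketch below uses a hypothetical function name and made-up counts purely to show the arithmetic.

```python
def carrier_frequency(hq_allele_count: int, hq_individuals: int) -> float:
    """Carrier frequency = high-quality allele count / individuals with
    high-quality sequence at the position (as defined in the caption)."""
    if hq_individuals <= 0:
        raise ValueError("need at least one individual with high-quality sequence")
    if not 0 <= hq_allele_count <= hq_individuals:
        raise ValueError("allele count must be between 0 and the number of individuals")
    return hq_allele_count / hq_individuals


if __name__ == "__main__":
    # Hypothetical counts, not values from the paper.
    freq = carrier_frequency(5, 50_000)
    print(f"{freq:.6f} (about 1 in {round(1 / freq)})")
```

Using the per-position denominator rather than the full cohort size keeps the estimate from being deflated at sites where some samples lack high-quality coverage.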

References

    1. Anderson S, Bankier AT, Barrell BG, de Bruijn MH, Coulson AR, Drouin J, Eperon IC, Nierlich DP, Roe BA, Sanger F, et al. 1981. Sequence and organization of the human mitochondrial genome. Nature 290: 457–465. doi: 10.1038/290457a0
    2. Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, Howell N. 1999. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet 23: 147. doi: 10.1038/13779
    3. Bandelt H-J, Kloss-Brandstätter A, Richards MB, Yao Y-G, Logan I. 2014. The case for the continuing use of the revised Cambridge Reference Sequence (rCRS) and the standardization of notation in human mitochondrial DNA studies. J Hum Genet 59: 66–77. doi: 10.1038/jhg.2013.120
    4. Benjamin DI, Sato T, Cibulskis K, Getz G, Stewart C, Lichtenstein L. 2019. Calling somatic SNVs and indels with Mutect2. bioRxiv doi: 10.1101/861054
    5. Bolze A, Mendez F, White S, Tanudjaja F, Isaksson M, Jiang R, Rossi AD, Cirulli ET, Rashkin M, Metcalf WJ, et al. 2020. A catalog of homoplasmic and heteroplasmic mitochondrial DNA variants in humans. bioRxiv doi: 10.1101/798264
