Nature. 2022 Jul;607(7920):732-740.
doi: 10.1038/s41586-022-04965-x. Epub 2022 Jul 20.

The sequences of 150,119 genomes in the UK Biobank

Bjarni V Halldorsson, Hannes P Eggertsson, Kristjan H S Moore, Hannes Hauswedell, Ogmundur Eiriksson, Magnus O Ulfarsson, Gunnar Palsson, Marteinn T Hardarson, Asmundur Oddsson, Brynjar O Jensson, Snaedis Kristmundsdottir, Brynja D Sigurpalsdottir, Olafur A Stefansson, Doruk Beyter, Guillaume Holley, Vinicius Tragante, Arnaldur Gylfason, Pall I Olason, Florian Zink, Margret Asgeirsdottir, Sverrir T Sverrisson, Brynjar Sigurdsson, Sigurjon A Gudjonsson, Gunnar T Sigurdsson, Gisli H Halldorsson, Gardar Sveinbjornsson, Kristjan Norland, Unnur Styrkarsdottir, Droplaug N Magnusdottir, Steinunn Snorradottir, Kari Kristinsson, Emilia Sobech, Helgi Jonsson, Arni J Geirsson, Isleifur Olafsson, Palmi Jonsson, Ole Birger Pedersen, Christian Erikstrup, Søren Brunak, Sisse Rye Ostrowski, DBDS Genetic Consortium, Gudmar Thorleifsson, Frosti Jonsson, Pall Melsted, Ingileif Jonsdottir, Thorunn Rafnar, Hilma Holm, Hreinn Stefansson, Jona Saemundsdottir, Daniel F Gudbjartsson, Olafur T Magnusson, Gisli Masson, Unnur Thorsteinsdottir, Agnar Helgason, Hakon Jonsson, Patrick Sulem, Kari Stefansson

Bjarni V Halldorsson et al. Nature. 2022 Jul.

Abstract

Detailed knowledge of how diversity in the sequence of the human genome affects phenotypic diversity depends on a comprehensive and reliable characterization of both sequences and phenotypic variation. Over the past decade, insights into this relationship have been obtained from whole-exome sequencing or whole-genome sequencing of large cohorts with rich phenotypic data [1,2]. Here we describe the analysis of whole-genome sequencing of 150,119 individuals from the UK Biobank [3]. This constitutes a set of high-quality variants, including 585,040,410 single-nucleotide polymorphisms, representing 7.0% of all possible human single-nucleotide polymorphisms, and 58,707,036 indels. This large set of variants allows us to characterize selection based on sequence variation within a population through a depletion rank score of windows along the genome. Depletion rank analysis shows that coding exons represent a small fraction of regions in the genome subject to strong sequence conservation. We define three cohorts within the UK Biobank: a large British Irish cohort, a smaller African cohort and a South Asian cohort. A haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals. We identified 895,055 structural variants and 2,536,688 microsatellites, groups of variants typically excluded from large-scale whole-genome sequencing studies. Using this formidable new resource, we provide several examples of trait associations for rare variants with large effects not found previously through studies based on whole-exome sequencing and/or imputation.
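The "7.0% of all possible SNPs" figure can be sanity-checked with simple arithmetic. The sketch below assumes that every genomic position can mutate to three alternative bases; the abstract does not state the denominator the authors used, so the implied number of positions is only a back-of-the-envelope estimate, and the function name is illustrative.

```python
# Back-of-the-envelope check of the saturation figure quoted in the abstract.
OBSERVED_SNPS = 585_040_410   # from the abstract
REPORTED_FRACTION = 0.070     # from the abstract

# ASSUMPTION: each position has 3 possible alternative alleles.
ALTS_PER_BASE = 3


def saturation(observed: int, n_bases: float, alts_per_base: int = ALTS_PER_BASE) -> float:
    """Fraction of all possible SNPs that were observed."""
    return observed / (n_bases * alts_per_base)


# Number of possible SNPs and of genomic positions implied by the two figures.
implied_possible_snps = OBSERVED_SNPS / REPORTED_FRACTION
implied_bases = implied_possible_snps / ALTS_PER_BASE

print(f"implied possible SNPs: {implied_possible_snps:.3e}")
print(f"implied positions:     {implied_bases:.3e}")
print(f"saturation check:      {saturation(OBSERVED_SNPS, implied_bases):.3f}")
```

Under this assumption the two figures imply a denominator of a little under three billion positions, which is the right order of magnitude for a human reference genome.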

Conflict of interest statement

B.V.H., H.P.E., K.H.S.M., H.Hauswedell, O.E., M.O.U., G.P., M.T.H., A.O., B.O.J., S.K., B.D.S., O.A.S., D.B., G.H., V.T., A.G., P.I.O., F.Z., M.A., S.T.S., B.S., S.A.G., G.T.S., G.H.H., G.S., K.N., U.S., D.N.M., S.S., K.K., E.S., G.T., F.J., P.M., I.J., T.R., H.Holm, H.S., J.S., D.F.G., O.T.M., G.M., U.T., A.H., H.J., P.S. and K.S. are employees of deCODE genetics/Amgen.

Figures

Fig. 1. Mutation classes of sequence variants in the UKB.
a, Fraction of SNPs in each mutation class, for all SNPs in our dataset, singletons in our dataset and in an Icelandic set of de novo mutations (DNMs). b, Saturation levels of mutations in each class, split into singleton variants (blue) and more common variants (red). c, Saturation levels of transitions at methylated CpG sites across genomic annotations and predicted consequence categories. The horizontal line is the average across all methylated CpG sites. The error bars are 95% CIs, which were computed using a normal approximation, treating each CpG site as an independent observation. The numbers of CpG sites used in c are: stop gained n = 46,670, missense n = 669,526, coding n = 1,067,847, splice n = 26,797, 5′ UTR n = 60,885, 3′ UTR n = 508,981, proximal n = 17,722,875 and intergenic n = 15,266,391.
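The error bars described for panel c can be illustrated with a normal-approximation confidence interval for a proportion. This is a minimal sketch and not the authors' code: the function name is illustrative, and the number of mutated sites is a hypothetical value, since the caption gives only the site counts.

```python
import math
from typing import Tuple


def normal_approx_ci(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """95% CI for a proportion, treating each of the n sites as independent."""
    p = hits / n
    se = math.sqrt(p * (1.0 - p) / n)  # standard error of a binomial proportion
    return p - z * se, p + z * se


# n is the stop-gained site count from the caption; the hit count is a
# HYPOTHETICAL value chosen only for illustration.
n_stop_gained = 46_670
hypothetical_hits = 14_001

low, high = normal_approx_ci(hypothetical_hits, n_stop_gained)
print(f"saturation = {hypothetical_hits / n_stop_gained:.3f}, "
      f"95% CI = ({low:.3f}, {high:.3f})")
```

Because the interval width shrinks with the square root of n, categories with millions of sites (proximal, intergenic) would show much tighter error bars than stop gained or splice.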
Fig. 2. Functionally important regions.
a, Fraction of regions falling into functional annotation classes, as defined by Ensembl gene map, as a function of DR. b, DR score as a function of distance from exon and LOEUF decile. Error bars represent 95% CI, computed using a normal approximation, treating each gene (n ranges between 1,206 and 1,848) as an independent observation. c, Fraction of rare (with four or fewer carriers) variants as a function of DR. d, Average GERP score in 500-bp windows as a function of DR. RS, rejected substitution. e,f, LOEUF (e) and LOEUF|GERP (f) as a function of DR. In e and f, the middle bar indicates the average, hinges are the 25th and the 75th quantiles, black dots indicate outliers, and the whiskers extend from the hinges to the largest or smallest value within 1.5 times the interquartile range. The number of genes or observations in the DR ranges are the following: n(0–0.1) = 1,234, n(0.1–0.2) = 3,202, n(0.2–0.3) = 4,474, n(0.3–0.4) = 3,888, n(0.4–0.5) = 2,476, n(0.5–0.6) = 1,384, n(0.6–0.7) = 863, n(0.7–0.8) = 522, n(0.8–0.9) = 374 and n(0.9–1) = 427.
Fig. 3. Cohort characteristics.
a, The number of WGS samples analysed for phenotypes in our study. b, UMAP plot generated from the first 40 principal components of all UKB participants, coloured by self-reported ethnicity: blue shades for ethnic labels under the white category (XBI), red shades for Black individuals (XAF) and green shades for South Asian individuals (XSA); for the full colour legend, see Supplementary Fig. 17. c, Joint frequency spectrum of variants on chromosome 20 between all pairs of populations. d–f, Characteristics of the XBI cohort across Great Britain and Ireland are shown: the number of singletons carried by individuals in the XBI cohort as a function of place of birth (d); the mean number of third-degree relatives by administrative division (e); and the location of UKB assessment centres and estimated fraction of the surrounding population recruited to the UKB (f). Differences in singleton counts and the number of third-degree relatives are probably a result of denser sampling of individuals living near UKB assessment centres. Fig. 3d–f by K.H.S.M.
Fig. 4. Variant call set.
a, Number of SNPs, indels, microsatellites, SV insertions, SV deletions and singleton SNPs carried per diploid genome of individuals in the overall set and partitioned by population. b, Imputation accuracy in the three populations: XBI (left), XAF (middle) and XSA (right). A variant was considered imputed if the ‘leave one out r²’ of phasing was greater than 0.5 and the imputation information was greater than 0.8. The x axis splits variants into frequency classes based on the number of carriers in the sequence dataset. Variants are split by variant type. c, Number of SVs discovered in the dataset by variant type. d, Length distribution of SVs, from 50 to 1,000 bp, 1,000 to 10,000 bp and 10,000 to 100,000 bp.
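The rule in panel b for counting a variant as imputed can be written as a simple predicate. The sketch below is illustrative only: the class and field names are hypothetical, and only the two thresholds (0.5 and 0.8, both strict) come from the caption.

```python
from dataclasses import dataclass


@dataclass
class VariantQuality:
    """Hypothetical container for the two per-variant quality metrics."""
    loo_r2: float  # leave-one-out r^2 of phasing
    info: float    # imputation information


def is_imputed(v: VariantQuality, r2_min: float = 0.5, info_min: float = 0.8) -> bool:
    """Apply the caption's rule: both metrics must strictly exceed their thresholds."""
    return v.loo_r2 > r2_min and v.info > info_min


# Hypothetical example variants.
variants = [
    VariantQuality(loo_r2=0.92, info=0.97),  # passes both thresholds
    VariantQuality(loo_r2=0.45, info=0.95),  # fails the r^2 threshold
    VariantQuality(loo_r2=0.80, info=0.75),  # fails the information threshold
]
fraction_imputed = sum(is_imputed(v) for v in variants) / len(variants)
print(f"fraction imputed: {fraction_imputed:.2f}")
```

Applying the predicate within each carrier-count class and variant type would give per-class fractions of the kind plotted in the panel.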
Extended Data Fig. 1. Average score in 500-bp windows as a function of depletion rank.
Scores shown are a, CADD, b, Eigen, c, CDTS and d, LINSIGHT. Green line represents average score, blue and red line 95-th percentile.
Extended Data Fig. 2. Geographic distribution of the loadings of the first four principal components of a PCA of the XBI population.
Extended Data Fig. 3. Cartogram-pies indicating the proportion of individuals born in each country (name shown on top of pies) in the XBI cohort.
Pies are placed roughly according to their country’s position on a world map. Grey and white squares represent sea and land respectively.
Extended Data Fig. 4. Cartogram-pies indicating the proportion of individuals born in each country (name shown on top of pies) in the XAF cohort.
Pies are placed roughly according to their country’s position on a world map. Grey and white squares represent sea and land respectively.
Extended Data Fig. 5. Cartogram-pies indicating the proportion of individuals born in each country (name shown on top of pies) in the XSA cohort.
Pies are placed roughly according to their country’s position on a world map. Grey and white squares represent sea and land respectively.
Extended Data Fig. 6. Loss-of-function.
a, Correlation between the number of loss-of-function (LoF) genes per sample and the fraction of the genome in runs of homozygosity. Shaded region represents 95% confidence interval. b, Number of homozygous LoF genes per sample. Count of homozygous genes annotated as high impact with frequency <1%. Results are presented for five groups: XBI; XAF; XSA excluding individuals self-identified as Pakistani; individuals self-identified as Pakistani from the XSA cohort; and Others.
Extended Data Fig. 7. Alternative alleles by region.
Numbers in brackets beneath region names indicate count of whole genome sequenced individuals with birthplaces in that region. Assignment of countries to regions is almost identical to the categorization displayed in the cohort cartogram pie figures, with the exception that all European regions are combined into one region in this figure. Vertical lines underneath density curves represent 0th, 25th, 50th, 75th, and 100th percentiles.
Extended Data Fig. 8. DR as a function of distance from coding exon partitioned by LOEUF [22] deciles.
Results are shown separately for the overall dataset (All) and the individual cohorts, XBI, XAF and XSA.
Extended Data Fig. 9. RNA analysis of NMRK2.
a, Coverage plot of RNA-sequencing reads over the gene NMRK2 from 169 heart tissue samples. One individual is a carrier of a 754-bp deletion, depicted with a grey rectangle, that includes exon 6 of NMRK2. The RNA coverage of the carrier (blue) is lower over exon 6 than the median coverage of non-carriers (green). Shading marks the deleted region. b, Histogram of the fraction of RNA-sequenced fragments skipping exon 6 in NMRK2 out of all fragments aligning from the donor site of exon 5 to the acceptor site of either exon 6 or exon 7. The median fraction of fragments skipping exon 6 is 0.035 for wild-type individuals, and the fraction is 0.57 for the carrier of the 754-bp deletion.
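The skipping fraction in panel b is a ratio of junction-spanning fragment counts. The following is a minimal sketch of that ratio, assuming fragments have already been counted per junction; the function name and the counts are hypothetical and chosen only so that the result matches the carrier value quoted in the caption.

```python
def exon_skipping_fraction(skip_fragments: int, include_fragments: int) -> float:
    """Fraction of fragments leaving the exon 5 donor site that skip exon 6.

    skip_fragments:    fragments joining exon 5 directly to exon 7
    include_fragments: fragments joining exon 5 to exon 6
    """
    total = skip_fragments + include_fragments
    if total == 0:
        raise ValueError("no fragments span the exon 5 donor site")
    return skip_fragments / total


# HYPOTHETICAL counts for a deletion carrier, for illustration only.
carrier_fraction = exon_skipping_fraction(skip_fragments=57, include_fragments=43)
print(f"carrier skipping fraction: {carrier_fraction:.2f}")
```

Guarding against a zero denominator matters in practice, because samples with low expression of the gene may have no fragments spanning the junction at all.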
Extended Data Fig. 10. Odds ratio for risk of myotonic dystrophy as a function of repeat length in microsatellite at the 3’ untranslated region of DMPK.
Carriers of at least 39.7 copies of the microsatellite repeat motif have a 162-fold increased risk of myotonic dystrophy. Error bars represent 95% confidence intervals, n = 431,079.

References

    1. Gudbjartsson DF, et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 2015;47:435–444. doi: 10.1038/ng.3247.
    2. Taliun D, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590:290–299. doi: 10.1038/s41586-021-03205-y.
    3. Sudlow C, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. doi: 10.1371/journal.pmed.1001779.
    4. Fry A, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am. J. Epidemiol. 2017;186:1026–1034. doi: 10.1093/aje/kwx246.
    5. Bycroft C, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. doi: 10.1038/s41586-018-0579-z.