Review

Nat Protoc. 2023 Sep;18(9):2625-2641.

doi: 10.1038/s41596-023-00853-4. Epub 2023 Jul 26.

Tutorial: a statistical genetics guide to identifying HLA alleles driving complex disease

Saori Sakaue^{1,2,3}, Saisriram Gurajala^{1,2,3}, Michelle Curtis^{1,2,3}, Yang Luo^{1,2,3,4}, Wanson Choi⁵, Kazuyoshi Ishigaki^{1,2,3,6}, Joyce B Kang^{1,2,3,7}, Laurie Rumker^{1,2,3,7}, Aaron J Deutsch^{3,8,9,10}, Sebastian Schönherr¹¹, Lukas Forer¹¹, Jonathon LeFaive^{12,13}, Christian Fuchsberger^{11,12,13,14}, Buhm Han^{5,15}, Tobias L Lenz¹⁶, Paul I W de Bakker¹⁷, Yukinori Okada^{18,19,20,21,22,23}, Albert V Smith^{12,13}, Soumya Raychaudhuri^{24,25,26,27,28}

Affiliations

¹ Center for Data Sciences, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
² Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
³ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Kennedy Institute of Rheumatology, University of Oxford, Oxford, UK.
⁵ Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, South Korea.
⁶ Laboratory for Human Immunogenetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
⁷ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
⁸ Diabetes Unit, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
⁹ Center for Genomic Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
¹⁰ Program in Metabolism, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
¹¹ Institute of Genetic Epidemiology, Department of Genetics, Medical University of Innsbruck, Innsbruck, Austria.
¹² Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
¹³ Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA.
¹⁴ Institute for Biomedicine, Eurac Research, Bolzano, Italy.
¹⁵ Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, South Korea.
¹⁶ Research Unit for Evolutionary Immunogenomics, Department of Biology, University of Hamburg, Hamburg, Germany.
¹⁷ Data and Computational Sciences, Vertex Pharmaceuticals, Boston, MA, USA.
¹⁸ Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita, Japan.
¹⁹ Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Suita, Japan.
²⁰ Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Suita, Japan.
²¹ Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
²² Center for Infectious Disease Education and Research (CiDER), Osaka University, Suita, Japan.
²³ Department of Genome Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
²⁴ Center for Data Sciences, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
²⁵ Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
²⁶ Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. soumya@broadinstitute.org.
²⁷ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. soumya@broadinstitute.org.
²⁸ Centre for Genetics and Genomics Versus Arthritis, University of Manchester, Manchester, UK. soumya@broadinstitute.org.

PMID: 37495751
PMCID: PMC10786448
DOI: 10.1038/s41596-023-00853-4

Saori Sakaue et al. Nat Protoc. 2023 Sep.

Abstract

The human leukocyte antigen (HLA) locus is associated with more complex diseases than any other locus in the human genome. In many diseases, HLA explains more heritability than all other known loci combined. In silico HLA imputation methods enable rapid and accurate estimation of HLA alleles in the millions of individuals that are already genotyped on microarrays. HLA imputation has been used to define causal variation in autoimmune diseases, such as type I diabetes, and in human immunodeficiency virus infection control. However, there are few guidelines on performing HLA imputation, association testing, and fine mapping. Here, we present a comprehensive tutorial to impute HLA alleles from genotype data. We provide detailed guidance on performing standard quality control measures for input genotyping data and describe options to impute HLA alleles and amino acids either locally or using the web-based Michigan Imputation Server, which hosts a multi-ancestry HLA imputation reference panel. We also offer best practice recommendations to conduct association tests to define the alleles, amino acids, and haplotypes that affect human traits. Along with the pipeline, we provide a step-by-step online guide with scripts and available software (https://github.com/immunogenomics/HLA_analyses_tutorial). This tutorial will be broadly applicable to large-scale genotyping data and will contribute to defining the role of HLA in human diseases across global populations.

Figures

**Extended Data Fig.1 ∣. The linkage disequilibrium (LD) patterns across the extended MHC region.**
A heatmap of LD r² for pairwise variants across the extended MHC region. We used biallelic markers in our HLA reference panel within European populations and calculated LD r² values for all pairs of these variants. The variants are ordered (on both the x axis and the y axis) and annotated by HLA gene names (on the x axis) based on their genomic coordinates on chromosome 6. The bottom plot shows the detailed LD pattern in the class II region.
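As a rough illustration of the statistic shown in this heatmap, the sketch below computes r² for one pair of biallelic markers from phased haplotypes coded 0/1. The function name `ld_r2` and the toy haplotypes are assumptions made for this example and are not taken from the tutorial's scripts.

```python
def ld_r2(hap_a, hap_b):
    """Return LD r^2 between two biallelic markers.

    hap_a and hap_b are equal-length sequences of 0/1 allele codes,
    one entry per phased haplotype (hypothetical input layout).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and equal in length")
    p_a = sum(hap_a) / n                                   # frequency of allele 1 at marker A
    p_b = sum(hap_b) / n                                   # frequency of allele 1 at marker B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n    # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                                # r^2 is undefined for a monomorphic marker
    d = p_ab - p_a * p_b                                   # LD coefficient D
    return d * d / denom


# Toy haplotypes: identical markers are in complete LD, and the second pair is independent.
print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))
print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))
```

A full heatmap would repeat this calculation for every pair of markers; in practice a dedicated genetics tool is usually a better fit for that scale than a pure-Python loop.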

**Extended Data Fig.2 ∣. Schematic illustration of method used to construct scaffold variants within multi-ancestry HLA reference panel.**
We extracted SNP variants within the MHC region in 1000 Genomes Project (1KG) samples. We retained only variants that were included in major genotyping arrays (Illumina Multi-Ethnic Genotyping Array, Global Screening Array, OmniExpressExome, and Human Core Exome), colored in teal. We then quality-controlled each participating cohort's MHC SNPs separately, retained the variants overlapping with the selected SNPs in 1KG, and cross-imputed each cohort's missing variants by using 1KG genotypes. Finally, we concatenated all cohorts to construct scaffold variants for the multi-ancestry reference panel.

**Extended Data Fig.3 ∣. Michigan Imputation Server.**
Example usage of Michigan Imputation Server for HLA imputation at https://imputationserver.sph.umich.edu/index.html.

**Extended Data Fig.4 ∣. The runtime benchmark for HLA imputation using different platforms.**
a, For SNP2HLA, we used BEAGLE4 as the phasing and imputation algorithm (Luo et al. *Nat Genet.* 2021) with 10 CPUs. For Minimac4, we used SHAPEIT2 as the phasing algorithm with samples <10,000 and EAGLE2 as the phasing algorithm with samples >5,000, as described in the manuscript, both with 10 CPUs. b, For the Michigan Imputation Server, we uploaded the unphased genotype data, and the standard imputation pipeline was performed with the default settings (with 1 CPU).

**Fig.1 ∣. Location and structure of HLA genes on human chromosome 6 and their associations with human traits.**
a, A schematic representation of the human MHC locus highlighting the three main classes of the region, and the genes within them. The classical class I HLA genes are shown in yellow, the classical class II HLA genes in blue, the nonclassical HLA genes in purple and the genes other than the HLA genes within the MHC region in red. b, Presentation of an antigenic peptide by an antigen-presenting cell to a T cell through interaction between an MHC class II molecule and a TCR. The inset shows the protein structure of the MHC class II complex composed of HLA-DRA and DRB1 bound to an antigenic peptide (PDB ID: 3L6F). c, The number of traits associated with any variants within a 2 Mb genomic window with P < 5 × 10⁻⁸ among the 198 diseases and biomarkers in the UK Biobank and FinnGen. The MHC region is highlighted in red. The GWAS data for 198 diseases and biomarkers were obtained and analyzed as previously described.

**Fig.2 ∣. Overview of HLA imputation, association and fine mapping.**
a, A toy example illustrating the workflow for HLA imputation. The process begins with (1) either using an existing HLA imputation reference panel or creating a custom one, followed by (2) collecting the input genotype in the MHC region from the target cohort without HLA types, (3) performing QC of the target genotype and (4) genotype phasing and imputation to predict the untyped HLA alleles in the target cohort (locally using the SNP2HLA software or online using the MIS), resulting in (5) predicted HLA alleles in the target cohort. b, Statistical methods to investigate and fine-map association of (1) individual HLA alleles, (2) amino acid positions comprising multiple residues (highlighted in blue) and (3) their haplotypes (highlighted in red) with a trait of interest. See Fig. 3 for an overview of HLA allele nomenclature and structure.

**Fig.3 ∣. Nomenclature, structure and encoding of HLA alleles.**
a, The nomenclature of HLA alleles consists of four fields, with each field corresponding to the types and consequences of nucleotide variations. b, (Top) The amino acid sequences defining each of three example HLA-DRB1 alleles. The amino acids colored in red indicate the positions where they have variations among the alleles. The numbers (−25 and −24) at the bottom indicate the relative position of those amino acids within a coding region of HLA-DRB1. The negative positions indicate amino acids within a signal peptide, which is not part of the HLA protein presented on a cell surface. (Bottom) A procedure to code each of the HLA alleles and amino acid polymorphisms as binary markers: 1 if that marker is present within a haplotype and 0 otherwise. Each of the residues is coded separately for a given amino acid position in the corresponding HLA protein.
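The binary coding described in panel b can be sketched as follows. This is a minimal illustration under stated assumptions: the residue table, the marker naming scheme (`HLA_<allele>`, `AA_<gene>_<position>_<residue>`) and the function names are hypothetical and chosen for readability; they are not the exact marker names used by any imputation reference panel.

```python
# Illustrative residues at two HLA-DRB1 positions (assumed for this sketch,
# not verified against an HLA sequence database).
ALLELE_RESIDUES = {
    "DRB1*01:01": {11: "L", 71: "R"},
    "DRB1*04:01": {11: "V", 71: "K"},
    "DRB1*15:01": {11: "P", 71: "A"},
}


def marker_names(allele_residues, gene="DRB1"):
    """List every binary marker: one per allele, one per (position, residue) pair."""
    names = [f"HLA_{allele}" for allele in sorted(allele_residues)]
    pairs = {(pos, aa) for res in allele_residues.values() for pos, aa in res.items()}
    names += [f"AA_{gene}_{pos}_{aa}" for pos, aa in sorted(pairs)]
    return names


def encode_haplotype(allele, allele_residues, gene="DRB1"):
    """Code one haplotype: 1 if the marker is present on it, 0 otherwise."""
    present = {f"HLA_{allele}"}
    present |= {f"AA_{gene}_{pos}_{aa}" for pos, aa in allele_residues[allele].items()}
    return {name: int(name in present) for name in marker_names(allele_residues, gene)}


encoded = encode_haplotype("DRB1*04:01", ALLELE_RESIDUES)
for name, value in encoded.items():
    print(name, value)
```

Each residue at a position becomes its own 0/1 marker, so a position with several possible residues is represented by several columns rather than a single multi-level variable.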

**Fig.4 ∣. A flow chart of suggested analytical steps for genotype QC and HLA imputation.**
A best-practice guideline for imputing HLA alleles using the SNP2HLA algorithm, depending on the characteristics of the target genotype data.

**Fig.5 ∣. HLA Imputation quality in MIS.**
a–e, Dosage correlation r (y axis) between the MIS-imputed dosage and true genotypes of all two-field alleles in 1KG samples as a function of allele frequency (AF; x axis), colored by HLA gene, for all 1KG individuals (a) or per ancestry (b–e). f, The accuracy (concordance) between the MIS-imputed dosage of all two-field alleles in 1KG samples and their true genotypes, per HLA gene and per ancestry. The accuracy metric was calculated as previously described. EUR, Europeans; EAS, East Asians; AMR, admixed Americans; AFR, Africans.
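The dosage correlation r in panels a–e can be read as a Pearson correlation, across individuals, between the imputed dosage of an allele and its true copy number. The sketch below assumes that reading; the function name and the toy values are hypothetical, and the separate concordance metric in panel f is not reproduced here because its definition is not given in this record.

```python
import math


def dosage_correlation(imputed, truth):
    """Pearson correlation between imputed dosages (0..2) and true allele counts (0, 1, 2)."""
    n = len(imputed)
    if n == 0 or n != len(truth):
        raise ValueError("inputs must be non-empty and equal in length")
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    var_x = sum((x - mean_x) ** 2 for x in imputed)
    var_y = sum((y - mean_y) ** 2 for y in truth)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either vector has no variation
    return cov / math.sqrt(var_x * var_y)


# Toy data for four individuals and one two-field allele.
imputed_dosage = [0.1, 0.9, 1.8, 0.0]
true_genotype = [0, 1, 2, 0]
print(round(dosage_correlation(imputed_dosage, true_genotype), 4))
```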

**Fig.6 ∣. Grouping of two-field alleles using the conditional haplotype test.**
a,b, An example illustration of the conditional haplotype test for the *HLA-DRB1* gene. In the first round of the conditional haplotype test (a), we group all two-field alleles (32 alleles in total) into six groups on the basis of the amino acid residues at position +11 and ask whether those groups significantly explain the disease risk by using the omnibus test. In the second round of the conditional haplotype test (b; position +71 as an example), we group the two-field alleles into ten groups on the basis of the amino acid residues at positions +11 and +71. Then, we ask whether the full model with those ten groups explains the disease risk better than the reduced model with the six groups defined in the first round, comparing the two models by the delta deviance using an F-test.
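The grouping step of the conditional haplotype test can be sketched as below. Only the grouping is shown; fitting the reduced and full models and comparing their deviance is left out. The residue table and the function name `group_alleles` are assumptions for this example, and the four alleles stand in for the 32 two-field alleles mentioned in the caption.

```python
from collections import defaultdict

# Illustrative residues at HLA-DRB1 positions 11 and 71 (assumed for this sketch).
RESIDUES = {
    "DRB1*01:01": {11: "L", 71: "R"},
    "DRB1*01:02": {11: "L", 71: "R"},
    "DRB1*04:01": {11: "V", 71: "K"},
    "DRB1*04:04": {11: "V", 71: "R"},
}


def group_alleles(residues, positions):
    """Group two-field alleles that share the same residues at the given positions."""
    groups = defaultdict(list)
    for allele in sorted(residues):
        key = tuple(residues[allele][pos] for pos in positions)
        groups[key].append(allele)
    return dict(groups)


reduced = group_alleles(RESIDUES, [11])       # first round: position 11 only
full = group_alleles(RESIDUES, [11, 71])      # second round: positions 11 and 71
print(len(reduced), "groups in the reduced model:", reduced)
print(len(full), "groups in the full model:", full)
```

Adding a position can only split existing groups, never merge them, which is why the full model nests the reduced one and the two can be compared by their difference in deviance.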

**Fig.7 ∣. Nonadditive test and multitrait analysis.**
a, Schematic illustrations of the additive and nonadditive models using the log odds ratio (log(OR)) according to the dosage of the genotype of interest. a denotes the purely additive effect by having one copy of the disease or trait-associated allele, and d denotes any departure from additivity for a heterozygous genotype. b, A logistic regression model to assess both the additive and nonadditive effect of the allele j (see main text for details). c, Multitrait analysis using a multiple linear regression model to test the association between the multidimensional phenotype Y and the amino acid polymorphism.
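One way to read panel a is that the log odds ratio is 0 for no copies, a + d for one copy and 2a for two copies of the allele. The sketch below encodes that reading with an additive term (copy number) and a heterozygote indicator; the function names and the numeric effect sizes are hypothetical, and no model fitting is performed.

```python
def nonadditive_terms(copies):
    """Return (additive term, heterozygote indicator) for 0, 1 or 2 copies of an allele."""
    if copies not in (0, 1, 2):
        raise ValueError("copies must be 0, 1 or 2")
    return copies, int(copies == 1)


def log_odds_ratio(copies, a, d):
    """log(OR) relative to zero copies: a per copy, plus d for heterozygotes only."""
    additive, heterozygous = nonadditive_terms(copies)
    return a * additive + d * heterozygous


# Hypothetical effect sizes: a = 0.5 per copy, d = 0.25 departure from additivity.
for copies in (0, 1, 2):
    print(copies, log_odds_ratio(copies, a=0.5, d=0.25))
```

With d = 0 the heterozygote falls exactly halfway between the two homozygotes, which recovers the purely additive model.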
