[Preprint]. 2023 Jun 9:2023.06.07.544066.
doi: 10.1101/2023.06.07.544066.

Hidden protein-altering variants influence diverse human phenotypes

Margaux L A Hujoel et al. bioRxiv. 2023.

Abstract

Structural variants (SVs) comprise the largest genetic variants, altering from 50 base pairs to megabases of DNA. However, SVs have not been effectively ascertained in most genetic association studies, leaving a key gap in our understanding of human complex trait genetics. We ascertained protein-altering SVs from UK Biobank whole-exome sequencing data (n=468,570) using haplotype-informed methods capable of detecting sub-exonic SVs and variation within segmental duplications. Incorporating SVs into analyses of rare variants predicted to cause gene loss-of-function (pLoF) identified 100 associations of pLoF variants with 41 quantitative traits. A low-frequency partial deletion of RGL3 exon 6 appeared to confer one of the strongest protective effects of gene LoF on hypertension risk (OR = 0.86 [0.82–0.90]). Protein-coding variation in rapidly evolving gene families within segmental duplications (previously invisible to most analysis methods) appeared to generate some of the human genome's largest contributions to variation in type 2 diabetes risk, chronotype, and blood cell traits. These results illustrate the potential for new genetic insights from genomic variation that has escaped large-scale analysis to date.

Conflict of interest statement

Declaration of Interests: The authors declare no competing interests.

Figures

Figure 1. Haplotype-informed CNV detection from whole-exome sequencing in UK Biobank.
(a) This approach improves power to detect CNVs by analyzing whole-exome sequencing read-depth data from an individual together with corresponding data from individuals sharing extended SNP-haplotypes (“haplotype neighbors”), facilitating analysis at the resolution of 100bp bins. In contrast, standard approaches analyze data from an individual alone, generally at exon-level resolution. (b) Average number of CNVs called per UKB participant, subdivided by copy-number change (deletion/duplication) and call length. (c) Validation rate of CNV calls based on analysis of whole-genome sequencing data for 100 UKB participants. (d) Average numbers of CNVs called per UKB participant affecting given numbers of genes or exons. (e) Distributions (across increasingly constrained gene sets) of observed counts of predicted loss-of-function deletions and whole-gene duplications in 487,205 UKB participants. Centers, medians; box edges, 25th and 75th percentiles; whiskers, 5th and 95th percentiles. (f) Fractions of UKB participants with given numbers of genes affected by rare CNVs.
Figure 2. Association and fine-mapping analyses implicate rare large-effect CNVs and uncover new gene-trait relationships.
(a) Effect size versus minor allele frequency for 180 likely-causal CNV-phenotype associations, colored by phenotype category. (b) Number of genes with pLoF burden associations (P < 5 × 10⁻⁸) per trait, colored by phenotype category, with darker shading corresponding to associations detectable only upon including pLoF CNVs (i.e., P > 5 × 10⁻⁸ for burden masks considering only SNVs and indels). (c) Genomic locations of CCNF pLoF CNV calls; boxed calls correspond to the rare duplication spanning a single 107bp exon. (d) Effect sizes of CCNF pLoF CNVs for height and erythrocyte traits. (e) Consistency of height effect sizes of pLoF CNVs with those of pLoF SNVs and indels. (f) Replication of height effect sizes of pLoF CNVs in BioBank Japan (for newly-implicated genes with at least five pLoF CNV carriers in BBJ). Error bars, 95% CIs.
Figure 3. A low-frequency deletion in RGL3 associates with reduced hypertension risk and generates novel splicing.
(a) Associations of variants at the RGL3 locus with systolic blood pressure in two steps of stepwise conditional analysis. Colored dots are variants in partial LD (R² ≥ 0.01) with labeled variants. (b) Effect sizes and allele frequencies of a common RGL3 missense variant (rs167479), the low-frequency 1.1kb deletion, and a rare RGL3 stop gain. (c) Evidence of novel RGL3 splicing produced by the 1.1kb deletion. RNA sequencing read-depth data from GTEx are shown for a carrier of the deletion and a control sample (both thyroid); red arcs indicate novel splice junctions, labeled with counts of supporting RNA-seq reads. (d) Systolic and diastolic blood pressure effect sizes versus minor allele frequencies for nonsynonymous SNP and indel variants and the 1.1kb deletion. Error bars, 95% CIs.
Figure 4. Coding variants within segmental duplications underlie top genetic associations with type 2 diabetes and chronotype.
(a,b) Genome-wide associations with (a) type 2 diabetes (T2D) and (b) chronotype. (c) Associations of variation at 7q22.1 with chronotype and T2D. Associations of paralogous sequence variants (PSVs) within the 99kb repeat at this locus (1–7 copies per allele; 2 copies in GRCh37) are plotted in the center; green dashed line indicates association strength of copy number of the 99kb repeat. (d) Joint distribution of copy-number estimates for the 99kb segmental duplication and the RASA4 Y731C missense variant. (e) T2D prevalence and mean chronotype (in standardized units; higher for “evening people”) as a function of number of copies of the RASA4 Y731C missense variant. (f) T2D associations at the CTRB2 locus; colored dots are variants in partial LD (R² > 0.01) with the CTRB2 exon 6 deletion. (g) Location of the 584bp deletion spanning CTRB2 exon 6 (top) and exome-sequencing read alignments for a deletion carrier (bottom); most reads aligned to the region paralogous to CTRB1 do not map uniquely and are colored white. (h) Scatter plot of normalized whole-genome and whole-exome sequencing read depths at CTRB2 exon 6. (i) Mean HbA1c and prevalence of T2D and pancreatic cancer as a function of CTRB2 exon 6 deletion genotype. Error bars, 95% CIs.
Figure 5. Variation in segmental duplications generates two of the top five genetic associations with basophil counts.
(a) Genome-wide associations with basophil counts. (b) Associations with basophil counts at the FCGR3B locus; colored dots are variants in partial LD (R² > 0.01) with FCGR3B copy number. (c) Joint distribution of copy-number estimates for FCGR3A and FCGR3B. (d) Mean basophil count and prevalence of chronic obstructive pulmonary disease (COPD) as a function of FCGR3B copy number. (e) Associations with basophil counts at the DEFA1/DEFA3 locus. PSVs within the 19kb repeat at this locus are plotted as in Fig. 4c; green dashed line indicates association strength of copy number of the 19kb repeat. (f) Histogram of the number of copies of the 19kb repeat carrying the 5-SNP haplotype represented by chr8:6993547 C>A (GRCh38 coordinates). (g) Mean monocyte and basophil count as a function of copy number of the 5-SNP haplotype. Error bars, 95% CIs.
Figure 6. Common, pleiotropic SIGLEC14–SIGLEC5 gene fusion illustrates tissue-specific promoter activity.
(a) Gene diagram of SIGLEC14 and SIGLEC5. A common deletion allele fuses the SIGLEC14 promoter to the SIGLEC5 gene body, and the reciprocal duplication allele is also observed at lower frequencies. (b) Allele frequency of gene fusion and duplication events in UK Biobank, stratified by reported ethnicity. (c) Effect size of fusion on blood indices and serum biomarker traits. (d) Allelic fold change effect of fusion on SIGLEC5 and SIGLEC14 gene expression across GTEx tissues tracks with relative efficiency of SIGLEC14 promoter vs. SIGLEC5 promoter in each tissue. Error bars, 95% CIs.
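
The intuition behind the haplotype-informed approach in Figure 1a (pooling read-depth evidence across "haplotype neighbors" at 100bp resolution) can be illustrated with a toy sketch. This is not the authors' method: the simple averaging, the thresholds `del_below` and `dup_above`, the function names, and the toy data are all assumptions made for illustration, and a real caller would rely on a proper statistical model of read depth.

```python
from statistics import mean

BIN_SIZE_BP = 100  # bin resolution mentioned in the Figure 1a caption


def pooled_depth(depth, neighbors, person):
    """Average per-bin normalized depth over a person and their haplotype neighbors.

    `depth` maps a sample ID to a list of normalized read depths (1.0 = reference
    copy number); `neighbors` maps a sample ID to the IDs assumed to share an
    extended haplotype with it. Both structures are hypothetical.
    """
    group = [person] + list(neighbors.get(person, []))
    n_bins = len(depth[person])
    return [mean(depth[p][b] for p in group) for b in range(n_bins)]


def call_bins(values, del_below=0.75, dup_above=1.25):
    """Label each bin 'del', 'dup', or 'ref' using illustrative thresholds."""
    labels = []
    for value in values:
        if value < del_below:
            labels.append("del")
        elif value > dup_above:
            labels.append("dup")
        else:
            labels.append("ref")
    return labels


# Toy data: three samples assumed to share a haplotype carrying a deletion
# over bins 1-2; sample A's depth in bin 1 is noisy.
depth = {
    "A": [1.0, 0.8, 0.4, 1.1],
    "B": [1.0, 0.6, 0.6, 0.9],
    "C": [1.0, 0.4, 0.5, 1.0],
}
neighbors = {"A": ["B", "C"]}

single = call_bins(depth["A"])                            # sample A alone
pooled = call_bins(pooled_depth(depth, neighbors, "A"))   # A plus neighbors
```

In this toy case, the call from sample A alone misses the noisy bin, while the pooled call labels both deleted bins, which is the kind of power gain the caption describes.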
