Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2024 Aug;632(8023):122-130.
doi: 10.1038/s41586-024-07708-2. Epub 2024 Jul 17.

Sources of gene expression variation in a globally diverse human cohort

Affiliations

Sources of gene expression variation in a globally diverse human cohort

Dylan J Taylor et al. Nature. 2024 Aug.

Abstract

Genetic variation that influences gene expression and splicing is a key source of phenotypic diversity1-5. Although invaluable, studies investigating these links in humans have been strongly biased towards participants of European ancestries, which constrains generalizability and hinders evolutionary research. Here to address these limitations, we developed MAGE, an open-access RNA sequencing dataset of lymphoblastoid cell lines from 731 individuals from the 1000 Genomes Project6, spread across 5 continental groups and 26 populations. Most variation in gene expression (92%) and splicing (95%) was distributed within versus between populations, which mirrored the variation in DNA sequence. We mapped associations between genetic variants and expression and splicing of nearby genes (cis-expression quantitative trait loci (eQTLs) and cis-splicing QTLs (sQTLs), respectively). We identified more than 15,000 putatively causal eQTLs and more than 16,000 putatively causal sQTLs that are enriched for relevant epigenomic signatures. These include 1,310 eQTLs and 1,657 sQTLs that are largely private to underrepresented populations. Our data further indicate that the magnitude and direction of causal eQTL effects are highly consistent across populations. Moreover, the apparent 'population-specific' effects observed in previous studies were largely driven by low resolution or additional independent eQTLs of the same genes that were not detected. Together, our study expands our understanding of human gene expression diversity and provides an inclusive resource for studying the evolution and function of human genomes.

PubMed Disclaimer

Conflict of interest statement

A. Battle is a co-founder and equity holder of CellCipher, a stockholder in Alphabet, and has consulted for Third Rock Ventures. The other authors declare no competing interests.

Figures

Fig. 1
Fig. 1. A globally diverse transcriptomics dataset.
a, RNA-seq data were generated from LCLs from 731 individuals from the 1KGP, roughly evenly distributed across 26 populations and 5 continental groups. Populations included in MAGE are indicated in pink, whereas the Maasai population is in blue as it is present in the AFGR dataset (based on sequencing of HapMap cell lines) but not in the 1KGP or MAGE. Full population descriptors can be found at https://catalog.coriell.org/1/NHGRI/About/Guidelines-for-Referring-to-Populations. b, Genotype principal component 1 (PC1) and PC2 comparing MAGE to other large studies with paired RNA and whole-genome sequencing data. Samples from the specified study (that is, MAGE, Geuvadis, GTEx and AFGR) are depicted with coloured points, whereas samples from other studies are depicted with grey points in each respective panel. c, Proportion of variance explained by the first ten PCs. d, ADMIXTURE results displaying proportions of individual genomes (columns) attributed to inferred ancestry components. For MAGE, Geuvadis and AFGR, samples are stratified according to population and continental group labels from the respective source projects, whereas GTEx does not include population labels. A subset of 1KGP samples are present across multiple RNA-seq studies and therefore appear in multiple panels, but these samples were not duplicated within the input to ADMIXTURE. Ancestry components are modelling constructs that do not directly correspond to true ancestral populations, and the results of ADMIXTURE analysis strongly depend on sampling characteristics of the input data. Although k = 7 minimizes the cross-validation error within this combined dataset (Supplementary Fig. 1), alternative choices of k reflect structure at different scales (Supplementary Fig. 2). Map in a adapted from the US CIA World Factbook, 2005.
Fig. 2
Fig. 2. Patterns of transcriptomic diversity within and between populations.
a, Per gene estimates of the proportion of variance in gene expression level that is partitioned between continental groups and populations, as opposed to within continental groups or populations. b, Variance in expression level per gene (n = 20,154 genes) differs across continental groups, consistent with underlying differences in levels of genetic variation. Variance in expression level differs between continental groups (P < 1 × 10−10, one-tailed analysis of deviance). c, Per splicing cluster estimates of the proportion of variance in alternative splicing (intron excision ratios) that is partitioned between continental groups and populations, as opposed to within continental groups or populations. d, Variance in alternative splicing (intron excision ratios) per splicing cluster (n = 32,867 splicing clusters) differs across continental groups, consistent with underlying differences in levels of genetic variation. Variance in splicing differs between continental groups (P < 1 × 10−10, one-tailed analysis of deviance). In b and d, bars represent the first, second (median) and third quartiles of the data and whiskers are bound to 1.5× the interquartile range.
Fig. 3
Fig. 3. Mapping high-resolution eQTLs.
a, Number of credible sets per eGene, demonstrating evidence of widespread allelic heterogeneity, whereby multiple causal variants independently modulate expression of the same genes. b, Fine-mapping resolution, defined as the number of variants per credible set. c, A signature of negative selection against expression-altering variation, whereby eGenes under strong evolutionary constraint (defined as the top pLI decile reflecting intolerance to loss-of-function mutations; pink) possess fewer credible sets, on average, than other genes (blue). d, A signature of negative selection against expression-altering variation, whereby eQTLs of genes under strong evolutionary constraint (top pLI decile; pink) have smaller average effect sizes (aFC) than other genes (blue).
Fig. 4
Fig. 4. Fine-mapped cis-QTLs are strongly enriched in regulatory regions across multiple cell and tissue types.
a, A heatmap representing hierarchical clustering of the enrichment of cis-eQTLs in predicted chromatin states using the Roadmap Epigenomics 15-state chromHMM model across 127 cell and tissue samples. b, Distribution of absolute value of lead cis-eQTL effect sizes measured as log2(aFC) across putatively active chromatin states of LCLs linked to multi-tissue DHSs. Sample sizes describe the number of unique eVariants annotated as belonging to each of the DHS categories. Bars represent the first, second (median) and third quartiles of the data and whiskers are bound to 1.5× the interquartile range. c, Enrichment of lead sQTLs (n = 13,107 unique sVariants total, at least 5 per category) within functional annotation categories from Ensembl Variant Effect Predictor (left), along with the proportion of all lead sQTLs falling into each annotation category (right). Error bars denote 95% CI around the estimated sQTL enrichment in each category. Enrichment was calculated in comparison to a background set of variants matched on MAF and distance from the TSS. Annotation categories are not mutually exclusive and therefore sum to a proportion greater than 1. ES, embryonic stem; HSC, haematopoietic stem cell; iPS, induced pluripotent stem; Mesench, mesenchymal cell; Myosat, myosatellite cell; Neurosph, neurosphere; Sm., smooth; TssA, active TSS; TssAFlnk, flanking active TSS; TxFlnk, transcription at gene 5′ and 3′; Tx, strong transcription; TxWk, weak transcription; EnhG, genic enhancer; Enh, enhancer; ZNF/Rpts, ZNF genes plus repeats; Het, heterochromatin; TssBiv, bivalent/poised TSS; BivFlnk, flanking bivalent TSS/enhancer; EnhBiv, bivalent enhancer; ReprPC, repressed polycomb; ReprPCWk, weak repressed polycomb; Quies, quiescent/low; NMD, nonsense-mediated mRNA decay; LOF, loss of function; Hc, high confidence; Lc, low confidence.
Fig. 5
Fig. 5. Population-specific genetic effects on gene expression.
a, Joint distribution of AFs of fine-mapped lead eQTLs across continental groups in the 1KGP, stratifying on replication status in GTEx. Variants are categorized as unobserved (U; MAF = 0), rare (R; MAF < 0.05) or common (C; MAF ≥ 0.05) within each continental group. b, AF of a lead eQTL of GSTP1 (rs115070172) across populations from the 1KGP. c, Expression of GSTP1, stratified by genotype of rs115070172. Sample sizes describe the number of MAGE samples with each genotype. d, Expression of GSTP1, stratified by population label (PEL versus non-PEL). Sample sizes describe the number of MAGE samples in each population category. e, Mean FST value between the focal continental group and all other groups of lead eQTLs, stratifying by differential expression decile (contrasting the focal continental group with all other groups) of respective eGenes, where the 10th decile represents the most significant differential expression (n = 9,807 genes total). Differential expression P values (two-tailed) were obtained for each gene using a negative binomial generalized linear model contrasting each continental group with all other samples. For ce, bars represent the first, second (median) and third quartiles of the data and whiskers are bound to 1.5× the interquartile range. For e, data outside whisker ranges are shown as dots. f, Number of significant genotype-by-continental group interactions at varying P value thresholds (dashed line denotes Bonferroni threshold) for a model that considers a single causal variant (left) versus multiple potential causal variants per gene (right), stratifying by the number of credible sets for the gene. P values are one-tailed and are obtained from a F-test. Map in b reproduced using the browser described in ref. , under a Creative Commons licence CC BY 4.0.
Extended Data Fig. 1
Extended Data Fig. 1. Distribution of eQTL effect sizes.
(A) Distribution of lead eQTL effect sizes from autosomal genes, measured as log2(aFC). This distribution is expected to be roughly symmetric as, for each variant, the sign of the effect is entirely dependent on which allele is denoted the reference allele. Vertical dotted lines denote a two-fold change to expression (log2(aFC) = ±1). Most eQTLs have a relatively small effect on expression level. (B) Cumulative distribution of the absolute value of effect size across lead eQTLs from autosomal genes. Only 2031 (13%) lead eQTLs had greater than a twofold effect on gene expression (median |log2(aFC)| = 0.30).
Extended Data Fig. 2
Extended Data Fig. 2. Mapping of high-resolution sQTLs.
(A) Number of credible sets per sIntron, where we define sIntrons as all introns (that passed filtering) for autosomal genes identified as sGenes in the FastQTL permutation pass. We ran SuSiE separately for each sIntron. (B) Resolution of sIntron fine-mapping, defined as the number of variants per credible set. (C) After fine-mapping, overlapping intron-level credible sets were iteratively merged to produce gene-level credible sets. Panel C shows the number of merged credible sets per sGene. (D) The resolution of sGene fine-mapping, defined as the number of variants per merged credible set. These results demonstrate evidence of widespread allelic heterogeneity whereby multiple causal variants independently modulate splicing patterns of the same genes.
Extended Data Fig. 3
Extended Data Fig. 3. Evidence of negative selection on expression-altering variation across a range of mutational constraint metrics.
(A) Top row: number of credible causal sets for genes in (pink) and outside (blue) the top decile of various gene-level constraint metrics (pLI, LOEUF, pHaplo, pTriplo, hs, RVIS) obtained from the literature. P-values are two-tailed and based on quasi-Poisson generalized linear models that include mean expression level as a covariate. Bottom row: effect sizes (|log2(aFC)|) of lead eQTLs within (pink) and outside (blue) the same categories. P-values are two-tailed and based on Wilcoxon rank sum tests. (B) Same as panel A, but for mean PhyloP scores summarizing conservation among genome sequence alignments of 447 mammals within putative promoter elements, defined based on intervals around the TSS ([−1000, 1000] bp, [−500, 0] bp, [−50, 0] bp).
Extended Data Fig. 4
Extended Data Fig. 4. Population stratification of eQTLs.
Geographic frequencies of autosomal lead eQTLs found in MAGE across (A) all lead eQTLs, (B) excluding variants with allele frequencies > 5% across all continental groups, (C) only including variants unobserved in the European continental group, and (D) only including variants unobserved in both the European and African continental groups. The geographic distributions are sorted with the most common at the bottom and rarest at the top. Allele frequencies are categorized as unobserved (U), rare variants with allele frequencies <5% (R), and common variants (C) with allele frequencies ≥5%.
Extended Data Fig. 5
Extended Data Fig. 5. Population stratification of sQTLs.
Geographic frequencies of autosomal lead sQTLs found in MAGE across (A) all lead sQTLs, (B) excluding variants with allele frequency > 5% across all continental groups, (C) only including variants unobserved in the European continental group, and (D) only including variants unobserved in both the European and African continental groups. Allele frequencies are categorized as unobserved (U), rare variants with allele frequencies <5% (R), and common variants (C) with allele frequencies ≥5%.
Extended Data Fig. 6
Extended Data Fig. 6. GTEx DAP-G fine-mapping signals that do not replicate in MAGE are largely tissue-specific.
Comparison of number of tissues contained by 79,915 cross-tissue merged credible sets (ctmCS) from GTEx that do not replicate in MAGE against 7,913 ctmCS that replicate in MAGE. The number of tissues is defined as the number of tissues across all variants included in a ctmCS.
Extended Data Fig. 7
Extended Data Fig. 7. Results of eQTL genotype-by-principal component interaction test.
Analogous to Fig. 5f, the number of significant genotype-by-principal component interactions at varying p-value thresholds (dashed line denotes Bonferroni threshold) for a model that considers a single causal variant for each gene (left panel) versus a model that jointly considers multiple potential causal variants per gene (right panel). Genotype-by-principal component interactions were included for each of the top 5 global genotype principal components. Results are stratified by the number of credible sets for the gene (from one to five or greater). As in Fig. 5f, p-values are one-tailed and are obtained from an F-test comparing a model with an interaction term to one without.
Extended Data Fig. 8
Extended Data Fig. 8. High concordance in credible set effect sizes across continental groups.
(A) Heatmap showing the fraction of shared causal signals where log2(aFC) is not significantly different in pairs of continental groups after Bonferroni correction. n represents the total number of shared causal signals in each pair of continental groups. (B) Scatterplots comparing log2(aFC) within pairs of continental groups. Points are colored by whether the effect sizes are significantly different. The black line plots y = x (i.e., theoretical identical effect sizes). The gray line plots the best fit linear trendline. Significance is based on Bonferroni-corrected p-values (two-tailed) from a Welch’s t-test of log2(aFC) values in each pair of continental groups.

Update of

References

    1. Li, Y. I. et al. RNA splicing is a primary link between genetic variation and disease. Science352, 600–604 (2016). - PMC - PubMed
    1. Brem, R. B., Yvert, G., Clinton, R. & Kruglyak, L. Genetic dissection of transcriptional regulation in budding yeast. Science296, 752–755 (2002). - PubMed
    1. Morley, M. et al. Genetic analysis of genome-wide variation in human gene expression. Nature430, 743–747 (2004). - PMC - PubMed
    1. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature501, 506–511 (2013). - PMC - PubMed
    1. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature550, 204–213 (2017). - PMC - PubMed

MeSH terms

LinkOut - more resources