Genome Biol. 2024 Aug 6;25(1):208.
doi: 10.1186/s13059-024-03352-1.

DNA-binding factor footprints and enhancer RNAs identify functional non-coding genetic variants

Simon C Biddie et al. Genome Biol.

Abstract

Background: Genome-wide association studies (GWAS) have revealed a multitude of candidate genetic variants affecting the risk of developing complex traits and diseases. However, the highlighted regions are typically in the non-coding genome, and uncovering the functional causative single nucleotide variants (SNVs) is challenging. Prioritization of variants is commonly based on genomic annotation with markers of active regulatory elements, but current approaches still poorly predict functional variants. To address this, we systematically analyze six markers of active regulatory elements for their ability to identify functional variants.

Results: We benchmark against molecular quantitative trait loci (molQTL) from assays of regulatory element activity that identify allelic effects on DNA-binding factor occupancy, reporter assay expression, and chromatin accessibility. We identify the combination of DNase footprints and divergent enhancer RNA (eRNA) as markers for functional variants. This signature provides high precision at the cost of low recall, and thus substantially reduces candidate variant sets when prioritizing variants for functional validation. We present this as a framework called FINDER (Functional SNV IdeNtification using DNase footprints and eRNA).

Conclusions: We demonstrate the utility of this framework for prioritizing variants using the leukocyte count trait, and we analyze variants in linkage disequilibrium with a lead variant to predict a functional variant in asthma. Our findings have implications for prioritizing variants from GWAS, for developing predictive scoring algorithms, and for functionally informed fine-mapping approaches.

Keywords: Functional genetics; Functional genomics; Genome-wide association study; Non-coding genome; Non-coding variants; Single nucleotide polymorphism; Single nucleotide variants.
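The precision and recall trade-off described in the Results can be illustrated with a minimal sketch. It assumes that each variant is reduced to an identifier, that feature overlap has already been computed, and that a candidate is called "predicted functional" when it overlaps every feature in the combination. The function name and the toy variant IDs are hypothetical and are not taken from the paper.

```python
def precision_recall(candidates, feature_sets, benchmark):
    """Score a feature combination against a molQTL benchmark.

    candidates   : set of candidate variant IDs (e.g. GWAS variants)
    feature_sets : list of sets, each holding the IDs that overlap one feature
    benchmark    : set of variant IDs reported as molQTLs
    """
    # A variant is predicted functional only if it overlaps every feature.
    predicted = set(candidates)
    for feature in feature_sets:
        predicted &= feature

    positives = candidates & benchmark      # candidates that are molQTLs
    true_positives = predicted & positives

    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(positives) if positives else 0.0
    return precision, recall


# Toy data (hypothetical): ten candidates, five of which are molQTLs.
candidates = {f"v{i}" for i in range(1, 11)}
footprints = {"v1", "v2", "v3", "v4"}
erna = {"v2", "v3", "v5"}
molqtl = {"v2", "v3", "v6", "v7", "v8"}

precision, recall = precision_recall(candidates, [footprints, erna], molqtl)
print(precision, recall)  # 1.0 0.4
```

In this toy case the combined signature keeps only two variants, both of which are molQTLs (precision 1.0), while missing three of the five molQTLs (recall 0.4), which mirrors the high-precision, low-recall behaviour reported above.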

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
A compendium of markers of active regulatory regions. A Heatmap for six datasets that mark active regulatory elements. H3K27ac data were obtained from two different sources. Footprints are derived from DNase-seq data. The number of biosources (cell lines, primary cells, or tissues) is indicated. For ChIP-seq datasets, this consists of 956 DNA-binding factors. B Boxplot of feature length for each biosource; the number shown is the median length in base pairs (bp). N, total number of elements for each feature, summed across all biosources regardless of overlapping features. C Genomic annotations for markers of active regulatory elements from merged biosources. TTS, transcription termination site; TSS, transcription start site. D Pairwise Jaccard index for merged intervals from DNase-seq, DNase footprints, ATAC-seq, ChIP-seq, eRNA, and H3K27ac
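As an illustration of the pairwise Jaccard index in panel D, the sketch below computes a base-pair-level Jaccard index (intersection length divided by union length) for two interval sets on a single chromosome. Whether the authors computed the index per base pair or per interval is not stated in the caption, so the base-pair definition is an assumption, as are the function names and the toy coordinates.

```python
def merge(intervals):
    """Merge overlapping or abutting half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals):
    """Total number of base pairs covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def jaccard(a, b):
    """Base-pair Jaccard index of two interval sets on one chromosome."""
    a, b = merge(a), merge(b)
    union = total_length(merge(a + b))
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    intersection = total_length(a) + total_length(b) - union
    return intersection / union if union else 0.0


# Toy coordinates (hypothetical), e.g. DNase-seq peaks versus eRNA regions.
dnase = [(0, 100), (200, 300)]
erna = [(50, 150), (250, 260)]
print(round(jaccard(dnase, erna), 2))  # 0.24
```

Here the two sets share 60 bp out of a 250 bp union, giving 0.24. A genome-wide version would apply the same calculation per chromosome and sum the intersection and union lengths before dividing.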
Fig. 2
Molecular quantitative trait loci (molQTL) with effects on regulatory element activity. A MolQTL with effects on regulatory element activity include binding QTLs (bQTL), reporter assay QTLs (raQTL) from massively parallel reporter assays (MPRA), and chromatin accessibility QTLs (caQTL). The model depicts binding of a DNA-binding factor, which is altered by a single nucleotide variation, thus reducing binding, reporter assay expression, and chromatin accessibility. B Heatmap showing the molQTL datasets and biosources. bQTL data are derived from ChIP-seq experiments of 1073 DNA-binding factors. C Proportional Venn diagram of the number of QTLs from each assay showing minimal overlap
Fig. 3
Identifying functional variants using combinations of markers of active regulatory elements. A Heatmap of precision Z-scores for individual markers, or combinations of markers, for GWAS catalog variants. Variants from the GWAS catalog were intersected with each feature or combination of features and benchmarked against molQTLs. Unsupervised clustering was performed, and the percentage of each feature represented in each cluster is shown below. B Fold change of precision scores for functional variant discovery by feature combinations from random genomic SNVs. Common (MAF ≥ 1%) SNVs from the whole genome were subsampled to a number comparable to the GWAS catalog. Precision scores for variants that intersect with feature(s), and co-localize with a molQTL, are expressed as fold change relative to the number of subsampled variants that co-localize with a molQTL. Feature combinations are sorted from lowest to highest fold change. One hundred iterations of subsampling from dbSNP were performed, and error bars show the standard deviation. C SHAP (SHapley Additive exPlanations) values for a machine learning model using a random forest classifier to identify the importance of each active regulatory mark in the functional variant predictive model
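The resampling scheme in panel B can be sketched as follows. This is one possible reading of the caption, not the authors' code: the fold change is taken as the precision among feature-overlapping variants divided by the fraction of all subsampled variants that co-localize with a molQTL, and the mean and standard deviation are taken over repeated random subsets. All names, sizes, and the integer variant IDs are hypothetical.

```python
import random
import statistics


def precision_fold_change(variants, in_features, molqtl):
    """Precision within features relative to the background molQTL rate."""
    background = len(variants & molqtl) / len(variants)
    selected = variants & in_features
    if not selected or background == 0:
        return None  # undefined for this subset
    precision = len(selected & molqtl) / len(selected)
    return precision / background


def resampled_fold_change(pool, in_features, molqtl, subset_size,
                          iterations=100, seed=0):
    """Mean and standard deviation of the fold change over random subsets."""
    rng = random.Random(seed)
    pool = sorted(pool)  # fixed order so a given seed is reproducible
    values = []
    for _ in range(iterations):
        subset = set(rng.sample(pool, subset_size))
        fold_change = precision_fold_change(subset, in_features, molqtl)
        if fold_change is not None:
            values.append(fold_change)
    return statistics.mean(values), statistics.stdev(values)


# Toy data (hypothetical): 1000 common SNVs, 100 of them molQTLs,
# and 100 overlapping the feature combination (half of which are molQTLs).
pool = set(range(1000))
molqtl = set(range(100))
in_features = set(range(50)) | set(range(500, 550))

mean_fc, sd_fc = resampled_fold_change(pool, in_features, molqtl,
                                       subset_size=200, iterations=100)
print(f"fold change: {mean_fc:.2f} +/- {sd_fc:.2f}")
```

Subsets for which the fold change is undefined (no feature overlap or no molQTL) are skipped here; a different handling, such as counting them as zero, would be an equally defensible choice.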
Fig. 4
Centrality in the identification of functional variants, and benchmarking against non-coding Mendelian disease-associated variants. A Precision score of DHS cores and DHS central lengths around summits, compared to DNase footprints with eRNA. Precision scores are expressed as the fold change of the precision score for the indicated feature over the precision score for GWAS variants alone. Scores are benchmarked for each molQTL. B Distribution of the nearest DHS footprint up- and downstream relative to DHS summits. C Genomic annotation of validated, rare, non-coding Mendelian disease-associated variants. D Precision-recall graph determined from 230 non-coding Mendelian disease-associated variants, with spike-in of random genomic variants from dbSNP (~4,000) to produce a validated variant probability of ~0.05. Features were intersected to determine precision (positive predictive value, PPV) and recall (sensitivity). Binary outcomes were determined from feature intersections
Fig. 5
Prioritization of variants using DNase footprints and eRNA preserves cell-specific features and functions. A Genomic annotation of all variants for 53 traits from the GWAS catalog, and annotations when intersected with DNase footprints, eRNA, or both in combination. B Gene ontology (GO) analysis of proximity genes relative to variants of the lymphocyte count trait, for all variants for the trait from the GWAS catalog and for variants in DNase footprints, eRNA, or both. Significance of the GO term is expressed as − log10 of the FDR q value (FDRq). The top ten GO terms for all variants are shown with the FDRq for each intersection. C Number of genes for the GO term “regulation of immune response” from all variants compared to the number of genes from variants in footprints and eRNA. D Enrichment analysis, using FORGE2, of cell-type specific active CREs for the lymphocyte count trait, comparing variants in DNase footprints and eRNA with an equal number (333) of randomly selected variants from all lymphocyte count variants from the GWAS catalog. The random selection was repeated ten times, and the mean p value (− log10) is shown. E An example lymphocyte count-associated variant (rs12722502) at the IL2RA locus which co-localizes with a DNase footprint and eRNA. UCSC genome browser shot shows merged features and ENCODE DNase-seq bigwig tracks for each cell-/tissue-type. F Predicted impact of rs12722502 on DNA-binding factors. Scatter plot of the major and minor allele binding score for DNA-binding factor motifs showing predicted gain or loss of binding. Example DNA-binding factors are labeled
Fig. 6
DNase footprints and eRNA can prioritize functional variants among variants in LD with a lead variant. A Using GWAS variants from 53 traits, mean precision and recall scores were determined by intersecting the indicated feature(s) with GWAS and LD variants, against a combined set of QTL variants from bQTL, raQTL, and caQTL. LD variants were determined for each GWAS variant using an R² > 0.7. For ATAC, intervals were obtained from GTRD, or from ENCODE with IDR thresholding. Lines across the mean show the 95% confidence interval of the precision or recall score. B Precision and recall scores for GWAS and LD variants for the leukocyte count trait, asthma, and type 2 diabetes (T2DM). C Manhattan plot of the asthma-associated rs72823641 variant (red), with variants in LD ≥ 0.7 indicated in orange. Merged feature tracks are shown for the locus. Genome co-ordinates (Mb) are indicated. D rs10173081 overlaps a DNase footprint and eRNA. UCSC genome browser shot for the intronic region of IL1RL1 with rs10173081 overlapping a myeloid-specific DHS. E rs10173081 is an eQTL for multiple genes in lung and whole blood. GTEx violin plots for normalized expression of the indicated genes across homozygous major allele, heterozygous major and minor allele, and homozygous minor allele. F Predicted impact of rs10173081 on DNA-binding factors. Scatter plot of the major and minor allele binding score for DNA-binding factor motifs showing predicted gain or loss of binding. Example DNA-binding factors are labeled
Fig. 7
The FINDER framework: Functional SNV IdeNtification using DNase footprints and Enhancer RNA. Markers of active regulatory elements, as merged datasets from multiple cell- and tissue-types, are used to predict functional variants. The combination of DNase footprints and eRNA provides the highest precision and can be applied to GWAS traits or used to predict functional variants in linkage disequilibrium (LD). Predicted variants can then be interrogated for cellular and molecular function, by deconvolving marker enrichment to predict relevant cell-types, or by predicting altered binding affinity for DNA-binding factors
