Genome Res. 2022 Jan;32(1):175-188.
doi: 10.1101/gr.275819.121. Epub 2021 Dec 7.

RefSeq Functional Elements as experimentally assayed nongenic reference standards and functional interactions in human and mouse

Catherine M Farrell et al. Genome Res. 2022 Jan.

Abstract

Eukaryotic genomes contain many nongenic elements that function in gene regulation, chromosome organization, recombination, repair, or replication, and mutation of those elements can affect genome function and cause disease. Although numerous epigenomic studies provide high coverage of gene regulatory regions, those data are not usually exposed in traditional genome annotation and can be difficult to access and interpret without field-specific expertise. The National Center for Biotechnology Information (NCBI) therefore provides RefSeq Functional Elements (RefSeqFEs), which represent experimentally validated human and mouse nongenic elements derived from the literature. The curated data set comprises richly annotated sequence records, descriptive records in the NCBI Gene database, reference genome feature annotation, and activity-based interactions between nongenic regions, target genes, and each other. The data set provides succinct functional details and transparent experimental evidence, leverages data from multiple experimental sources, is readily accessible and adaptable, and uses a flexible data model. The data have multiple uses for basic functional discovery, bioinformatics studies, and genetic variant interpretation; as known positive controls for epigenomic data evaluation; and as reference standards for functional interactions. Comparisons to other gene regulatory data sets show that the RefSeqFE data set includes a wider range of feature types representing more areas of biology, but it is comparatively smaller and subject to data selection biases. RefSeqFEs thus provide an alternative and complementary resource for experimentally assayed functional elements, with future data set growth expected.


Figures

Figure 1.
Workflow for RefSeqFE data set production. Full cylinders represent databases, the half-cylinder represents the indicated data source, and rectangles represent actions. Relevant links to additional information and data access are provided in Supplemental Table S1.
Figure 2.
Example of a biological region RefSeqFE flat file. Segments of RefSeq accession NG_052895.1 representing the hemoglobin subunit beta locus control region (HBB-LCR) are shown. (A) Top section of the flat file with a link to BioProject accession PRJNA343958 and the “RefSeqFE” keyword outlined in red. (B) Segment of the feature annotation section. Features are displayed for the 5′HS5 DNase I hypersensitive site (Tuan et al. 1985; Dhar et al. 1990; Wai et al. 2003), a transcriptional cis-regulatory region (Long et al. 1998), a CTCF binding site (Farrell et al. 2002; Bulger et al. 2003; Chan et al. 2008), and an enhancer-blocking element (Farrell et al. 2002). Features include “/experiment” qualifiers with experimental evidence from the literature as indicated by ECO strings and IDs and links to publications (blue tabs), “/note” qualifiers with descriptive information (gray tabs), “/function” qualifiers describing the function of each feature where applicable (green tabs), and a “/bound_moiety” qualifier for the protein-binding site (red tab). All features include a “/db_xref” qualifier (black tabs) linking to the biological region record in the Gene database (GeneID:109580095), and an INSDC class qualifier when relevant (orange tabs).
Figure 3.
Graphical displays of RefSeqFE data. (A) NCBI Genome Data Viewer display of genome-annotated features at the human opsin locus control region (OPSIN-LCR, GeneID:107604627, also shown in Supplemental Fig. S1). Underlying features are aggregated and displayed in the “Biological regions, aggregate” track (outlined in red). Depending on user track set options or the entry point to GDV, the track may need to be turned on via the configuration interface, as detailed on our web page (Supplemental Table S1, graphical displays link). Features are color coded according to class or type. Coordinates are based on positions on the genome sequence. An example of a mouseover-activated pop-up box is shown (overlaid gray box). These boxes contain descriptive and functional information (orange tab) (Nathans et al. 1989; Wang et al. 1992), including experimental evidence and links to publications, as well as a “Links & Tools” area (blue tab) linking to the related Gene database record and to sequences and BLAST analyses. (B) RefSeqFE Hub view of parental biological regions, underlying features, and gene regulatory and recombination partner interactions in the UCSC Genome Browser. Regulatory interactions are shown between the hemoglobin subunit alpha locus control region (HBA-LCR, GeneID:106144573) and the downstream HBZ, HBA2, HBA1, and HBQ1 genes (blue curved lines), whereas the recombination partners track visualizes recombination (green curved line) between two hemoglobin subunit alpha recombination regions (LOC106804612 and LOC106804613). Parental biological regions are denoted by black rectangles in the biological regions track, and the features track uses color coding as described for A. Further item-specific metadata, display options, and links to related data and tools can be found within item- and track-specific details pages. Depending on the density of interactions in a region, appropriate zoom levels or configuration modes may need to be adjusted, or specific hub settings such as multiregion view can be used for viewing interactions between distally located regions.
Figure 4.
RefSeqFE feature distributions. (A) Categorized feature counts from human AR 109.20201120 on the GRCh38.p13 genome assembly with grouping by feature class. The pale blue labels indicate the feature counts per category; categories and a full breakdown of feature types and counts are available in Supplemental Table S2A. (B) Box plot showing feature length distributions for all human features (light gray) and individual feature classes, with coloring as in A. Some outliers (maximum length 141,940) are not displayed because the y-axis was scaled to better visualize the distributions of shorter features. Length distributions per feature type are provided in Supplemental Figure S3 with customized scaling for each class: n = 9862, 1357, 1379, 926, and 6200 sample points. Additional statistics including minimums, maximums, averages, and standard deviations from the mean are provided in Supplemental Table S2A. (C) Categorized feature counts from mouse AR 109 on the GRCm39 genome assembly as shown for human in A. (D) Box plot showing feature length distributions for all mouse features (light gray) and individual feature classes, as described for human in B: n = 2271, 109, 690, and 1472 sample points. Additional details are provided in Supplemental Figure S4 and Supplemental Table S2A. (E) Summary table with overall counts of annotated features, biological region loci, and genome coverage for the indicated AR.
Figure 5.
Locations of RefSeqFE features relative to genes. (A) Locations of features from human AR 109.20201120 compared to NCBI-annotated genes and subparts from the same AR. The horizontal bar graph shows the overall locations (gene-overlapping or intergenic), whereas the bar-of-pie chart shows more detailed locations. Blue tones denote genes and subparts, and gray tones denote intergenic regions. The pale blue labels indicate overlapping feature counts for each location, as shown for called overlaps in Supplemental Table S3A. (B) Locations of features from mouse AR 109 as shown for human in A. (C) Violin plot showing completeness of human RefSeqFE feature overlaps (overlap length/RefSeqFE feature length) at each gene-relative location (blue- and gray-tone coloring as in A) and cumulative results for all locations (blue-gray distribution at left): n = 25,029, 5468, 2084, 4373, 743, 1735, 5235, 1906, and 3485 sample points. Supporting statistics (Fisher P-values, Jaccard statistics, degree of overlap minimums, maximums, averages, and standard deviations) are provided in Supplemental Table S3A. (D) Violin plot showing completeness of mouse feature overlaps at each gene-relative location as described for human in C: n = 5810, 1249, 502, 981, 97, 459, 1237, 578, and 707 sample points. Supporting statistics are provided in Supplemental Table S3A. (E) Biotype statistics for genes that are overlapped by RefSeqFE features. The count columns indicate the number of distinct genes overlapped by one or more features, whereas the percentage total columns indicate percentages of the total number of genes (2455 human, 565 mouse) overlapped by RefSeqFE features for each biotype.
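The completeness measure in panels C and D is defined in the caption as overlap length divided by RefSeqFE feature length. A minimal sketch of that ratio follows, assuming half-open `(start, end)` coordinates on a single sequence; the function name and the example coordinates are hypothetical and are not taken from the paper or its pipeline.

```python
def overlap_completeness(feature, region):
    """Return the fraction of `feature` that is covered by `region`.

    Both arguments are (start, end) tuples in half-open coordinates
    (an assumption made for this sketch).
    """
    f_start, f_end = feature
    r_start, r_end = region
    if f_end <= f_start:
        raise ValueError("feature must have positive length")
    # Overlap length is zero when the two intervals do not intersect.
    overlap = max(0, min(f_end, r_end) - max(f_start, r_start))
    return overlap / (f_end - f_start)


# Hypothetical coordinates: a 100-bp feature whose second half lies in an exon.
feature = (100, 200)
exon = (150, 400)
print(overlap_completeness(feature, exon))  # 0.5
```

A value of 1.0 means the feature lies entirely within the gene-relative location, and values near 0 indicate only marginal overlap.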
Figure 6.
Comparison of RefSeqFEs to other gene regulatory data sets. (A) Overview showing data derivation, feature type representation, and current sizes of each data set on the human GRCh38.p13 and mouse GRCm39 reference assemblies. Additional information for each data set is provided in Supplemental Table S5A. (B) Bar graph showing human AR 109.20201120 RefSeqFE feature intersections with the indicated data sets, for which the y-axis represents the percent of input RefSeqFE features showing overlap. All features in comparative data sets were intersected with either all RefSeqFE features (medium blue bars), RefSeqFE regulatory features (gray bars), or RefSeqFE enhancer features (light blue bars). Enhancer features from each data set were additionally intersected with RefSeqFE enhancer features (dark blue bars). Full statistics including input and overlapping feature counts, overlap percentages with respect to each data set, Fisher P-values, and Jaccard statistics are provided in Supplemental Table S5B, with raw intersection output, feature lengths, and degrees of overlap with respect to each data set in Supplemental Table S5D. Data sets showing overlap with each RefSeqFE feature are also indicated in Supplemental Table S3B, column G. (C) Box plot showing feature length distributions for the indicated human data sets. Some outliers and dbSUPER feature lengths (maximum 498,572) are not displayed because the y-axis was scaled to better visualize shorter feature distributions; see Supplemental Figure S6A for a 50-kb y-axis scale with dbSUPER data included: n = 9862, 926,535, 622,457, 63,285, and 1989 sample points. Additional statistics including minimums, maximums, averages, and standard deviations from the mean are provided in Supplemental Table S5A. (D) Bar graph showing mouse AR 109 RefSeqFE feature intersections with the indicated data sets, as described for human in B. Supporting details are provided in Supplemental Tables S3C and S5C,E. (E) Box plot showing feature length distributions for the indicated mouse data sets, as described for human in C: n = 2271, 343,747, 364,670, 49,802, and 1291 sample points. Supporting details are provided in Supplemental Table S5A and Supplemental Figure S6.
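The two intersection summaries named in this caption, the percent of input features showing overlap and the Jaccard statistic, can be illustrated with a small sketch. It assumes half-open `(start, end)` intervals on one sequence and a base-pair Jaccard definition (intersection length divided by union length). The tools and exact definitions the authors used are not stated here, so the function names, definitions, and coordinates below are illustrative assumptions only.

```python
def merge(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bp(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def percent_overlapping(query, reference):
    """Percent of query features that overlap any reference feature."""
    hits = sum(
        any(q_start < r_end and r_start < q_end for r_start, r_end in reference)
        for q_start, q_end in query
    )
    return 100.0 * hits / len(query)


def jaccard(a, b):
    """Base-pair Jaccard statistic: intersection length / union length."""
    union = covered_bp(list(a) + list(b))
    intersection = covered_bp(a) + covered_bp(b) - union
    return intersection / union if union else 0.0


# Hypothetical feature sets, not real RefSeqFE coordinates.
refseqfe = [(0, 100), (200, 300), (500, 600), (700, 800)]
other = [(50, 150), (550, 650)]
print(percent_overlapping(refseqfe, other))  # 50.0
print(jaccard(refseqfe, other))              # 0.2
```

The percentage depends on which data set is treated as the input, which is why the caption reports overlap percentages with respect to each data set, whereas the Jaccard statistic is symmetric.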

