Genome Biol. 2019 May 20;20(1):97.
doi: 10.1186/s13059-019-1707-2.

Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

Mark T W Ebbert et al. Genome Biol. 2019.

Abstract

Background: The human genome contains "dark" gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions.

Results: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer's Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in disease cases but not in controls.

Conclusions: While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer's disease because of insufficient sample size, we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.

Keywords: 10x Genomics; APOE; Alzheimer’s Disease Sequencing Project (ADSP); CR1; Camouflaged genes; Dark genes; Long-read sequencing; Oxford Nanopore Technologies (ONT); Pacific Biosciences (PacBio).

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Genomic regions may be “dark” by depth or mapping quality, many of which are “camouflaged”. Large, complex genomes are known to contain “dark” regions where reads from standard high-throughput short-read sequencing technologies cannot be adequately assembled or aligned. We split these dark regions into two types: (1) dark because of low depth and (2) dark because of low mapping quality (MAPQ), which are mostly “camouflaged”. a HLA-DRB5 encodes a Major Histocompatibility Complex protein that plays an important role in immune response and has been associated with several diseases, including Alzheimer’s disease. It is well known to be dark (low depth); specifically, when performing whole-genome sequencing using standard short-read sequencing technologies, an insufficient number of reads align, preventing variant callers from assessing mutations. We calculated sequencing depth across HLA-DRB5 for ten male samples from the Alzheimer’s Disease Sequencing Project (ADSP) that were sequenced using standard Illumina whole-genome sequencing with 100-nucleotide read lengths. Approximately 63.5% (49.6% of coding sequence) of HLA-DRB5 is dark by depth (≤ 5 aligned reads; indicated by red lines). b HSPA1A is a heat-shock protein from the 70-kilodalton (kDa) heat-shock protein family and plays an important role in stabilizing proteins against aggregation. HSPA1A is dark because of low mapping quality (MAPQ < 10 for ≥ 90% of reads at a given position). Approximately 41.1% (53.0% of coding sequence) of HSPA1A is dark by mapping quality (indicated by red line). Dark gray bars indicate sequencing reads with a relatively high mapping quality, whereas white bars indicate reads with a low mapping quality (MAPQ = 0). c Many genomic regions that are dark because of mapping quality arise because they have been duplicated in the genome, which we term “camouflaged” (or “camo genes”). When confronted with a read that aligns equally well to more than one location, standard sequence aligners randomly assign the read to one location and give it a low mapping quality. Thus, it is unclear from which gene any of the reads indicated by white bars originated. HSPA1A and HSPA1B are clear examples of camouflaged genes arising from a tandem duplication. The two genes are approximately 14 kb apart and approximately 50% of the genes are identical
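The two thresholds quoted in the Fig. 1 caption (≤ 5 aligned reads for dark by depth; MAPQ < 10 for ≥ 90% of reads for dark by mapping quality) can be illustrated with a minimal Python sketch. The function names, the input layout (one list of MAPQ values per position), and the choice to test depth before mapping quality are assumptions made for illustration; this is not the authors' implementation.

```python
# Thresholds taken from the Fig. 1 caption.
DEPTH_MAX = 5             # dark by depth: <= 5 aligned reads
MAPQ_MIN = 10             # a read counts as low quality if MAPQ < 10
LOW_MAPQ_FRACTION = 0.90  # dark by MAPQ: >= 90% of reads are low quality


def classify_position(mapqs):
    """Classify one position from the MAPQ values of the reads covering it.

    Assumption: depth is checked first, so a position with very few reads
    is reported as dark by depth regardless of mapping quality.
    """
    depth = len(mapqs)
    if depth <= DEPTH_MAX:
        return "dark_by_depth"
    low_quality = sum(1 for q in mapqs if q < MAPQ_MIN)
    if low_quality / depth >= LOW_MAPQ_FRACTION:
        return "dark_by_mapq"
    return "not_dark"


def percent_dark(per_position_mapqs):
    """Percentage of positions that are dark by either definition."""
    dark = sum(
        1 for mapqs in per_position_mapqs
        if classify_position(mapqs) != "not_dark"
    )
    return 100.0 * dark / len(per_position_mapqs)
```

For example, a position covered by 30 reads that all have MAPQ = 0 would be classified as dark by mapping quality, whereas a position covered by 3 reads would be dark by depth.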
Fig. 2
Many dark regions involve protein-coding gene regions. We identified 36,794 dark regions (> 15 million nucleotides) in 6054 gene bodies that were either dark by depth or dark by mapping quality. a Stratifying the gene bodies by GENCODE biotype, 3804 gene bodies were protein coding, 1232 were pseudogenes, and 753 were long intergenic non-coding RNAs (lincRNA). b Of all 36,794 dark regions, 27,982 were intronic, 4351 were in ncRNA exons, 2855 were in protein-coding exons (CDS), 908 were in 5′UTR regions, and 698 were in 3′UTR regions. Any dark region that spanned a gene element boundary (e.g., intron to exon) was split into separate dark regions
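The splitting rule at the end of the Fig. 2 caption (a dark region that spans a gene element boundary is split into separate dark regions) can be sketched as a simple interval intersection. The half-open coordinate convention, the function name, and the tuple layout are assumptions for illustration only. The sketch also checks that the per-element counts in the caption sum to the reported total of 36,794.

```python
def split_by_elements(dark_start, dark_end, elements):
    """Split one dark region at gene element boundaries.

    Coordinates are half-open [start, end). `elements` is a list of
    non-overlapping (start, end, label) tuples, e.g. introns and exons.
    Returns one (start, end, label) piece per overlapped element.
    """
    pieces = []
    for start, end, label in sorted(elements):
        lo, hi = max(dark_start, start), min(dark_end, end)
        if lo < hi:  # the dark region overlaps this element
            pieces.append((lo, hi, label))
    return pieces


# Per-element counts of dark regions as given in the Fig. 2 caption.
REGION_COUNTS = {
    "intron": 27982,
    "ncRNA exon": 4351,
    "CDS": 2855,
    "5'UTR": 908,
    "3'UTR": 698,
}
TOTAL_DARK_REGIONS = sum(REGION_COUNTS.values())
```

A dark region covering positions 90 to 130 that crosses an intron, a coding exon at 100 to 120, and a second intron would therefore be counted as three dark regions.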
Fig. 3
Dark coding regions occur throughout the genome and are largely resolved with long-read sequencing technologies. We identified 2855 dark coding (CDS) regions in 748 protein-coding genes that were dark by either depth or mapping quality (Additional file 2: Table S1; Additional file 3: Table S2). We found that 117 (15.6%) of the 748 protein-coding genes were 100% dark in CDS regions, 402 (53.7%) were at least 25% dark in CDS regions, and 592 (79.1%) were at least 5% dark in CDS regions (Additional file 2: Table S1). a We mapped all protein-coding gene bodies with a dark coding exon to the genome to visualize their genomic locations; they are generally spread throughout the genome, although there are several tight clusters of dark CDS regions on chromosomes 1, 9, 10, and Y. b We assessed how well increasing read lengths would resolve dark regions by assessing samples sequenced with Illumina whole-genome sequencing using 250-nucleotide read lengths, as well as the linked-read and long-read technologies from 10x Genomics, Oxford Nanopore Technologies (ONT), and Pacific Biosciences (PacBio). Data from the samples sequenced using 250-nucleotide Illumina read lengths reduced the area under the curve (AUC) by 12.1% in CDS regions. Relative to the standard Illumina 100-nucleotide read lengths, 10x Genomics, PacBio, and ONT reduced the AUC for CDS regions by approximately 49.5%, 64.4%, and 90.4%, respectively. The AUC for each technology is scaled in reference to Illumina sequencing based on 100-nucleotide read lengths (i.e., AUC for Illumina 100-nucleotide read lengths = 1). In contrast to overall results, PacBio outperformed 10x Genomics when looking only at CDS regions (see text). Most analyses focused on genes where at least 5% of the CDS nucleotides are dark, indicated by the dashed line. c, d We also calculated the raw number of dark nucleotides for each technology in GRCh38, genome wide, in full gene bodies, and in CDS regions
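The percentages in the Fig. 3 caption can be cross-checked with a few lines of arithmetic. Because the AUC is scaled so that Illumina 100-nucleotide reads equal 1, a reduction of r% leaves (100 − r)% of the dark CDS signal, which appears to be how the abstract's figures of 50.5%, 35.6%, and 9.6% relate to the reductions quoted here. The helper names below are hypothetical and the sketch only reproduces numbers stated in the text.

```python
def remaining_percent(auc_reduction_percent):
    """Scaled AUC left after a given percent reduction (baseline = 100%)."""
    return round(100.0 - auc_reduction_percent, 1)


def percent_of_genes(count, total=748):
    """Share of the 748 protein-coding genes with dark CDS, in percent."""
    return round(100.0 * count / total, 1)


# AUC reductions in CDS regions relative to Illumina 100-nucleotide reads.
AUC_REDUCTION = {
    "Illumina 250-nt": 12.1,
    "10x Genomics": 49.5,
    "PacBio": 64.4,
    "ONT": 90.4,
}
REMAINING = {tech: remaining_percent(r) for tech, r in AUC_REDUCTION.items()}
```

Under this reading, `REMAINING` holds 50.5 for 10x Genomics, 35.6 for PacBio, and 9.6 for ONT, matching the abstract.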
Fig. 4
Pathways relevant to human health, development, and reproductive function are affected by dark and camouflaged genes. We characterized the pathways for dark and camouflaged genes using Metascape.org, including only genes where at least 5% of the CDS regions were dark (565 unique gene symbols; based on standard Illumina 100-nucleotide read lengths). a Specific pathway groups included Ub-specific processing proteases (R-HSA-5689880; logP = − 10.70), defensins (R-HSA-1461973; logP = − 9.43), ncRNA 3′-end processing (GO:0043628; logP = − 8.87), gonadal mesoderm development (GO:0007506; logP = − 8.76), spermatogenesis (GO:0007283; logP = − 8.29), spindle assembly (GO:0051225; logP = − 7.56), NLS-bearing protein import into nucleus (GO:0006607; logP = − 6.63), methylation-dependent chromatin silencing (GO:0006346; logP = − 4.98), activation of GTPase activity (GO:0090630; logP = − 4.67), and others. b Looking specifically at known protein-protein interactions, we found 103 proteins with 172 known interactions (Additional file 1: Figure S3) and, within those, identified four groups enriched for protein-protein interactions using the MCODE algorithm [28] (Fig. 4b). All four MCODE groups combined are primarily associated with RNA transport (hsa03013; logP = − 18.59; Additional file 1: Figure S4; accessed March 2019). Individually, the first group (MCODE1) is enriched for proteins involved in systemic lupus erythematosus (hsa05322; logP = − 6.55), cellular response to stress (R-HSA-2262752; logP = − 6.13), and RNA transport (hsa03013; logP = − 4.26; Additional file 1: Figure S5). The second group (MCODE2) is enriched with proteins involved in NLS-bearing protein import into nucleus (GO:0006607; logP = − 18.44; Additional file 1: Figure S6). The third and fourth groups do not have significant enrichment associations, likely because little is known about them; five of the six genes (PRR20C, PRR20D, PRR20E, SMN1, and SMN2) are completely or nearly 100% camouflaged, and several do not even have known expression measurements in GTEx [29] (Additional file 1: Figures S7-S9)
Fig. 5
Seventy-six dark genes (≥ 5% CDS) are associated with 326 human diseases, including autism, inflammatory bowel disease, and others. We found 76 genes ≥ 5% dark CDS that harbor mutations associated with 326 unique human diseases, according to the Human Gene Mutation Database (HGMD). a Some of the diseases with the most known associated genes include autism spectrum disorder, schizophrenia, hearing loss, spinal muscular atrophy, and inflammatory bowel disease. Word size represents the number of genes associated with each disease. These data demonstrate the number of diseases impacted by genes that are at least 5% dark CDS, and how important it is to completely resolve dark regions. We also performed an enrichment analysis, where the diseases most enriched for dark genes included color blindness (protan color vision defect), X-linked cone-rod dystrophy, and spinal muscular atrophy (Additional file 1: Figure S10). b Similarly, we quantified the number of diseases each gene was associated with and identified many disease-relevant genes with large portions of dark CDS regions that may harbor critical disease-modifying mutations that currently go undetected. Some of the genes with the most known disease associations include ARX (12.8% dark CDS), NEB (9.5% dark CDS), TBX1 (10.6% dark CDS), RPGR (8.6% dark CDS), HBA2 (9.5% dark CDS), and CR1 (26.0% dark CDS). CR1 is particularly notable for neuroscientists and Alzheimer’s disease geneticists, patients, and their caregivers, given that CR1 is a top-ten Alzheimer’s disease gene. Other notable genes include SMN1 (94.6% dark CDS) and SMN2 (88.0% dark CDS), which are known to harbor mutations (in camouflaged regions) that are involved in spinal muscular atrophy (SMA) and are implicated in ALS. HSPA1A (53.0% dark CDS) and HSPA1B (51.5% dark CDS) also encode two primary 70-kilodalton (kDa) heat-shock proteins. Heat-shock proteins have been implicated in ALS [31, 32]
Fig. 6
Camouflaged genes are consistently dark in gnomAD, but dark-by-depth genes may be sample or dataset specific. Many dark genes are specifically camouflaged (Additional file 13: Table S12; Additional file 14: Table S13), but many are dark by depth; we found that camouflaged regions in the ADSP are consistently dark in the gnomAD consortium data (http://gnomad.broadinstitute.org/) [36]. Dark-by-depth regions may be more variable between samples and datasets, however, suggesting these regions may be sensitive to specific aspects of whole-genome sequencing (e.g., library preparation) or downstream analyses. a SMN1 and SMN2 are camouflaged by each other (only SMN1 shown). Both genes contribute to spinal muscular atrophy and have been implicated in ALS. b HSPA1A and HSPA1B are also camouflaged by each other (only HSPA1A shown). The heat-shock protein family has been implicated in ALS. c NEB (9.5% dark CDS) is a special case that is camouflaged by itself. NEB is associated with 24 diseases in the HGMD, including nemaline myopathy, a hereditary neuromuscular disorder. NEB is a large gene; thus, 9.5% dark CDS translates to 2424 protein-coding bases. d CR1 is a top Alzheimer’s disease gene that plays a critical role in the complement cascade as a receptor for the C3b and C4b complement components, and potentially helps clear amyloid-beta (Aβ) [–39]. CR1 is also camouflaged by itself, where the repeated region includes the extracellular C3b and C4b binding domain. The number of repeats and density of certain isoforms have been associated with Alzheimer’s disease [, –43]. e HLA-DRB5 is dark by depth in the ADSP and gnomAD data. HLA-DRB5 has been implicated in several diseases, including Alzheimer’s disease. f RPGR is likewise dark in ADSP and gnomAD and is associated with several eye diseases, including retinitis pigmentosa and cone-rod dystrophy. g ARX is dark by depth, but this varies by sample or cohort, as approximately 70% of gnomAD samples are not strictly dark by depth. ARX is associated with diseases including early infantile epileptic encephalopathy 1 (EIEE1) and Partington syndrome. h Similarly, TBX1 is not strictly dark by depth in approximately 70% of gnomAD samples. The Y axes for panels a–f indicate median coverage in gnomAD (blue = exomes; green = genomes), whereas the Y axes in g, h represent the proportion of gnomAD samples that have > 5x coverage. Dark and camouflaged regions, as well as the percentage of each gene’s CDS region that is dark, are indicated by red lines. Dark regions in exome data are either similar to or more pronounced than what we observed in whole-genome data, highlighting that dark and camouflaged regions are generally magnified in whole-exome data. For interest, we also discovered that APOE, the top genetic risk for Alzheimer’s disease [–46], is approximately 6% dark CDS (by depth) for certain ADSP samples with whole-genome sequencing, and the same region is dark in gnomAD whole-exome data (Additional file 1: Figure S11)
Fig. 7
Long-read technologies resolve many camouflaged regions, with variable success. We found that ONT’s long-read technology appeared to resolve all camouflaged regions well with the high sequencing depth. PacBio performed similarly well, and 10x Genomics performs well under certain circumstances. a SMN1 and SMN2 were 94.6% and 88.0% camouflaged CDS, respectively, based on standard Illumina sequencing with 100-nucleotide read lengths (illuminaRL100). Both genes were 0% camouflaged CDS for 10x Genomics, PacBio, and ONT data. 10x Genomics and ONT perform particularly well in these genes, with consistently high mapping coverage. b HSPA1A and HSPA1B were 53.0% and 51.5% camouflaged CDS, respectively, based on illuminaRL100 data. Both genes were 0% camouflaged CDS based on ONT and PacBio data and were 45.8% and 51.8% camouflaged CDS based on 10x Genomics data. In contrast to the results for SMN1 and SMN2, 10x Genomics was unable to resolve the HSPA1A and HSPA1B camouflaged regions. c CR1 was 26.0% camouflaged CDS based on illuminaRL100. 10x Genomics did not improve coverage for CR1; the region remained 26.4% camouflaged CDS. Both ONT and PacBio were 0% camouflaged CDS. While both PacBio and ONT were able to fill the camouflaged region, coverage dropped throughout the region, particularly for PacBio. The duplicated region is indicated by blue bars, where white lines indicate regions that have diverged sufficiently for short-reads to align uniquely. Regions were visualized with IGV. Reads with a MAPQ < 10 were filtered, and insertions, deletions, and mismatches are not shown
Fig. 8
Many camouflaged regions can be rescued, including CR1, even in standard short-read sequencing data. Many large-scale whole-genome or whole-exome sequencing projects exist, covering tens of thousands of individuals. All of these datasets are affected by dark and camouflaged regions that may harbor mutations that either drive or modify disease in patients. Ideally, all samples would be re-sequenced using the latest technologies over time, but financial resources and biological samples are limited, making it essential to maximize the utility of existing data. We developed a method to rescue mutations in most camouflaged regions, including in standard short-read sequencing data. When confronted with a sequencing read that aligns to two or more regions equally well (with high confidence), most aligners (e.g., BWA [–13]) will randomly assign the read to one of the regions with a low mapping quality (e.g., MAPQ = 0 for BWA). a Because the reads are already aligned to one of the regions, we can use the following steps to rescue mutations in most camouflaged regions: (1) extract reads from camouflaged regions, (2) mask all highly similar regions in the reference genome, except one, and re-align the extracted reads, (3) call mutations using standard methods (adjusting for ploidy), and (4) determine precise location using targeted sequencing (e.g., long-range PCR combined with Sanger, or targeted long-read sequencing [1]). Without competing camouflaged regions to confuse the aligner, the aligner will assign a high mapping quality, allowing variant callers to behave normally. b Exons 10, 18, and 26 in CR1 are identical, according to the reference genome. Standard aligners will randomly scatter reads matching that sequence across these exons and assign a low mapping quality (e.g., MAPQ = 0 for BWA; indicated as hollow reads). Red lines indicate an individual’s mutation that exists in one of these exons, but reads containing this mutation also get scattered and assigned a low mapping quality. c By masking exons 18 and 26, we can align all of these reads to exon 10 with high mapping qualities to determine whether a mutation exists. We cannot determine at this stage which of the three exons the mutation is actually located in, but researchers can test association with a given disease to determine whether the mutation is worth further investigation. d As a proof of principle, we rescued approximately 4214 exonic variants in the ADSP (TiTv = 2.26) using our method, including a frameshift mutation in CR1 (MAF = 0.00019) that is found in five cases and zero controls (three representative samples shown). The frameshift results in a stop codon shortly downstream. The ADSP is not large enough to formally assess association between the CR1 frameshift and Alzheimer’s disease, but we believe the mutation merits follow-up studies given its location (CR1 binding domain) and CR1’s strong association with disease
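The masking idea in step (2) of the Fig. 8 caption can be illustrated with a toy Python sketch on a short string. It is only a conceptual model: a real analysis would operate on a reference FASTA and use an aligner and variant caller, none of which are shown here. The function names, the half-open coordinates, and the `effective_ploidy` helper (which assumes a diploid sample, so that reads from every copy pile up on the single unmasked copy) are assumptions for illustration, not the authors' code.

```python
def mask_all_but_one(reference, regions, keep_index=0):
    """Mask every highly similar region except one with 'N'.

    `regions` is a list of half-open (start, end) intervals that share
    the same sequence; only regions[keep_index] is left unmasked, so a
    read matching that sequence has a single place left to align.
    """
    seq = list(reference)
    for i, (start, end) in enumerate(regions):
        if i != keep_index:
            seq[start:end] = "N" * (end - start)  # length is preserved
    return "".join(seq)


def effective_ploidy(n_copies, sample_ploidy=2):
    """Ploidy to give the variant caller after masking (assumption:
    all copies are present at the sample's normal ploidy)."""
    return n_copies * sample_ploidy


# Toy reference with three identical 4-nt "exons", loosely mirroring
# the three identical CR1 exons described in the caption.
TOY_REFERENCE = "TTACGTGGACGTCCACGT"
TOY_REGIONS = [(2, 6), (8, 12), (14, 18)]
TOY_MASKED = mask_all_but_one(TOY_REFERENCE, TOY_REGIONS, keep_index=0)
```

In the toy reference the sequence `ACGT` occurs three times before masking and once afterwards, which is the property that lets an aligner assign a high mapping quality. As the caption notes, this does not reveal which copy actually carries a variant.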

References

    1. Ebbert MTW, Farrugia SL, Sens JP, Jansen-West K, Gendron TF, Prudencio M, et al. Long-read sequencing across the C9orf72 “GGGGCC” repeat expansion: implications for clinical use and genetic discovery efforts in human disease. Mol Neurodegener. 2018;13:46. doi: 10.1186/s13024-018-0274-4.
    2. Zheng-Bradley X, Streeter I, Fairley S, Richardson D, Clarke L, Flicek P, et al. Alignment of 1000 Genomes Project reads to reference assembly GRCh38. Gigascience. 2017;6:1–8. doi: 10.1093/gigascience/gix038.
    3. Callaway E. Human brain shaped by duplicate genes. Nature. 2012. doi: 10.1038/nature.2012.10584.
    4. Charrier C, Joshi K, Coutinho-Budd J, Kim J-E, Lambert N, de Marchena J, et al. Inhibition of SRGAP2 function by its human-specific paralogs induces neoteny during spine maturation. Cell. 2012;149:923–935. doi: 10.1016/j.cell.2012.03.034.
    5. Dennis MY, Nuttle X, Sudmant PH, Antonacci F, Graves TA, Nefedov M, et al. Evolution of human-specific neural SRGAP2 genes by incomplete segmental duplication. Cell. 2012;149:912–922. doi: 10.1016/j.cell.2012.03.033.