PLoS Comput Biol. 2019 Aug 30;15(8):e1007332.
doi: 10.1371/journal.pcbi.1007332. eCollection 2019 Aug.

Dark-matter matters: Discriminating subtle blood cancers using the darkest DNA

Laxmi Parida et al. PLoS Comput Biol.

Abstract

The confluence of deep sequencing and powerful machine learning is providing an unprecedented peek at the darkest of the dark genomic matter, the non-coding genomic regions lacking any functional annotation. While deep sequencing uncovers rare tumor variants, the heterogeneity of the disease confounds the best of machine learning (ML) algorithms. Here we set out to determine whether the dark-matter of the genome encompasses signals that can distinguish the fine subtypes of disease that are otherwise genomically indistinguishable. We introduce a novel stochastic regularization, ReVeaL, that empowers ML to discriminate subtle cancer subtypes even from the same 'cell of origin'. Analogous to heritability, which is implicitly defined on the whole genome, we use predictability (F1 score), which is definable on portions of the genome. In an effort to distinguish cancer subtypes using dark-matter DNA, we applied ReVeaL to a new WGS dataset from 727 patient samples with seven forms of hematological cancers and assessed the predictivity over several genomic regions including genic, non-dark, non-coding, non-genic, and dark. ReVeaL enabled improved discrimination of cancer subtypes for all segments of the genome. The non-genic, non-coding and dark-matter regions had the highest F1 scores, with dark-matter having the highest level of predictability. Based on ReVeaL's predictability of different genomic regions, dark-matter contains enough signal to significantly discriminate fine subtypes of disease. Hence, the agglomeration of rare variants, even in the hitherto unannotated and ill-understood regions of the genome, may play a substantial role in the disease etiology and deserves much more attention.
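The abstract uses the F1 score as its measure of predictability. As a minimal sketch of how such a score can be computed for a multi-class subtype prediction, the Python below derives per-class F1 and an unweighted (macro) average. The abstract does not state which averaging scheme the authors used, so macro averaging is an assumption here, and the labels are made-up toy data, not values from the study.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the truth but was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy subtype labels, purely illustrative.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]

class_scores = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(class_scores, overall)
```

Because F1 depends only on predicted and true labels, it can be evaluated on a model trained on any subset of the genome, which is what makes it definable per region.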

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Partitions of genomic regions based on Ensembl annotation and their predictability of blood cancer WGS data.
(A) The partition of the genomic region. (B-G) Median F1 values for the respective regions and their permuted controls when using off-the-shelf ML (B and E), ReVeaL on the original genomic areas (C and F) and ReVeaL on genomic areas normalized by length (D and G). See S2 Table for the F1 values.
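The caption compares F1 values for each region against permuted controls. The following is a rough sketch of that idea, not the authors' pipeline: labels are shuffled, the classifier is re-evaluated, and the median F1 under shuffling is compared with the F1 on the true labels. The leave-one-out nearest-centroid classifier, the single scalar feature, the number of permutations and the toy data are all stand-ins chosen for illustration.

```python
import random
from statistics import mean, median


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    total = 0.0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(labels)


def loo_nearest_centroid(x, y):
    """Leave-one-out predictions from a 1-D nearest-centroid classifier.

    A deliberately simple stand-in for the ML models used in the paper.
    """
    preds = []
    for i in range(len(x)):
        centroids = {}
        for c in sorted(set(y)):
            vals = [x[j] for j in range(len(x)) if j != i and y[j] == c]
            if vals:
                centroids[c] = mean(vals)
        preds.append(min(centroids, key=lambda c: abs(x[i] - centroids[c])))
    return preds


def permuted_control(x, y, n_permutations=200, seed=0):
    """Return (F1 on true labels, median F1 over label permutations)."""
    rng = random.Random(seed)
    observed = macro_f1(y, loo_nearest_centroid(x, y))
    null_scores = []
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)  # break any link between feature and label
        null_scores.append(macro_f1(shuffled, loo_nearest_centroid(x, shuffled)))
    return observed, median(null_scores)


# Toy feature values for two well-separated hypothetical subtypes.
x = [0.1, 0.2, 0.3, 0.15, 2.0, 2.1, 1.9, 2.2]
y = ["A"] * 4 + ["B"] * 4

observed, null_median = permuted_control(x, y)
print(observed, null_median)
```

A region carries usable signal in this sense when the score on the true labels clearly exceeds the score obtained once the labels are permuted.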
Fig 2. Disease-by-disease ReVeaL Analysis.
(A) F1 scores for genomic sectors for each disease are averaged over all 10 replicate analyses per chromosome and the maximum F1 score is reported for that disease. ReVeaL scores on disease-label permutations are shown in overlaid hatched bars. The gray bar represents the mean over all diseases. (B-C) Boxplot of fg, shingle values representing the four moments of the distribution, of samples per disease and diseases ordered by decreasing median fg for the top 2 ReVeaL features. The line above each boxplot represents the shingle, the yellow interval representing the portion of the segment that is masked. (D-G) t-SNE visualization (perplexity = 40, iterations = 300) using the top 50 shingle fg values (B and C) and mutational load lg, number of mutations for a given window in the genomic region for a given patient, (D and E), respectively, in exonic and dark sectors.
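The caption defines the mutational load lg as the number of mutations in a given window of a genomic region for a given patient. A minimal sketch of that count follows; the function name, the half-open coordinate convention, the window size and the positions are assumptions for illustration, since the caption does not specify them.

```python
def mutational_load(positions, region_start, region_end, window_size):
    """Count mutations per fixed-size window in [region_start, region_end).

    positions: mutation coordinates for one patient (hypothetical input).
    Returns one count per window, in genomic order.
    """
    span = region_end - region_start
    n_windows = -(-span // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if region_start <= pos < region_end:  # ignore mutations outside the region
            counts[(pos - region_start) // window_size] += 1
    return counts


# Toy mutation positions for a single patient.
positions = [105, 110, 250, 399, 400, 990, 1200]
loads = mutational_load(positions, region_start=100, region_end=1000, window_size=300)
print(loads)  # [4, 1, 1]
```

Each patient then contributes one vector of window counts per region, which is the kind of input a t-SNE projection such as the one in the figure would take.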
