BMC Med Genomics. 2022 Feb 10;15(1):26.
doi: 10.1186/s12920-022-01173-4.

Identification of recurrent genetic patterns from targeted sequencing panels with advanced data science: a case-study on sporadic and genetic neurodegenerative diseases

M Tarozzi et al. BMC Med Genomics.

Abstract

Background: Targeted Next Generation Sequencing is a common and powerful approach used in both clinical and research settings. At present, however, a large fraction of the acquired genetic information goes unused because pathogenicity cannot be assessed for most variants. The scenario is further complicated by increasingly frequent reports of poly/oligogenic patterns of inheritance, in which multiple variants contribute to increased disease risk. We present an approach in which the entire genetic information provided by targeted sequencing is transformed into binary data, on which we perform statistical, machine learning, and network analyses to extract all valuable information from the full genetic profile. To test this approach and explore the presence of recurrent genetic patterns without bias, we studied a cohort of 112 patients affected either by genetic Creutzfeldt-Jakob disease (CJD), caused by either of two PRNP mutations with different penetrance (p.E200K and p.V210I), or by sporadic Alzheimer disease (sAD).
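The transformation of genetic profiles into a numeric matrix can be sketched as follows. This is only an illustration under stated assumptions: the 0/1/2 coding (variant absent, heterozygous, homozygous), the function names, and the toy sample and variant identifiers are all hypothetical, since the abstract does not specify the encoding the authors used.

```python
from typing import Dict, List, Tuple

# Hypothetical coding (an assumption, not taken from the paper):
# 0 = variant absent, 1 = heterozygous, 2 = homozygous.
GENOTYPE_CODES = {"0/0": 0, "0/1": 1, "1/0": 1, "1/1": 2}


def encode_profiles(
    calls: Dict[str, Dict[str, str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Turn {sample: {variant: genotype}} into a samples x variants ternary matrix."""
    samples = sorted(calls)
    variants = sorted({v for per_sample in calls.values() for v in per_sample})
    matrix = [
        # A variant that was not called in a sample is treated as absent.
        [GENOTYPE_CODES[calls[s].get(v, "0/0")] for v in variants]
        for s in samples
    ]
    return samples, variants, matrix


def to_binary(matrix: List[List[int]]) -> List[List[int]]:
    """Collapse the ternary coding to presence (1) or absence (0) of each variant."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]


# Toy input with made-up identifiers, for illustration only.
toy_calls = {
    "patient_1": {"var_A": "0/1", "var_B": "1/1"},
    "patient_2": {"var_B": "0/1"},
    "patient_3": {"var_A": "1/1", "var_C": "0/1"},
}

samples, variants, ternary = encode_profiles(toy_calls)
binary = to_binary(ternary)
```

Here each row is a patient and each column a variant; transposing the matrix gives the variants × patients layout.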

Results: Unsupervised methods can identify functionally relevant sources of variation in the data, such as haplogroups and polymorphisms that do not follow Hardy-Weinberg equilibrium, for example NOTCH3 rs11670823 (c.3837+21T>A). Supervised classifiers can recognize clinical phenotypes with high accuracy based on the mutational profile of patients. In addition, we found a similar alteration of allele frequencies relative to the European population in sAD patients and in V210I-CJD, which is caused by a poorly penetrant PRNP mutation, suggesting shared oligogenic patterns in different types of dementia. Pathway enrichment and protein-protein interaction network analyses revealed different altered pathways between the two PRNP mutations.
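A deviation from Hardy-Weinberg equilibrium can be checked with a standard chi-square goodness-of-fit test on genotype counts. The sketch below is a generic version of that test, not necessarily the procedure the authors applied; the function name and the genotype counts are invented for illustration.

```python
import math
from typing import Tuple


def hwe_chi_square(n_ref_hom: int, n_het: int, n_alt_hom: int) -> Tuple[float, float]:
    """Chi-square test (1 degree of freedom) of observed genotype counts against HWE."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        raise ValueError("at least one genotype is required")
    # Allele frequencies estimated from the observed genotypes.
    p = (2 * n_ref_hom + n_het) / (2 * n)
    q = 1.0 - p
    # Genotype counts expected under Hardy-Weinberg proportions p^2, 2pq, q^2.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: the first set matches HWE exactly, the second lacks heterozygotes.
chi2_eq, p_eq = hwe_chi_square(25, 50, 25)
chi2_dev, p_dev = hwe_chi_square(40, 20, 40)
```

With small cohorts or rare alleles the chi-square approximation becomes unreliable, and an exact test is usually preferred.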

Conclusions: We propose this workflow as a possible approach to gain deeper insights into the genetic information derived from targeted sequencing, to identify recurrent genetic patterns, and to improve the understanding of complex diseases. This work could also represent a starting point for a predictive tool for personalized medicine and advanced diagnostic applications.

Keywords: Alzheimer’s Disease; CJD; Complex diseases; Gene panels; Genetic modifiers; Machine learning; NGS; Neurodegeneration; Polygenic score.

Conflict of interest statement

The authors have no competing interests to declare.

Figures

Fig. 1
2D plot of the Principal Component Analysis (PCA) computed on the 1046 × 112 ternary matrix. PCA is a dimensionality reduction technique that computes an orthogonal linear transformation of the data to a new coordinate system, here 2D, so that the greatest variance lies on the x-axis (PC1) and the second greatest on the y-axis (PC2). Each dot represents a patient, plotted in the 2D space according to the genetic profile expressed in the ternary matrix. PC1 and PC2 show the main sources of variance in our data, accounting for 22% of overall variance, which are represented by variants in the MAPT and NOTCH3 genes, respectively. The PCA plot and hierarchical clustering recognize clusters along the x-axis that correspond to the MAPT haplotypes, as shown by coloured labels in the picture legend. Similarly, the distribution along the y-axis matches haplotypes in the NOTCH3 gene (not shown).
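The projection described in this caption can be illustrated with a minimal, dependency-free PCA. This is a sketch only: it uses power iteration with deflation on the covariance matrix, whereas a real analysis would more likely rely on a library implementation, and the caption does not say which one the authors used. Rows are assumed to be patients and columns variants, and the toy matrix is invented.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def center_columns(rows: Matrix) -> Matrix:
    """Subtract each column mean so every variant has mean zero across patients."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix (variants x variants) of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def power_iteration(mat: Matrix, iterations: int = 200) -> Tuple[float, List[float]]:
    """Approximate the leading eigenvalue and eigenvector of a symmetric matrix."""
    d = len(mat)
    vec = [float(i + 1) for i in range(d)]  # deterministic, non-uniform start
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(mat[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # no variance left in the matrix
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        vec[i] * sum(mat[i][j] * vec[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, vec


def pca(rows: Matrix, n_components: int = 2) -> Tuple[Matrix, List[float]]:
    """Return per-sample scores and the explained-variance ratio of each component."""
    centered = center_columns(rows)
    cov = covariance(centered)
    d = len(cov)
    total_variance = sum(cov[i][i] for i in range(d))
    if total_variance == 0:
        raise ValueError("data has no variance")
    components, ratios = [], []
    for _ in range(n_components):
        eigenvalue, vec = power_iteration(cov)
        components.append(vec)
        ratios.append(eigenvalue / total_variance)
        # Deflation: remove the component just found before searching for the next.
        cov = [
            [cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
            for i in range(d)
        ]
    scores = [
        [sum(x * w for x, w in zip(row, comp)) for comp in components]
        for row in centered
    ]
    return scores, ratios


# Invented toy matrix: 4 patients x 3 variants, coded 0/1/2.
toy_matrix = [[0, 0, 0], [2, 2, 0], [0, 0, 1], [2, 2, 1]]
scores, ratios = pca(toy_matrix)
```

In the toy matrix the first two variants always co-occur, so PC1 captures their shared variation and separates the patients into two groups along the first axis, analogous to how linked variants in a haplotype can dominate a principal component.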
Fig. 2
Dataset classification by decision tree analysis: this supervised method computes a classification on the 1046 × 112 matrix based on the labels provided. The classifier correctly identifies the two disease groups on the basis of the two disease-causing mutations.
Fig. 3
Result of decision tree analysis on the dataset deprived of the information about gCJD-causing mutations. Classification is accomplished with 0.71 accuracy for sAD and 0.85 for gCJD, and is based on the eight reported variants harboured in six genes. Four are variants of uncertain significance not reported in the gnomAD database: APP c.*1A>C (rs748508166), GRN c.1179+100A>T, DCTN1 p.Lys519Glu, and PRKAR1B c.595+369T>C (rs1342588350). Two are rare variants (Minor Allele Frequency < 0.05) in the European population: APP p.Phe435= (rs148180403, MAF = 0.001) and DCTN1 p.Ala816= (rs1130484, MAF = 0.007). Two are common benign variants, in CHCHD10 (c.261+99A>G) and GSN (c.666+53T>C). “Value” indicates the number of samples at the given node that fall into each category. The “Gini” score quantifies the purity of the node or leaf: a score greater than zero implies that the samples within that node belong to different classes, while a score of zero means that the node contains samples of a single class only.
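The Gini score mentioned in this caption, and the way a decision tree uses it to choose a split, can be sketched in a few lines. The helper names and the toy genotype matrix are hypothetical; the code illustrates the general technique, not the authors' implementation.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(
    matrix: List[List[int]], labels: Sequence[str]
) -> Optional[Tuple[float, int, int]]:
    """Return (weighted Gini, column, threshold) of the purest single split."""
    best = None
    n = len(labels)
    for col in range(len(matrix[0])):
        # Candidate thresholds: every observed value except the largest.
        for threshold in sorted({row[col] for row in matrix})[:-1]:
            left = [lab for row, lab in zip(matrix, labels) if row[col] <= threshold]
            right = [lab for row, lab in zip(matrix, labels) if row[col] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, col, threshold)
    return best


# Invented toy data: 4 patients x 2 variants, with diagnosis labels.
toy_matrix = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_labels = ["gCJD", "gCJD", "sAD", "sAD"]

root_gini = gini(toy_labels)
split = best_split(toy_matrix, toy_labels)
```

The root node holds two samples of each class, so its Gini score is 0.5. Splitting on the first variant separates the classes completely, which gives both child nodes a Gini score of zero.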
Fig. 4
Result of functional enrichment analysis performed on genes harbouring variants whose allele frequency is significantly altered compared with the European population reported in the gnomAD database. Pathway analysis results are reported as pathways significantly enriched (p < 0.05) in the first group but not in the second of each paired comparison. Since some of the affected pathways are shared among the conditions considered, results are reported as differences between the two groups of each comparison. Complete results of the functional analysis with Gene Ontology and of the Protein–Protein Interaction networks are reported in Supplementary materials.
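One common way to test whether a pathway is enriched in a gene list is a one-sided hypergeometric (over-representation) test. The caption does not state which test or tool produced these results, so the sketch below is an assumption about the general technique, and all numbers in it are invented.

```python
from math import comb


def enrichment_p_value(
    population: int, pathway_size: int, selected: int, overlap: int
) -> float:
    """P(X >= overlap) when `selected` genes are drawn without replacement from
    `population` background genes, of which `pathway_size` belong to the pathway."""
    upper = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Invented example: 20 background genes, 5 in the pathway,
# 4 genes with altered allele frequencies, 3 of which fall in the pathway.
p_toy = enrichment_p_value(population=20, pathway_size=5, selected=4, overlap=3)
is_enriched = p_toy < 0.05
```

For a targeted panel, the background population would plausibly be the genes on the panel rather than the whole genome, because only panel genes can appear in the selected list.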
