PLoS One. 2016 Aug 5;11(8):e0157077.

doi: 10.1371/journal.pone.0157077. eCollection 2016.

Predictive Big Data Analytics: A Study of Parkinson's Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations

Ivo D Dinov¹ ² ³, Ben Heavner⁴, Ming Tang¹, Gustavo Glusman⁴, Kyle Chard⁵, Mike Darcy⁶, Ravi Madduri⁵, Judy Pa², Cathie Spino³, Carl Kesselman⁶, Ian Foster⁵, Eric W Deutsch⁴, Nathan D Price⁴, John D Van Horn², Joseph Ames², Kristi Clark², Leroy Hood⁴, Benjamin M Hampstead⁷ ⁸, William Dauer³, Arthur W Toga²

Affiliations

¹ Statistics Online Computational Resource, School of Nursing, Michigan Institute for Data Science, University of Michigan, Ann Arbor, Michigan, United States of America.
² Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, California, United States of America.
³ Udall Center of Excellence for Parkinson's Disease Research, University of Michigan, Ann Arbor, Michigan, United States of America.
⁴ Institute for Systems Biology, Seattle, Washington, United States of America.
⁵ Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, Illinois, United States of America.
⁶ Information Sciences Institute, University of Southern California, Los Angeles, California, United States of America.
⁷ Department of Psychiatry and Michigan Alzheimer's Disease Center, University of Michigan, Ann Arbor, Michigan, United States of America.
⁸ Veterans Affairs Ann Arbor Healthcare System, Ann Arbor, Michigan, United States of America.

PMID: 27494614
PMCID: PMC4975403
DOI: 10.1371/journal.pone.0157077

Ivo D Dinov et al. PLoS One. 2016.

Abstract

Background: A unique archive of Big Data on Parkinson's Disease is collected, managed and disseminated by the Parkinson's Progression Markers Initiative (PPMI). The integration of such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of Parkinson's disease (PD) risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data.

Methods and findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning based classification methods indicated significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.
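The accuracy, sensitivity and specificity figures above come from n-fold cross-validation. A minimal Python sketch of that kind of evaluation follows. It is an illustration under stated assumptions, not the authors' protocol: the function names (`k_fold_indices`, `confusion_metrics`, `cross_validate`, `fit_threshold`) are hypothetical, labels are coded 1 for PD and 0 for control, and a one-feature threshold rule stands in for the real classifiers.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, sensitivity, specificity) for binary labels (1 = PD, an assumed coding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def fit_threshold(x_train: Sequence[float], y_train: Sequence[int]) -> Callable[[float], int]:
    """Toy stand-in classifier: cut one feature midway between the two class means."""
    pos = [x for x, t in zip(x_train, y_train) if t == 1]
    neg = [x for x, t in zip(x_train, y_train) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def cross_validate(X: Sequence[float], y: Sequence[int], fit, k: int = 5, seed: int = 0):
    """Train on k-1 folds, predict the held-out fold, and score the pooled predictions."""
    y_true, y_pred = [], []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_true.append(y[i])
            y_pred.append(predict(X[i]))
    return confusion_metrics(y_true, y_pred)


if __name__ == "__main__":
    # Synthetic, well-separated scores; not PPMI data.
    X = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
    y = [0] * 5 + [1] * 5
    print(cross_validate(X, y, fit_threshold, k=5))
```

Pooling the held-out predictions before scoring gives one estimate per metric; averaging per-fold metrics is an equally common choice.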

Conclusions: Model-free Big Data machine learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine learning based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
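Cohort rebalancing can take several forms, and the abstract does not say which one was used. The sketch below shows one simple option, random oversampling with replacement, purely as an illustration. The function name `oversample_minority` and the toy data are assumptions, not part of the published protocol.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def oversample_minority(X: Sequence, y: Sequence[int], seed: int = 0) -> Tuple[List, List[int]]:
    """Duplicate randomly chosen rows of each smaller class until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


if __name__ == "__main__":
    # Toy imbalanced cohort: 8 cases (label 1) and 2 controls (label 0).
    X = list(range(10))
    y = [1] * 8 + [0] * 2
    X_bal, y_bal = oversample_minority(X, y)
    print(Counter(y_bal))
```

When combined with cross-validation, rebalancing is normally applied to the training folds only, so that duplicated rows never appear in both the training and the held-out data.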

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Fig 1. Exponential growth model of PD prevalence with age.**

**Fig 2. Overview of the complete analytics protocol (from data handling, to pre-processing, modeling, classification and forecasting).**

Fig 3. Overview of the global shape analysis (GSA) pipeline workflow for automated extraction of 280 neuroimaging biomarkers for 28 regions within each brain hemisphere and five complementary shape morphometry metrics.
Insert images illustrate examples of the nested processing steps and the textual, visual and statistical output generated by the pipeline protocol.

**Fig 4. Histogram of missing rates of the 64 top-level UPDRS variables (median missing rate ∼ 0.5).**

**Fig 5. Schematic of iterative data splitting statistical n-fold cross-validation protocol.**

Fig 6. Fragments of the analytical representations and class distributions of the AdaBoost classifier (complete details are in Data B in S1 File, *AdaBoost Classifier model (based on RWeka)*) generated by the first step (predictive *analysis*).

**Fig 7. (Partial) Varplots illustrating some of the critical predictive data elements for AdaBoost classifier predicting HC vs. (PD+SWEDD).**

Fig 8. Frequency plot of data elements that appear as more reliable predictors of subject diagnosis, ranked by counts using rebalanced UPDRS data (see Table B in S1 File for the complete numerical results).

