Front Neuroinform. 2014 Apr 8;8:31.
doi: 10.3389/fninf.2014.00031. eCollection 2014.

Machine learning patterns for neuroimaging-genetic studies in the cloud

Benoit Da Mota et al. Front Neuroinform. 2014.

Abstract

Brain imaging is a natural intermediate phenotype for understanding the link between genetic information and behavior or risk factors for brain pathologies. Massive efforts have been made in the last few years to acquire high-dimensional neuroimaging and genetic data on large cohorts of subjects. The statistical analysis of such data is carried out with increasingly sophisticated techniques and represents a great computational challenge. Fortunately, increasing computational power in distributed architectures can be harnessed, provided that new neuroinformatics infrastructures are designed and training in these new tools is offered. Combining a MapReduce framework (TomusBLOB) with machine learning algorithms (Scikit-learn library), we design a scalable analysis tool that can deal with non-parametric statistics on high-dimensional data. End-users describe the statistical procedure to perform and can then test the model on their own computers before running the very same code in the cloud at a larger scale. We illustrate the potential of our approach on real data with an experiment showing how the functional signal in subcortical brain regions can be significantly fit with genome-wide genotypes. This experiment demonstrates the scalability and the reliability of our framework in the cloud with a two-week deployment on hundreds of virtual machines.

Keywords: cloud computing; fMRI; heritability; machine learning; neuroimaging-genetic.

Figures

Figure 1
Top: Representation of the computational framework: given the data, a permutation index and a phenotype index, together with a configuration file, a set of computations is performed that involves two layers of cross-validation for setting the hyper-parameters and evaluating the accuracy of the model. This yields a statistical score associated with the given phenotype and permutation. Bottom: Example of a complex configuration file that describes this set of operations. General parameters (Lines 1–3): The model contains covariates, the permutation test runs 10,000 iterations, and each task performs only one permutation. Prediction score (Lines 4–7): The metric for the cross-validated prediction score is R², the cross-validation loop makes 10 iterations, 20% of the data are left out for the test set, and the seed of the random generator is set to 0. Estimator pipeline (Lines 8–13): The first step filters collinear vectors, the second step selects the K best features, and the final step is a ridge estimator. Parameter selection (Lines 14–16): Two parameters of the estimator have to be set: the K of SelectKBest and the alpha of the Ridge regression. A grid of 3 × 5 parameter combinations is evaluated.
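The two layers of cross-validation described in this caption can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' configuration file or code: the synthetic data, the grid values, the use of ShuffleSplit for both loops, and the inner-loop settings are all assumptions, and the collinear-vector filtering step is omitted.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 120 subjects, 200 features,
# with the target driven by the first five features plus noise.
rng = np.random.RandomState(0)
X = rng.randn(120, 200)
y = X[:, :5].sum(axis=1) + 0.5 * rng.randn(120)

# Estimator pipeline: K-best univariate selection, then a ridge estimator.
# The collinear-vector filter mentioned in the caption is left out here.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("ridge", Ridge()),
])

# A 3 x 5 grid over K and alpha (the values themselves are illustrative).
param_grid = {
    "select__k": [10, 50, 100],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0],
}

# Inner layer: chooses the hyper-parameters (settings are an assumption).
inner_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="r2", cv=inner_cv)

# Outer layer: 10 iterations, 20% of the data left out, seed 0, scored by R2.
outer_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
scores = cross_val_score(search, X, y, scoring="r2", cv=outer_cv)
mean_r2 = scores.mean()
print("mean cross-validated R2:", mean_r2)
```

In a permutation test, the same procedure would be repeated on shuffled targets to produce one score per phenotype and permutation.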
Figure 2
Overview of the multi-site deployment of a hierarchical Tomus-MapReduce compute engine. (1) The end-user uploads the data and configures the statistical inference procedure on a webpage. (2) The Splitter partitions the data and manages the workload. The compute engines retrieve job information through the Windows Azure Queues. (3) Compute engines perform the map and reduce jobs. The management deployment is informed of progress via the Windows Azure Queues system and can thus manage the execution of the global reducer. (4) The user downloads the results of the computation from the webpage of the experiment.
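The split, map, and reduce stages in this overview can be mimicked locally with a few lines of plain Python. The sketch below is only a single-process analogy: it does not use Windows Azure Queues or TomusBlobs, the function names (`split_tasks`, `map_task`, `reduce_scores`) are hypothetical, and the score is a seeded random placeholder rather than a real statistic.

```python
import random
from collections import defaultdict

def split_tasks(n_phenotypes, n_permutations, perms_per_task):
    """Splitter: partition the (phenotype, permutation) grid into map tasks."""
    tasks = []
    for pheno in range(n_phenotypes):
        for start in range(0, n_permutations, perms_per_task):
            stop = min(start + perms_per_task, n_permutations)
            tasks.append((pheno, range(start, stop)))
    return tasks

def map_task(task):
    """Mapper: emit one (phenotype, permutation, score) triple per permutation.

    The score is a placeholder drawn from a seeded generator so that the
    result is reproducible regardless of which worker runs the task.
    """
    pheno, perms = task
    out = []
    for perm in perms:
        rng = random.Random(pheno * 1_000_003 + perm)
        out.append((pheno, perm, rng.random()))
    return out

def reduce_scores(mapped):
    """Reducer: gather the scores per phenotype, ordered by permutation index."""
    by_pheno = defaultdict(dict)
    for chunk in mapped:
        for pheno, perm, score in chunk:
            by_pheno[pheno][perm] = score
    return {p: [s[i] for i in sorted(s)] for p, s in by_pheno.items()}

tasks = split_tasks(n_phenotypes=3, n_permutations=12, perms_per_task=5)
results = reduce_scores(map(map_task, tasks))
print(len(tasks), {p: len(v) for p, v in results.items()})
```

Because each task carries its own phenotype and permutation indices, the mappers are independent of one another, which is what lets the workload be spread over many machines.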
Figure 3
Configuration used for the experiment. (Lines 1–3): Covariates, 10,000 permutations and five permutations per computation unit (mapper). (Lines 4–7): 10-fold cross-validated R². (Lines 9–11): The first step of the pipeline is a univariate feature selection (K = 50,000). This step serves as a dimension reduction so that the next step fits in memory. (Lines 12–13): The second and last step is the ridge estimator with a low penalty (alpha = 0.0001).
Figure 4
Results of the real data analysis procedure. (Left) Predictive accuracy of the model measured by cross-validation in the 14 regions of interest (ROIs), and the associated statistical significance obtained in the permutation test. (Top right) Distribution of the cross-validated R² (CVR²) at chance level, obtained through a permutation procedure. The distribution of the maximum over all ROIs is used to obtain the family-wise error corrected significance of the test. (Bottom right) Outline of the chosen ROIs.
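The family-wise error correction mentioned in this caption, which takes the maximum over all ROIs for each permutation, can be sketched with NumPy. The scores below are random placeholders rather than the study's results, and the "+1" in the p-value estimate is a common convention assumed here, not something stated in the source.

```python
import numpy as np

rng = np.random.RandomState(0)
n_rois, n_perms = 14, 10000

# Placeholder chance-level scores (assumption): one row per permutation,
# one column per ROI.
perm_scores = rng.normal(loc=0.0, scale=0.02, size=(n_perms, n_rois))

# Placeholder observed scores, with one ROI set clearly above chance.
observed = rng.normal(loc=0.0, scale=0.02, size=n_rois)
observed[0] = 0.15

# Null distribution of the maximum score over all ROIs.
max_null = perm_scores.max(axis=1)

# Family-wise error corrected p-value per ROI: fraction of permutations
# whose maximum over ROIs reaches the observed score.
p_fwe = (1 + (max_null[:, None] >= observed[None, :]).sum(axis=0)) / (n_perms + 1)

# Uncorrected p-value per ROI, using only that ROI's own null scores.
p_unc = (1 + (perm_scores >= observed[None, :]).sum(axis=0)) / (n_perms + 1)

print("corrected:  ", np.round(p_fwe, 4))
print("uncorrected:", np.round(p_unc, 4))
```

Comparing each ROI against the distribution of the maximum makes the corrected p-values at least as large as the uncorrected ones, which is how the procedure controls the chance of any false positive across the 14 regions.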
