Sparse estimation automatically selects voxels relevant for the decoding of fMRI activity patterns

Okito Yamashita et al. Neuroimage. 2008 Oct 1;42(4):1414-29. doi: 10.1016/j.neuroimage.2008.05.050. Epub 2008 Jun 6.
Abstract

Recent studies have used pattern classification algorithms to predict or decode task parameters from individual fMRI activity patterns. For fMRI decoding, it is important to choose an appropriate set of voxels (or features) as inputs to the decoder, since the presence of many irrelevant voxels could lead to poor generalization performance, a problem known as overfitting. Although individual voxels could be chosen based on univariate statistics, the resulting set of voxels could be suboptimal if correlations among voxels carry important information. Here, we propose a novel linear classification algorithm, called sparse logistic regression (SLR), that automatically selects relevant voxels while estimating their weight parameters for classification. Using simulation data, we confirmed that SLR can automatically remove irrelevant voxels and thereby attain higher classification performance than other methods in the presence of many irrelevant voxels. SLR also proved effective with real fMRI data obtained from two visual experiments, successfully identifying voxels in corresponding locations of visual cortex. SLR-selected voxels often led to better performance than those selected based on univariate statistics, by exploiting correlated noise among voxels to allow for better pattern separation. We conclude that SLR provides a robust method for fMRI decoding and can also serve as a stand-alone tool for voxel selection.


Figures

Fig. 1
Two elements of sparse logistic regression (SLR): the (multinomial) logistic regression model (a) and automatic relevance determination (ARD) (b). (a) Each class or label has its own discriminant function, which computes the inner product of that label's weight parameter vector (θ) and an input feature vector (x). The softmax function transforms the outputs of the discriminant functions into the probability of observing each label, and the label with the maximum probability is chosen as the output. Binary logistic regression differs slightly from the multinomial case: the probability can be calculated by the logistic transformation of a single discriminant function that separates the two classes (corresponding to (θ1−θ2)tx). SLR uses this conventional (multinomial) logistic regression model, but estimates the weight parameters with a novel algorithm based on automatic relevance determination. (b) SLR treats the weight parameters as random variables with prior distributions. The prior of each parameter θi is assumed to be Gaussian with mean 0. The precision (inverse variance) of this Gaussian is treated as a hyper-parameter αi, called a relevance parameter, with a hyper-prior given by a gamma distribution. The relevance parameter controls the range of the corresponding weight parameter. If the relevance parameter is large, the prior probability is sharply peaked at zero (left panel), so the estimated weight parameter tends to be biased toward zero even after observation. If the relevance parameter is small, the prior is broadly distributed (right panel), so the estimated weight parameter can take a large value after observation. As the iterative algorithm computes the posterior distributions of the model, most relevance parameters diverge to infinity; the corresponding weight parameters thus effectively become zero and can be pruned from the model. This process of determining the relevance of parameters is called ARD. For details of the algorithm, see Appendix A.
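The model in panel (a) can be sketched in a few lines. This is a minimal illustration assuming nothing beyond the legend; the names `softmax` and `decode` are ours, not the paper's, and the weights are made up for the example:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decode(theta, x):
    """theta: (n_classes, n_features) weight vectors; x: feature vector.
    Each class has its own linear discriminant theta_c . x; softmax maps
    the discriminant outputs to label probabilities."""
    probs = softmax(theta @ x)
    return probs, int(np.argmax(probs))   # output = label with max probability

# Two-class case: softmax over two discriminants reduces to the logistic
# transform of the single discriminant (theta1 - theta2) . x.
theta = np.array([[ 1.0, 0.0],
                  [-1.0, 0.0]])
x = np.array([0.5, 2.0])
probs, label = decode(theta, x)
```

For two classes this reproduces the binary form mentioned in the legend: `probs[0]` equals the logistic function applied to `(theta[0] - theta[1]) @ x`.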
Fig. 2
SLR-based voxel ranking procedure. The whole data set is randomly divided into K pairs of training and test sets. For each pair, SLR is applied to the training set to learn the weight parameters, which results in a sparse set of selected parameters; classification performance is then evaluated on the test set. The score of each parameter (SC-value) is defined as the number of times it is selected across the K SLR estimations, weighted by the corresponding test performance (percent correct).
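The ranking procedure above can be sketched as follows. The sparse fitter here is a toy stand-in (mean-difference thresholding), not the paper's SLR; only the SC-value logic matches the legend: count how often each feature is selected across K random splits, weighted by each fit's test accuracy. The data, threshold, and K are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_sparse_and_score(X_tr, y_tr, X_te, y_te):
    # Stand-in for SLR: "select" features with a large class-mean difference,
    # then classify test samples by projecting onto that difference.
    diff = X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0)
    selected = np.abs(diff) > 0.5
    pred = (X_te[:, selected] @ diff[selected] > 0).astype(int)
    return selected, (pred == y_te).mean()

def sc_values(X, y, K=10):
    n, d = X.shape
    sc = np.zeros(d)
    for _ in range(K):                       # K random train/test splits
        idx = rng.permutation(n)
        tr, te = idx[: n // 2], idx[n // 2:]
        selected, acc = fit_sparse_and_score(X[tr], y[tr], X[te], y[te])
        sc += selected * acc                 # selection count weighted by accuracy
    return sc

# Toy data: feature 0 is informative, the remaining 4 are pure noise.
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 5))
X[:, 0] += np.where(y == 1, 1.5, -1.5)
sc = sc_values(X, y)
```

The informative feature accumulates a high SC-value because it is selected in (nearly) every split and each selection is weighted by a high test accuracy.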
Fig. 3
Evaluation of SLR using simulation data. Data samples for binary classification were randomly generated from two Gaussian distributions. Only the first 10 dimensions were informative, with graded mean differences. Note that because the problem is binary, the number of dimensions/features equals the number of weight parameters. (a) Binary classification performance as a function of the number of initial input dimensions/features. Means and standard errors over 200 Monte Carlo simulations are plotted. The solid, dotted and dashed lines indicate the results for SLR, regularized logistic regression (RLR) and the support vector machine (SVM), respectively. RLR uses the same logistic regression model as SLR but does not impose sparsity when estimating the weight parameters. (b) The average number of dimensions selected by SLR, plotted against the number of initial dimensions. (c) The normalized frequency with which each feature was selected by SLR over the 200 Monte Carlo repetitions, plotted against the mean differences of the first 10 features. The lines indicate the results for different numbers of initial dimensions.
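A sketch of this simulation setup, assuming details the legend leaves open (unit-variance Gaussians, the particular graded means, and the sample sizes are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n_per_class=100, n_features=500):
    """Two Gaussian classes; only the first 10 dimensions carry a
    (graded) mean difference, the rest are irrelevant noise."""
    mu = np.zeros(n_features)
    mu[:10] = np.linspace(1.0, 0.1, 10)   # graded mean differences, dims 1-10
    X0 = rng.normal(-mu / 2, 1.0, size=(n_per_class, n_features))
    X1 = rng.normal(+mu / 2, 1.0, size=(n_per_class, n_features))
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)].astype(int)
    return X, y

X, y = make_data()
diff = X[y == 1].mean(0) - X[y == 0].mean(0)   # empirical class-mean difference
```

Growing `n_features` while keeping only the first 10 dimensions informative is what produces the overfitting regime in panel (a): the irrelevant dimensions add noise that a non-sparse decoder must average over.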
Fig. 4
Decoding of four quadrant stimuli. The locations of voxels selected by SLR are shown on the anatomical image. Filled squares indicate selected voxels for each of the four quadrants as in the legend. Note that in the multinomial logistic regression model, each class (quadrant) has its own weight parameters (see Fig. 1). The color indicates the class to which the selected weight parameter belongs. The lighter region shows the occipital mask, from which an initial set of voxels was identified. Only a few voxels were selected for this task (six voxels in total for this subject), and the selected voxels for each quadrant were found in the vertically and horizontally flipped locations, consistent with the visual field mapping in the early visual cortex. Trial-averaged BOLD time courses (percent signal change relative to the rest) are plotted for each of the selected voxels. Time 0 corresponds to the stimulus onset. The color here indicates the stimulus condition (one of the four quadrants) as in the legend.
Fig. 5
Difference between SC-values and T-values. SC-values (solid line) and T-values (bars) are plotted for voxels sorted by the SC-values. These values were obtained for the classification of 0 vs. 135 degrees of orientation.
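The T-value used for the univariate ranking can be sketched as a per-voxel two-sample t statistic; the pooled, equal-variance form below is an assumption for illustration, as the legend does not specify the variant:

```python
import numpy as np

def t_values(X, y):
    """X: (n_trials, n_voxels); y: binary condition labels.
    Returns one two-sample t statistic per voxel (pooled variance)."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(0, ddof=1) +
           (nb - 1) * b.var(0, ddof=1)) / (na + nb - 2)   # pooled variance
    return (a.mean(0) - b.mean(0)) / np.sqrt(sp2 * (1 / na + 1 / nb))

# Tiny worked case: class 0 = {1, 3}, class 1 = {0, 0} for a single voxel.
X = np.array([[1.0], [3.0], [0.0], [0.0]])
y = np.array([0, 0, 1, 1])
t = t_values(X, y)   # pooled variance 1, mean difference 2, so t = 2
```

Unlike the SC-value, this statistic looks at each voxel in isolation, which is exactly why the two rankings can disagree (Fig. 5) when correlations among voxels matter.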
Fig. 6
Comparison of classification performance between the SC-value and the T-value rankings. The test performance for the classification of two orientations, chosen from eight orientations (0, 22.5, 45,… degrees), is plotted against the number of voxels. Voxels were sorted either by the SC-values or by the T-values, and those with highest ranks were used. The results of all orientation pairs were grouped by the orientation differences. Panels (a–d) summarize the results of 22.5 degree (8 pairs of orientations), 45 degree (8 pairs), 67.5 degree (8 pairs), and 90 degree (4 pairs) differences, respectively. Voxel ranking was computed for each pair of orientations. The blue and red lines indicate test performance for the SC-value ranking and the T-value ranking, respectively. The shaded areas represent the standard errors.
Fig. 7
Contribution of voxel correlation to classification. (a) The values of the top two voxels in the SC-value ranking (Fig. 5) are shown as a scatter plot with histograms. The red diamonds and blue crosses represent the 0 degree and 135 degree samples in the training data set, respectively. The gray line is the discriminant boundary estimated by logistic regression. The histograms show the distributions of the samples along the axes of the first and second voxels, and along the axis orthogonal to the discriminant boundary. The first voxel (x axis) is poorly discriminative on its own (as indicated by its low T-value in Fig. 5), while the second voxel (y axis) is more discriminative. When the two voxels are combined (the axis orthogonal to the discriminant boundary), the two class distributions become even more separable; note that the discriminant boundary provides better discrimination than the second voxel alone. The first voxel thus contributes to the discrimination through its correlation with the second voxel, even though it has a low T-value and is poorly discriminative by itself. (b) The values in the original data (a) were shuffled within each voxel and class, so that the correlation between voxels was removed from each class's distribution. The histograms of the two individual voxels are identical to those of the original data (a), but the discriminant boundary differs: the discrimination depends almost solely on the second voxel.
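The shuffling control in panel (b) can be sketched as follows. Within each class, each voxel's values are permuted independently across trials, which destroys between-voxel correlations while leaving every voxel's own histogram unchanged; the data below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_within_class(X, y):
    """Permute each column (voxel) independently within each class."""
    Xs = X.copy()
    for c in np.unique(y):
        rows = np.where(y == c)[0]
        for j in range(X.shape[1]):           # permute each voxel separately
            Xs[rows, j] = X[rng.permutation(rows), j]
    return Xs

# Two strongly correlated "voxels" within a single class:
y = np.zeros(400, dtype=int)
v1 = rng.normal(size=400)
X = np.column_stack([v1, v1 + 0.1 * rng.normal(size=400)])
Xs = shuffle_within_class(X, y)
```

After shuffling, the marginal distribution of each voxel is exactly preserved (the sorted values per column are identical), but the between-voxel correlation collapses toward zero, which is what isolates the contribution of correlated noise in Figs. 7 and 8.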
Fig. 8
Effect of shuffling on the performance of SC-ranked and T-ranked voxels. The same analysis as in Fig. 6 was performed with shuffled training data, and the difference in test performance between the original and shuffled training data was calculated (the shuffle measure). The average shuffle measure (over 300 shufflings) is plotted as a function of the number of voxels, for SC-ranked voxels (blue) and T-ranked voxels (red), for each of the four orientation differences.
