Artif Intell. 2014 May;210:78-122. doi: 10.1016/j.artint.2014.02.004.

The Dropout Learning Algorithm

Pierre Baldi et al. Artif Intell. 2014 May.

Abstract

Dropout is a recently introduced algorithm for training neural networks by randomly dropping units during training to prevent their co-adaptation. A mathematical analysis of some of the static and dynamic properties of dropout is provided using Bernoulli gating variables, general enough to accommodate dropout on units or connections, and with variable rates. The framework allows a complete analysis of the ensemble averaging properties of dropout in linear networks, which is useful for understanding the non-linear case. The ensemble averaging properties of dropout in non-linear logistic networks result from three fundamental equations: (1) the approximation of the expectations of logistic functions by normalized geometric means, for which bounds and estimates are derived; (2) the algebraic equality between normalized geometric means of logistic functions and the logistic of the means, which mathematically characterizes logistic functions; and (3) the linearity of the means with respect to sums, as well as products of independent variables. The results are also extended to other classes of transfer functions, including rectified linear functions. Approximation errors tend to cancel each other and do not accumulate. Dropout can also be connected to stochastic neurons and used to predict firing rates, and to backpropagation by viewing the backward propagation as ensemble averaging in a dropout linear network. Moreover, the convergence properties of dropout can be understood in terms of stochastic gradient descent. Finally, for the regularization properties of dropout, the expectation of the dropout gradient is the gradient of the corresponding approximation ensemble, regularized by an adaptive weight decay term with a propensity for self-consistent variance minimization and sparse representations.
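The second of these equations, the equality between the normalized geometric mean of logistic outputs and the logistic of the mean of their inputs, can be checked numerically. The sketch below is ours, not code from the paper; helper names are illustrative:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def nwgm(values, weights):
    """Normalized weighted geometric mean: G / (G + G'), where G is the
    weighted geometric mean of the values and G' that of their complements."""
    g = math.prod(v ** w for v, w in zip(values, weights))
    g_comp = math.prod((1.0 - v) ** w for v, w in zip(values, weights))
    return g / (g + g_comp)

# The NWGM of logistic outputs equals the logistic of the weighted mean
# of the inputs (weights summing to 1).
inputs = [-1.3, 0.4, 2.0, 0.7]
weights = [0.1, 0.2, 0.3, 0.4]
lhs = nwgm([sigmoid(s) for s in inputs], weights)
rhs = sigmoid(sum(w * s for w, s in zip(weights, inputs)))
print(abs(lhs - rhs))  # agrees to machine precision
```

The identity holds exactly for logistic functions, which is why the deterministic scaled network computes the propagated NWGMs of the dropout ensemble.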

Keywords: backpropagation; ensemble; geometric mean; machine learning; neural networks; regularization; sparse representations; stochastic gradient descent; stochastic neurons; variance minimization.


Figures

Figure 1.1
Dropout training in a simple network. For each training example, feature detector units are dropped with probability 0.5. The weights are trained by backpropagation (BP) and shared with all the other examples.
Figure 1.2
Dropout prediction in a simple network. At prediction time, all the weights from the feature detectors to the output units are halved.
Figure 8.1
The curve associated with the approximate bound |E – NWGM| ≲ E(1 – E)|1 – 2E|/[1 – 2E(1 – E)] (Equation 87).
Figure 8.2
The curve associated with the approximate bound |E – NWGM| ≲ 2E(1 – E)|1 – 2E| (Equation 87).
Figure 9.1
Histogram of NWGM values for a random sample of 100 values O taken from: (1) the uniform distribution over [0,1] (upper left); (2) the uniform distribution over [0,0.5] (lower left); (3) the normal distribution with mean 0.5 and standard deviation 0.1 (upper right); and (4) the normal distribution with mean 0.25 and standard deviation 0.05 (lower right). All probability weights are equal to 1/100. Each sampling experiment is repeated 5,000 times to build the histogram.
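The sampling setup of Figure 9.1 can be reproduced in a few lines. This sketch performs one draw per distribution; the 5,000 repetitions and the histogramming are omitted:

```python
import numpy as np

def nwgm(o, p):
    """Normalized weighted geometric mean of values o in (0,1), weights p."""
    log_g = np.sum(p * np.log(o))          # log weighted geometric mean of o
    log_gc = np.sum(p * np.log(1.0 - o))   # same for the complements 1 - o
    return 1.0 / (1.0 + np.exp(log_gc - log_g))

rng = np.random.default_rng(0)
p = np.full(100, 1 / 100)                  # equal probability weights
samples = {
    "uniform [0,1]":   rng.uniform(0.0, 1.0, 100),
    "uniform [0,0.5]": rng.uniform(0.0, 0.5, 100),
    "N(0.5, 0.1)":     rng.normal(0.5, 0.1, 100),
    "N(0.25, 0.05)":   rng.normal(0.25, 0.05, 100),
}
for name, o in samples.items():
    o = np.clip(o, 1e-9, 1 - 1e-9)         # keep values inside (0,1)
    print(f"{name}: E = {o.mean():.4f}, NWGM = {nwgm(o, p):.4f}")
```

Computing the geometric means through logs avoids underflow when many values are small; for samples confined to (0, 0.5], the Ky Fan inequality guarantees NWGM ≤ E.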
Figure 9.2
Behavior of the Pearson correlation coefficient (left) and the covariance (right) between the empirical expectation E and the empirical NWGM as a function of the number of samples and sample distribution. For each number of samples, the sampling procedure is repeated 10,000 times to estimate the Pearson correlation and covariance. The distributions are the uniform distribution over [0,1], the uniform distribution over [0,0.5], the normal distribution with mean 0.5 and standard deviation 0.1, and the normal distribution with mean 0.25 and standard deviation 0.05.
Figure 9.3
Each row corresponds to a scatter plot for all the neurons in one of the four hidden layers of a deep classifier trained on the MNIST dataset (see text) after learning. Scatter plots are obtained by cumulating the results for 10 randomly chosen inputs. Dropout expectations are estimated using 10,000 dropout samples. The second-order approximation in the left column (blue dots) corresponds to |E – NWGM| ≈ V|1 – 2E|/(1 – 2V) (Equation 87). Bound 1 is the variance-dependent bound given by E(1 – E)|1 – 2E|/(1 – 2V) (Equation 87). Bound 2 is the variance-independent bound given by E(1 – E)|1 – 2E|/[1 – 2E(1 – E)] (Equation 87). In the right column, W represents the neuron activations in the deterministic ensemble network with the weights scaled appropriately, corresponding to the "propagated" NWGMs.
Figure 9.4
Similar to Figure 9.3, using the sharper but potentially more restricted second order approximation to the NWGM obtained by using a Taylor expansion around the mean (see Appendix B, Equation 202).
Figure 9.5
Similar to Figures 9.3 and 9.4. Approximation 1 corresponds to the second-order Taylor approximation around 0.5: |E – NWGM| ≈ V|1 – 2E|/(1 – 2V) (Equation 87). Approximation 2 is the sharper but more restrictive second-order Taylor approximation around E (see Appendix B, Equation 202). Histograms for the two approximations are interleaved in each figure of the right column.
Figure 9.6
Empirical distribution of NWGM – E is approximately Gaussian at each layer, both before and after training. This was performed with Monte Carlo simulations over dropout subnetworks with 10,000 samples for each of 10 fixed inputs. After training, the distribution is slightly asymmetric because the activation of the neurons is asymmetric. The distribution in layer one before training is particularly tight simply because the input to the network (MNIST data) is relatively sparse.
Figure 9.7
Empirical distribution of W – E is approximately Gaussian at each layer, both before and after training. This was performed with Monte Carlo simulations over dropout subnetworks with 10,000 samples for each of 10 fixed inputs. After training, the distribution is slightly asymmetric because the activation of the neurons is asymmetric. The distribution in layer one before training is particularly tight simply because the input to the network (MNIST data) is relatively sparse.
Figure 9.8
Approximation of E(O_i^l O_i^l) by W_i^l and by W_i^l W_i^l, corresponding respectively to the estimates W_i^l(1 – W_i^l) and 0 for the variance, for neurons in an MNIST classifier network before and after training. Histograms are obtained by taking all non-input neurons and aggregating the results over 10 random input vectors.
Figure 9.9
Histogram of the difference between the dropout variance of O_i^l and its approximate upper bound W_i^l(1 – W_i^l) in an MNIST classifier network before and after training. Histograms are obtained by taking all non-input neurons and aggregating the results over 10 random input vectors. Note that at the beginning of learning, with random small weights, E(O_i^l) ≈ W_i^l ≈ 0.5, and thus Var(O_i^l) ≈ 0 whereas W_i^l(1 – W_i^l) ≈ 0.25.
Figure 9.10
Temporal evolution of the dropout variance V(O) during training averaged over all hidden units.
Figure 9.11
Temporal evolution of the difference W(1 – W ) – V during training averaged over all hidden units.
Figure 9.12
Approximation of E(O_i^l O_j^h) by W_i^l W_j^h for pairs of non-input neurons that are not directly connected to each other in an MNIST classifier network, before and after training. Histograms are obtained by taking 100,000 pairs of unconnected neurons, uniformly at random, and aggregating the results over 10 random input vectors.
Figure 9.13
Comparison of E(O_i^l O_j^h) to 0 for pairs of non-input neurons that are not directly connected to each other in an MNIST classifier network, before and after training. As shown in the previous figure, W_i^l W_j^h provides a better approximation. Histograms are obtained by taking 100,000 pairs of unconnected neurons, uniformly at random, and aggregating the results over 10 random input vectors.
Figure 9.14
Approximation of E(O_i^l O_j^h) by W_i^l W_{ij}^{lh} and by W_i^l W_j^h for pairs of connected non-input neurons, with a directed connection from j to i, in an MNIST classifier network, before and after training. Histograms are obtained by taking 100,000 pairs of connected neurons, uniformly at random, and aggregating the results over 10 random input vectors.
Figure 9.15
Histogram of the difference between E(σ′(S)) and σ′(E(S)) for all non-input neurons in an MNIST classifier network, before and after training. Histograms are obtained by taking all non-input neurons and aggregating the results over 10 random input vectors. The nodes in the first hidden layer have 784 sparse inputs, while the nodes in the upper three hidden layers have 1200 non-sparse inputs. The distribution of the initial weights is also slightly different for the first hidden layer. The differences between the first hidden layer and all the other hidden layers are responsible for the initial bimodal distribution.
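The per-neuron quantity in this histogram can be estimated by Monte Carlo. In the sketch below, a hypothetical Gaussian sample of pre-activations S stands in for the dropout samples of a real network:

```python
import numpy as np

def sigma_prime(s):
    o = 1.0 / (1.0 + np.exp(-s))
    return o * (1.0 - o)                  # derivative of the logistic

rng = np.random.default_rng(0)
S = rng.normal(0.8, 0.6, size=10_000)     # hypothetical dropout pre-activations
lhs = sigma_prime(S).mean()               # E(sigma'(S))
rhs = sigma_prime(S.mean())               # sigma'(E(S))
print(lhs - rhs)                          # small when S is concentrated
```

The difference shrinks as the variance of S shrinks, since swapping the expectation inside σ′ is a second-order effect in that variance.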
Figure 10.1
A spiking neuron formally operates in three steps by computing first a linear sum S, then a probability O = σ(S), then a stochastic output Δ of size r with probability O (and 0 otherwise).
Figure 10.2
Three closely related networks. The first network operates stochastically and consists of spiking neurons: a neuron sends a spike of size r with probability O. The second network operates stochastically and consists of logistic dropout neurons: a neuron sends an activation O with a dropout probability r. The connection weights in the first and second networks are identical. The third network operates in a deterministic way and consists of logistic neurons. Its weights are equal to the weights of the second network multiplied by the corresponding probability r.
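The three-step spiking neuron of Figure 10.1 can be sketched directly. This is a minimal sketch; the weights, inputs, and spike size are illustrative choices of ours:

```python
import math
import random

def spiking_neuron(weights, inputs, r, rng):
    """Three steps: a linear sum S, a firing probability O = sigma(S),
    then a stochastic spike of size r with probability O (0 otherwise)."""
    S = sum(w * x for w, x in zip(weights, inputs))
    O = 1.0 / (1.0 + math.exp(-S))
    return r if rng.random() < O else 0.0

rng = random.Random(0)
w, x, r = [0.5, -0.2], [1.0, 2.0], 1.0
# Averaging many trials recovers the firing rate r * sigma(S), the
# quantity that dropout can be used to predict.
avg = sum(spiking_neuron(w, x, r, rng) for _ in range(50_000)) / 50_000
print(avg)  # close to sigma(0.1)
```

With r = 1 this stochastic unit has the same expectation as the deterministic logistic neuron, which is the correspondence drawn in Figure 10.2.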
Figure 11.1
Empirical distribution of final neuron activations in each layer of the trained MNIST classifier, demonstrating sparsity. The empirical distributions are combined over 1000 different input examples.
Figure 11.2
The three phases of learning. For a particular input, a typical active neuron (red) starts out with low dropout variance, experiences an increase in variance during learning, and eventually settles to a steady, consistent value. A typical inactive neuron (blue) quickly learns to stay silent; its dropout variance grows only minimally from the low initial value. Curves correspond to the mean activation with 5th and 95th percentiles, for a single fixed input and 1000 dropout Monte Carlo simulations.
Figure 11.3
Consistency of active neurons does not noticeably decline in the upper layers. 'Active' neurons are defined as those with activation greater than 0.1 at the end of training; there were at least 100 active neurons in each layer. For these neurons, 1000 dropout simulations were performed at each time step of 100 training epochs. The plot shows the mean dropout standard deviation, with 5th and 95th percentiles, computed over all the active neurons in each layer. Note that the standard deviation does not increase for the higher layers.
