Review

Proc Natl Acad Sci U S A. 2024 Jul 2;121(27):e2311805121. doi: 10.1073/pnas.2311805121. Epub 2024 Jun 24.

Representations and generalization in artificial and brain neural networks

Qianyi Li et al.

Abstract

Humans and animals excel at generalizing from limited data, a capability yet to be fully replicated in artificial intelligence. This perspective investigates generalization in biological and artificial deep neural networks (DNNs), in both in-distribution and out-of-distribution contexts. We introduce two hypotheses: First, the geometric properties of the neural manifolds associated with discrete cognitive entities, such as objects, words, and concepts, are powerful order parameters. They link the neural substrate to generalization capabilities and provide a unified methodology bridging gaps between neuroscience, machine learning, and cognitive science. We review recent progress in studying the geometry of neural manifolds, particularly in visual object recognition, and discuss theories connecting manifold dimension and radius to generalization capacity. Second, we suggest that the theory of learning in wide DNNs, especially in the thermodynamic limit, provides mechanistic insights into the learning processes that generate the desired neural representational geometries and generalization. This includes the role of weight norm regularization, network architecture, and hyperparameters. We explore recent advances in this theory and its ongoing challenges. We also discuss the dynamics of learning and its relevance to the issue of representational drift in the brain.

Keywords: deep neural networks; few-shot learning; neural manifolds; representational drift; visual cortex.


Conflict of interest statement

Competing interests statement: The authors declare no competing interest.

Figures

Fig. 1.
(A) Illustration of three layers in a visual hierarchy, where the population response of the first layer is mapped into the intermediate layer by F1 and into the last layer by F2 (Top) (10). The transformation of per-stimulus responses is associated with changes in the geometry of the object manifold, the collection of responses to stimuli of the same object (colored blue for a “dog” manifold and pink for a “cat” manifold). These changes in geometry may transform object manifolds that are not linearly separable (in the first and intermediate layers) into separable ones in the last layer (separating hyperplane, colored orange). (B) Changes in classification capacity α_C, manifold radius R_M, manifold dimension D_M, and classification margin κ across the layers of pre-trained DNNs (ResNets).
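The geometric quantities in panel B can be estimated directly from layer activations. Below is a minimal sketch, assuming the activations are available as NumPy arrays; it uses the participation ratio as a stand-in for the manifold dimension D_M, the total within-manifold variance for the radius R_M, and a near-hard-margin linear SVM for the margin κ. The paper's exact capacity and geometry definitions may differ.

```python
# Minimal sketch: estimate per-manifold geometry (radius, dimension) and a
# linear separation margin for two object manifolds from one layer's activations.
# Participation ratio, centroid-relative radius, and SVM margin are common
# stand-ins; the paper's exact measures may differ.
import numpy as np
from sklearn.svm import LinearSVC

def manifold_geometry(X):
    """X: (n_stimuli, n_neurons) responses to stimuli of one object."""
    centered = X - X.mean(axis=0)
    cov = centered.T @ centered / len(X)
    eig = np.clip(np.linalg.eigvalsh(cov), 0, None)
    dim = eig.sum() ** 2 / (eig ** 2).sum()      # participation ratio ~ D_M
    radius = np.sqrt(eig.sum())                  # within-manifold spread ~ R_M
    return radius, dim

def separation_margin(X_a, X_b):
    """Geometric margin of a max-margin linear readout separating two manifolds."""
    X = np.vstack([X_a, X_b])
    y = np.r_[np.ones(len(X_a)), -np.ones(len(X_b))]
    svm = LinearSVC(C=1e6, max_iter=100000).fit(X, y)   # hard-margin surrogate
    return 1.0 / np.linalg.norm(svm.coef_)

# Usage: dog, cat = one layer's activations for two objects, each (P, N).
# r_dog, d_dog = manifold_geometry(dog); kappa = separation_margin(dog, cat)
```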
Fig. 2.
(A and B) Examples of novel objects, here “coatis” (blue) and “numbats” (green), are presented to the ventral visual pathway (Top), modeled by a trained DNN (Bottom), eliciting a pattern of activity across IT-like neurons in the feature layer. We model concept learning as learning a linear readout w to classify these activity patterns. (C) Generalization accuracy is very high across pairs of novel objects from the ImageNet21k dataset when using a pre-trained DNN (orange), but poor when using a randomly initialized DNN (blue) or a linear classifier in the pixel space of input images (gray). (D) Few-shot learning improves along the ventral visual hierarchy from pixels to V1 to V4 to IT, due to orchestrated transformations of object manifold geometry. The layerwise behavior of a trained ResNet50 (blue), AlexNet (light blue), and an untrained ResNet50 (gray) is included for comparison. We align V1, V4, and IT to the most similar ResNet layer under the BrainScore metric (20) (see ref. for details).
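A minimal sketch of the m-shot prototype-learning readout described above, assuming feature-layer activations for two novel objects have already been extracted from a (pre-trained or random) DNN; function and variable names are illustrative, not the paper's code.

```python
# Minimal sketch of m-shot prototype learning on feature-layer activations.
# The prototype rule corresponds to a linear readout w = mu_a - mu_b with a
# bias placed at the midpoint between the two prototypes.
import numpy as np

def few_shot_error(feats_a, feats_b, m=5, n_trials=200, rng=None):
    """feats_*: (n_examples, n_features). Returns the mean generalization error."""
    rng = np.random.default_rng(rng)
    errs = []
    for _ in range(n_trials):
        ia = rng.permutation(len(feats_a))
        ib = rng.permutation(len(feats_b))
        proto_a = feats_a[ia[:m]].mean(axis=0)     # m training examples per class
        proto_b = feats_b[ib[:m]].mean(axis=0)
        w = proto_a - proto_b                      # linear readout direction
        b = -0.5 * (proto_a + proto_b) @ w         # midpoint bias
        test_a = feats_a[ia[m:]]
        test_b = feats_b[ib[m:]]
        err_a = np.mean(test_a @ w + b < 0)        # class-a test points misclassified
        err_b = np.mean(test_b @ w + b > 0)        # class-b test points misclassified
        errs.append(0.5 * (err_a + err_b))
    return float(np.mean(errs))
```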
Fig. 3.
(A) We compare the empirical generalization error in 1-, 2-, and 5-shot learning experiments to the prediction from our geometric theory (Eq. 3) on all pairs of objects from the ImageNet21k dataset, using object manifolds derived from a trained ResNet50. x-axis: SNR obtained by estimating neural manifold geometry. y-axis: empirical generalization error measured in few-shot learning experiments. The theoretical prediction (dashed line) shows a good match with experiments. (B) We provide additional examples of 5-shot prototype learning experiments in a ResNet50 (colored points), along with the prediction from our geometric theory (dashed line), on four randomly selected novel visual objects from the ImageNet21k dataset. Each panel plots the generalization error of one novel visual object (e.g., “Virginia bluebell”) against all 999 other novel visual objects; each point represents the average generalization error on one such pair of objects. x-axis: SNR (Eq. 3) obtained by estimating neural manifold geometry. y-axis: empirical generalization error measured in few-shot learning experiments. The theoretical prediction (dashed line) shows a good match with experiments. (C) In a pre-trained ResNet50 (blue), dimensionality expands dramatically in the early layers and contracts in the later layers, while in the primate visual pathway (black) dimensionality contracts from the V1-like layer to V4, then expands from V4 to IT. (D) Single-manifold eigenspectra in macaque V4 (black) and the corresponding layer of a pre-trained ResNet50 (blue).
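Eq. 3 itself is not reproduced on this page, so the sketch below only illustrates the kind of comparison plotted in panels A and B: it pairs the empirical few-shot error with a generic Gaussian error estimate, 0.5·erfc(SNR/√2), driven by a deliberately crude SNR. The paper's SNR accounts for manifold radius, dimension, and signal geometry more carefully than this stand-in.

```python
# Minimal sketch of comparing empirical few-shot error with an SNR-based
# prediction. The SNR below (class-mean separation over prototype noise along
# the signal direction) is a crude stand-in for the paper's geometric SNR.
import numpy as np
from scipy.special import erfc

def crude_snr(feats_a, feats_b, m=5):
    """feats_*: (n_examples, n_features); m: number of training examples per class."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    delta = mu_a - mu_b
    u = delta / np.linalg.norm(delta)                    # signal direction
    var_along = 0.5 * (feats_a @ u).var() + 0.5 * (feats_b @ u).var()
    # Prototype noise shrinks with the number of training examples m.
    return 0.5 * np.linalg.norm(delta) / np.sqrt(var_along / m)

def predicted_error(snr):
    """Generic Gaussian error estimate used here in place of the paper's Eq. 3."""
    return 0.5 * erfc(snr / np.sqrt(2.0))

# Usage: compare predicted_error(crude_snr(A, B, m=5)) against the empirical
# error from few_shot_error(A, B, m=5) defined in the sketch after Fig. 2.
```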
Fig. 4.
(A) Two stages of learning under Langevin dynamics with small T. σ_0 controls the width of the weight distribution at initialization, σ controls the size of the solution space, and T sets the sampling speed. (B) Example trajectories of the predictor from three different initializations. The dynamics is initially deterministic and starts to fluctuate as Θ_t drifts in the solution space after reaching zero training error.
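As a rough illustration of the two-stage behavior in panel A, here is a minimal sketch of discretized Langevin dynamics on a toy linear-regression loss: a gradient step, a Gaussian-prior (weight-decay) term whose strength is set by σ, and injected noise scaled by the temperature T. The discretization and parameterization are assumptions for illustration, not the paper's Eq. 14.

```python
# Minimal sketch of discretized Langevin dynamics: fast deterministic descent
# to the solution space, followed by slow diffusion within it when T is small.
import numpy as np

def langevin_train(X, y, T=1e-4, sigma=1.0, sigma0=1.0, lr=1e-3, steps=20000, seed=0):
    """X: (P, N) inputs, y: (P,) targets. Returns the weight trajectory."""
    rng = np.random.default_rng(seed)
    w = sigma0 * rng.standard_normal(X.shape[1])      # init width set by sigma0
    traj = []
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + w / sigma**2       # loss gradient + prior term
        noise = np.sqrt(2 * T * lr) * rng.standard_normal(w.shape)
        w = w - lr * grad + noise                     # Langevin update at temperature T
        traj.append(w.copy())
    return np.array(traj)
```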
Fig. 5.
(A) Schematics for the BPKR approach. A renormalization factor is introduced at each step during backward integration until all the network weights are averaged out. (B) Theory (black solid line) and simulation (blue points) of the generalization error ε_g = ⟨ε_g(x, y(x))⟩_{x, y(x)} (averaged over test inputs x and their labels y(x)) on binary MNIST classification in fully connected ReLU networks, for small (Top) and large (Bottom) σ. The approximate theory for ReLU networks agrees remarkably well with the numerics.
Fig. 6.
(A and B) NNGP and mean layer-wise kernels in classifying eight MNIST digits (SI Appendix, 3) (57). (C) Generalization error averaged across test examples for a finite-width random feature model (blue), an infinitely wide network following the NNGP theory (yellow), and the learned network following the BPKR theory (red, overlaying the NNGP theory). (D) The NNGP kernel and the mean layer-wise kernels of hidden layers l = 2, 4, for a 4-hidden-layer ReLU network trained on four MNIST digits grouped into two higher-order categories (even vs. odd). The values of the kernel are small since we take a relatively small σ (SI Appendix, 3).
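For reference, the NNGP kernel of a fully connected ReLU network can be computed with the standard arc-cosine recursion; a minimal sketch follows. The weight and bias variance conventions (σ_w, σ_b) are assumptions and may differ from the normalization used in the paper's figures.

```python
# Minimal sketch of the NNGP kernel for a fully connected ReLU network,
# using the standard arc-cosine (degree-1) recursion applied layer by layer.
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w=1.0, sigma_b=0.0):
    """X: (P, N) inputs. Returns the (P, P) NNGP kernel after `depth` hidden layers."""
    K = sigma_b**2 + sigma_w**2 * (X @ X.T) / X.shape[1]      # input-layer kernel
    for _ in range(depth):
        d = np.sqrt(np.diag(K))                               # per-input std under the GP
        cos_t = np.clip(K / np.outer(d, d), -1.0, 1.0)
        theta = np.arccos(cos_t)
        # ReLU expectation: E[phi(f) phi(f')] = (1/2pi) * |f||f'| * (sin t + (pi - t) cos t)
        K = sigma_b**2 + (sigma_w**2 / (2 * np.pi)) * np.outer(d, d) * (
            np.sin(theta) + (np.pi - theta) * cos_t)
    return K
```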
Fig. 7.
(A) Signal as a function of hidden layer depth l. For ReLU networks in the thermodynamic limit, the signal increases with layer depth (blue). For linear networks (yellow, purple) and ReLU networks in the infinite-width limit (red), the signal remains unchanged across l. Error bars are computed across all distinct pairs of manifolds/digits. (B) Dimension as a function of l. In the infinite-width limit, dimension remains constant with l in linear networks (purple) and increases with l in ReLU networks (red). In the thermodynamic limit, dimension decreases in linear networks (yellow) and is non-monotonic in ReLU networks (blue), similar to Fig. 3C. Error bars are computed across all manifolds/digits.
Fig. 8.
(A) Comparison of the generalization error dynamics between a network fully trained under Langevin dynamics (Eq. 14, shown in red) and a network in which a is frozen at a time t_0 in the diffusive learning stage while W_t drifts randomly afterward (shown in blue). τ denotes the difference between the current time t and t_0. (B) Both the temporal correlation of the top right singular vector (ρ_u(τ)) and the correlation between u(t) and Y (ρ_{u,Y}) remain close to 1, representing the constant alignment between the top right singular vector of the representation and the training labels. The temporal correlation of the top left singular vector (ρ_v(τ)) gradually decreases with the time difference τ, representing the random drift in the feature space.
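The diagnostics in panel B can be computed from saved hidden-layer representations by tracking the top singular vectors over time. A minimal sketch follows, assuming each representation H_t is stored as a (neurons × training samples) array, so that right singular vectors live in sample space (where the labels Y live) and left singular vectors live in neuron/feature space; this shape convention, and the use of |cosine| as the correlation, are assumptions.

```python
# Minimal sketch of the drift diagnostics in panel B: temporal correlation of
# the top left/right singular vectors, and alignment of the top right singular
# vector with the training labels Y.
import numpy as np

def top_singular_vectors(H):
    """H: (n_neurons, n_samples). Returns (top left, top right) singular vectors."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, 0], Vt[0]

def drift_diagnostics(H_list, Y):
    """H_list: representations saved at times t0, t0+tau, ...; Y: (P,) labels (e.g., +/-1)."""
    left0, right0 = top_singular_vectors(H_list[0])
    out = []
    for H in H_list:
        left, right = top_singular_vectors(H)
        rho_u = abs(right0 @ right)                  # rho_u(tau): sample-space (top right) correlation
        rho_v = abs(left0 @ left)                    # rho_v(tau): feature-space (top left) correlation
        rho_uY = abs(right @ Y) / np.linalg.norm(Y)  # alignment of top right vector with labels
        out.append((rho_u, rho_v, rho_uY))
    return np.array(out)
```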

References

    1. Zador A., et al., Catalyzing next-generation artificial intelligence through neuroAI. Nat. Commun. 14, 1597 (2023). - PMC - PubMed
    2. Sejnowski T. J., The unreasonable effectiveness of deep learning in artificial intelligence. Proc. Natl. Acad. Sci. U.S.A. 117, 30033–30038 (2020). - PMC - PubMed
    3. Tan C., et al., “A survey on deep transfer learning” in Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4–7, 2018, Proceedings, Part III 27, N. Lawrence, Eds. (Springer, 2018), pp. 270–279.
    4. Wang Y., Yao Q., Kwok J. T., Ni L. M., Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (CSUR) 53, 1–34 (2020).
    5. Bernardi S., et al., The geometry of abstraction in the hippocampus and prefrontal cortex. Cell 183, 954–967 (2020). - PMC - PubMed
