bioRxiv [Preprint]. 2024 Apr 2. doi: 10.1101/2023.08.22.554287

A mathematical theory of relational generalization in transitive inference


Samuel Lippl et al. bioRxiv.

Abstract

Humans and animals routinely infer relations between different items or events and generalize these relations to novel combinations of items. This allows them to respond appropriately to radically novel circumstances and is fundamental to advanced cognition. However, how learning systems (including the brain) can implement the necessary inductive biases has been unclear. Here we investigated transitive inference (TI), a classic relational task paradigm in which subjects must learn a relation (A > B and B > C) and generalize it to new combinations of items (A > C). Through mathematical analysis, we found that a broad range of biologically relevant learning models (e.g. gradient flow or ridge regression) perform TI successfully and recapitulate signature behavioral patterns long observed in living subjects. First, we found that models with item-wise additive representations automatically encode transitive relations. Second, for more general representations, a single scalar "conjunctivity factor" determines model behavior on TI and, further, the principle of norm minimization (a standard statistical inductive bias) enables models with fixed, partly conjunctive representations to generalize transitively. Finally, neural networks in the "rich regime," which enables representation learning and has been found to improve generalization, unexpectedly show poor generalization and anomalous behavior. We find that such networks implement a form of norm minimization (over hidden weights) that yields a local encoding mechanism lacking transitivity. Our findings show how minimal statistical learning principles give rise to a classical relational inductive bias (transitivity), explain empirically observed behaviors, and establish a formal approach to understanding the neural basis of relational abstraction.


Figures

Figure 1:
Transitive inference and behavioral patterns observed in subjects performing this task. a, Example stimuli taken from [31]. An example generalization (B > E) is highlighted. b, In a given trial, the subject is asked to choose between the two presented items. They are rewarded if the chosen item is “larger” according to the underlying hierarchy. This panel depicts a test trial in which the correct response is to pick the item on the right. c, Schematic of the training and test cases. d, Example accuracy on all training and test cases (in this case by rhesus macaques). The terminal item, symbolic distance, and asymmetry effects are apparent in the plot. In the subsequent figures, we leave off the item pair labels but use the same ordering. Symbolic distance (SD) is the separation in the rank hierarchy. The data are reproduced from [31]. e, Symbolic distance effect with and without a memorization effect (ME). Data without ME are reproduced from [31] and data with ME from [29]. Shaded regions in both panels indicate mean ± one standard error.
Figure 2:
An additive representation yields transitive generalization and the symbolic distance effect. a, Schematic illustration of an additive representation (rep.). The first three orange nodes represent the first item (X) and the latter three represent the second item (Y). b, Replacing the second item only results in changes to the highlighted half of the units. c, The readout weights of the model can be grouped into those pertaining to item X and those pertaining to item Y. The model’s output can be understood as assigning a “rank” r(X) and r(Y) to each item and then computing r(X)-r(Y). d, Example of a model’s rank assignment.
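The rank mechanism in panels c and d can be sketched in a few lines of NumPy. This is our own toy illustration, not the authors' code: with a concatenated one-hot (item-wise additive) representation, the minimum-norm least-squares readout trained only on adjacent pairs assigns each item a linear rank and generalizes transitively to held-out pairs.

```python
import numpy as np

n = 7  # seven items A..G, with A highest in the hierarchy

def phi(x, y):
    """Additive (item-wise) representation: one-hot code of each item, concatenated."""
    f = np.zeros(2 * n)
    f[x] = 1.0
    f[n + y] = 1.0
    return f

# Training set: adjacent pairs only, in both orders (+1 if the first item ranks higher).
X_train = np.array([phi(i, i + 1) for i in range(n - 1)]
                   + [phi(i + 1, i) for i in range(n - 1)])
y_train = np.array([+1.0] * (n - 1) + [-1.0] * (n - 1))

# np.linalg.lstsq returns the minimum-norm least-squares readout.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# The readout decomposes into per-item ranks: output = r(X) - r(Y).
ranks = w[:n]
out = phi(1, 4) @ w  # held-out pair B vs E (symbolic distance 3)
print(ranks.round(2), out.round(2))  # ranks: [3, 2, 1, 0, -1, -2, -3]; out: 3.0 (B > E, correct)
```

The positive output on the never-seen pair (B, E) is exactly the transitive generalization the caption describes; its magnitude also grows with symbolic distance.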
Figure 3:
Non-additive representations of TI cases. a, A non-additive representation encodes nonlinear interactions between items. b, A one-hot representation represents each combination of items by a distinct node. c, Representational similarity (cross-correlation) between a subset of trials in a ReLU network with 50,000 units. d, Representational similarity between all possible trials in networks with different numbers of hidden units, organized according to the type of trials. e, The conjunctivity factor characterizes a given representation according to how similarly it represents overlapping trials. Additive representations lie at one end of the spectrum, whereas one-hot representations lie at the other end.
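Panel c's measurement can be roughly reproduced with random features (our sketch, not the paper's code): in a randomly initialized ReLU network, the hidden representations of two trials that share an item are substantially more correlated than those of two disjoint trials.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden = 7, 50_000
W = rng.standard_normal((hidden, 2 * n)) / np.sqrt(2 * n)  # random input weights

def rep(x, y):
    """ReLU hidden representation of a trial (pair of one-hot-coded items)."""
    inp = np.zeros(2 * n)
    inp[x], inp[n + y] = 1.0, 1.0
    return np.maximum(W @ inp, 0.0)

def corr(u, v):
    u, v = u - u.mean(), v - v.mean()
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

overlap = corr(rep(0, 1), rep(0, 2))   # trials sharing Item 1
disjoint = corr(rep(0, 1), rep(2, 3))  # trials sharing no item
print(overlap > disjoint)  # → True
```

The gap between these two correlations is one way to read off where a representation sits on the additive-to-conjunctive spectrum that the conjunctivity factor summarizes.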
Figure 4:
Behavior of models whose readout weights are trained with norm minimization. a, Schematic illustration of the setup. b, Norm minimization implements a useful inductive bias for generalization to nearby data points. On categorization tasks, it determines the hyperplane separating the two categories with the maximal margin. c, Intuitively, a partly conjunctive representation is given by an item-wise representation of X and Y concatenated with a fully conjunctive representation of X and Y. The readout from the item-wise representations computes a rank for each item that transfers to overlapping pairs. The readout from the fully conjunctive representation memorizes a response to a given pair and does not transfer to overlapping pairs. Because norm minimization encourages distributed weights, it finds a solution that partly uses the item-wise representation and hence computes a rank. This leads to transitive generalization. d,e, The emergent rank representation at the end of training (Eq. 2) for (d) seven items and different values for α; and (e) α=0.1 and different numbers of items. f, For seven items and different values for α, the corresponding margin for all trials. Item pairs are arranged by their position in the hierarchy, as in Fig. 1e.
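The argument in panel c can be sketched concretely (our illustration; the value α = 0.5 and the square-root weighting of the two components are our own choices): even when a fully conjunctive component is available for memorization, the minimum-norm readout still recruits the item-wise component, so held-out pairs are ranked transitively.

```python
import numpy as np

n = 7
alpha = 0.5  # conjunctivity factor: weight on the fully conjunctive component

def phi(x, y):
    add = np.zeros(2 * n)            # item-wise (additive) component
    add[x], add[n + y] = 1.0, 1.0
    conj = np.zeros(n * n)           # fully conjunctive component: one node per pair
    conj[x * n + y] = 1.0
    return np.concatenate([np.sqrt(1 - alpha) * add, np.sqrt(alpha) * conj])

X_train = np.array([phi(i, i + 1) for i in range(n - 1)]
                   + [phi(i + 1, i) for i in range(n - 1)])
y_train = np.array([+1.0] * (n - 1) + [-1.0] * (n - 1))
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)  # minimum-norm readout

# On held-out pairs the conjunctive nodes are silent, so only the item-wise
# ranks contribute -- and they still order the items transitively.
margins = [phi(0, d) @ w for d in range(1, n)]  # A vs each lower item
print(np.round(margins, 3))
```

All margins come out positive (correct transitive choices), and among the held-out pairs the margin grows with symbolic distance, echoing panel f.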
Figure 5:
TI behavior for models with a fixed representation and readout weights trained using regularized regression. a, Illustration of the setup. b,c, The (b) effective conjunctivity factor and (c) memorization coefficient as a function of the inverse regularization coefficient c. d, Generalization behavior for α=0.1 and different values of c. The margins overall become larger as c increases.
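The regularization effect in panel d can be illustrated with closed-form ridge regression (a sketch under our own simplifications: an additive representation and two arbitrary values of c): weaker regularization, i.e. larger c, yields larger margins while preserving the transitive ordering.

```python
import numpy as np

n = 7

def phi(x, y):
    f = np.zeros(2 * n)  # item-wise additive representation, for simplicity
    f[x], f[n + y] = 1.0, 1.0
    return f

X = np.array([phi(i, i + 1) for i in range(n - 1)]
             + [phi(i + 1, i) for i in range(n - 1)])
y = np.array([+1.0] * (n - 1) + [-1.0] * (n - 1))

def ridge(c):
    """Closed-form ridge readout; c is the inverse regularization coefficient."""
    lam = 1.0 / c
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

m_weak = phi(1, 4) @ ridge(100.0)  # weak regularization (large c)
m_strong = phi(1, 4) @ ridge(0.1)  # strong regularization (small c)
print(m_weak, m_strong)  # both positive; the weakly regularized margin is larger
```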
Figure 6:
Behavior of models whose readout weights are trained through gradient flow, on transverse patterning. a, Learning a non-transitive relation requires a conjunctive component. b, Margin over time for different values of α. c, Halftime, i.e. the time until the model reaches a margin of 0.5, as a function of the conjunctivity factor α.
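The point of panel a can be checked directly in a toy least-squares setup (our sketch): a purely additive representation cannot even fit the cyclic transverse-patterning relation (A > B, B > C, C > A), whereas a conjunctive representation fits it exactly.

```python
import numpy as np

n = 3  # transverse patterning: A > B, B > C, C > A (a cyclic relation)
pairs = [(0, 1), (1, 2), (2, 0)]

def additive(x, y):
    f = np.zeros(2 * n)  # one-hot code of each item, concatenated
    f[x], f[n + y] = 1.0, 1.0
    return f

def conjunctive(x, y):
    f = np.zeros(n * n)  # one node per ordered pair of items
    f[x * n + y] = 1.0
    return f

X_add = np.array([additive(x, y) for x, y in pairs] + [additive(y, x) for x, y in pairs])
X_conj = np.array([conjunctive(x, y) for x, y in pairs] + [conjunctive(y, x) for x, y in pairs])
y = np.array([+1.0] * 3 + [-1.0] * 3)

def residual(X):
    """Squared training error of the best least-squares readout."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((X @ w - y) ** 2)

r_add, r_conj = residual(X_add), residual(X_conj)
print(r_add, r_conj)  # additive residual stays at 6.0; conjunctive residual is 0
```

The additive model's failure is structural: summing the three constraints r(A)-r(B) = r(B)-r(C) = r(C)-r(A) = 1 gives 0 = 3, so no rank assignment exists.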
Figure 7:
TI behavior in deep neural networks with modifiable representations. We considered twenty instances of a network with a ReLU nonlinearity and one hidden layer with 50,000 units, trained using mean squared error. Appendix S2.5 details other architectures, nonlinearities, and objective functions. Shaded regions indicate mean ± one standard deviation. (Note that some of the behaviors have sufficiently small variance that this shaded region may be invisible.) a, Illustration of network training. In contrast to the previous setups, the hidden layer weights were also trained using backpropagation. b, The mean squared error of the prediction made by the neural tangent kernel (NTK) at three different initialization scales (lazy: 1; rich: 10^-3; very rich: 10^-16). c,d, The (c) average test margin and (d) test accuracy as a function of initialization scale. The colored lines highlight the three representative values analyzed in more detail in panels b and e. e, TI performance according to our NTK-based prediction as well as of networks trained with backpropagation at the three representative scales.
Figure 8:
A mechanistic analysis of the rich regime’s inductive bias. All plots show the mean ± standard deviation across twenty random initializations. The standard deviation is too small to be visible. a, 2-norm of all weights in the fully trained networks as a function of initialization scale. The rich regime yields a much smaller norm. b, Inertia (i.e. proportion of explained variance) as a function of the number of clusters, for the lazy and rich network. Six clusters leave only 0.001% of the variance unexplained. c, The empirical network is therefore described by a network with six units, three with positive and three with negative responses. The right panels show the weights of the different units and the bottom panel shows how each unit responds to the different training trials. Only units with positive readout weights are shown; the units with negative readout weights have the same structure but with Items 1 and 2 reversed. d, Analogous depiction of the hand-constructed network, which has four units. e, The rank network only has two units, but they span a much wider range. f, Weight norm of the empirical, hand-constructed, and rank networks as a function of the number of items. The rank network has a much larger 2-norm, whereas the norm of the empirical network is similar to that of the hand-constructed network.


References

    1. Halford G. S., Wilson W. H. & Phillips S. Relational knowledge: the foundation of higher cognition. Trends in Cognitive Sciences 14, 497–505 (2010). https://www.sciencedirect.com/science/article/pii/S1364661310002020
    2. Cheney D. L., Seyfarth R. M. & Silk J. B. The responses of female baboons (Papio cynocephalus ursinus) to anomalous social interactions: Evidence for causal reasoning? Journal of Comparative Psychology 109, 134–141 (1995).
    3. Peake T. M., Terry A. M. R., McGregor P. K. & Dabelsteen T. Do great tits assess rivals by combining direct experience with information gathered by eavesdropping? Proceedings of the Royal Society of London. Series B: Biological Sciences 269, 1925–1929 (2002). doi: 10.1098/rspb.2002.2112
    4. Paz-y-Miño C. G., Bond A. B., Kamil A. C. & Balda R. P. Pinyon jays use transitive inference to predict social dominance. Nature 430, 778–781 (2004). https://www.nature.com/articles/nature02723
    5. Etienne A. S. & Jeffery K. J. Path integration in mammals. Hippocampus 14, 180–192 (2004). doi: 10.1002/hipo.10173
