Canonical circuit computations for computer vision

Daniel Schmid et al.

Review. Biol Cybern. 2023 Oct;117(4-5):299-329. doi: 10.1007/s00422-023-00966-9. Epub 2023 Jun 12.

Abstract

Advanced computer vision mechanisms have been inspired by neuroscientific findings. However, with the focus on improving benchmark achievements, technical solutions have been shaped by application and engineering constraints. This includes the training of neural networks, which has led to feature detectors optimally suited to the application domain. The limitations of such approaches motivate the need to identify computational principles, or motifs, in biological vision that can enable further foundational advances in machine vision. We propose to utilize structural and functional principles of neural systems that have been largely overlooked. They potentially provide new inspiration for computer vision mechanisms and models. Recurrent feedforward, lateral, and feedback interactions characterize general principles underlying processing in mammals. We derive a formal specification of core computational motifs that utilize these principles. These are combined to define model mechanisms for visual shape and motion processing. We demonstrate how such a framework can be adopted to run on neuromorphic, brain-inspired hardware platforms and can be extended to automatically adapt to environment statistics. We argue that the identified principles and their formalization inspire sophisticated computational mechanisms with improved explanatory scope. These and other elaborated, biologically inspired models can be employed to design computer vision solutions for different tasks, and they can be used to advance neural network architectures for learning.

Keywords: Binding; Feedback; Neural network; Neuromorphic computing; Perceptual grouping; Recurrent processing.


Conflict of interest statement

The authors have no financial or non-financial interests to disclose.

Figures

Fig. 1
Theories of the function of neural architectures and their bottom-up and top-down computation. Over the last decades, different theories about generic principles governing neural circuits and their computations have been proposed. Some of the most influential theories are depicted (see the individual text boxes for a summary). Common to all of them is an account of recurrent, feedforward-feedback information processing. The different theories emphasize different properties of recurrent network interaction and thus partially overlap in functionality, share similar principles, or complement each other. Sources for each theoretical framework are provided in the list of references.
Fig. 2
Convergent information streams of bottom-up and top-down information integrated at pyramidal cells in the cortex. a A cortical pyramidal neuron at level k of the visual hierarchy receives bottom-up driving input from lower layers of the hierarchy at its basal compartment b_k. In addition, the neuron may receive inputs at its apical compartment a_k via long-range connections from various sources. Each compartment integrates its input independently of the other and is potentially gated by inhibitory neurons. The details of the spatial arrangement and mutual interactions between interneurons and their influence on pyramidal cell compartments are omitted here; for more details, see, for example, Kirchberger et al. (2021). The excitatory inputs to the two compartments are nonlinearly combined to generate an output at the soma. For a detailed investigation of the coupling and de-coupling of pyramidal cell compartments, see Suzuki and Larkum (2020). b Interactions between basal inputs b and apical inputs a yield asymmetric response characteristics. The basal input alone is sufficient to drive the cell and generate a response. This response is nonlinearly amplified when apical input is present simultaneously. Apical input alone, however, is not sufficient to generate a response. The computational logic is summarized in the table.
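The asymmetric gating logic summarized in the table of Fig. 2b can be illustrated with a minimal numerical sketch. The snippet below is an editorial illustration only: the function name, the multiplicative form b*(1 + lam*a), and the gain value are assumptions, not the exact formulation given in the paper.

```python
import numpy as np

def pyramidal_response(b, a, lam=2.0):
    """Toy sketch of the asymmetric basal/apical interaction (cf. Fig. 2b).

    b   : basal (driving, feedforward) input
    a   : apical (contextual, feedback) input
    lam : assumed modulation gain (illustrative value)

    Driving input alone can generate a response; apical input alone cannot,
    but it multiplicatively amplifies an existing response.
    """
    b = np.maximum(b, 0.0)          # rectified driving input
    a = np.maximum(a, 0.0)          # rectified modulatory input
    return b * (1.0 + lam * a)      # no output without basal drive

# Reproduces the qualitative logic of the table in Fig. 2b:
print(pyramidal_response(0.0, 0.0))  # no input      -> 0.0
print(pyramidal_response(1.0, 0.0))  # basal only    -> 1.0
print(pyramidal_response(0.0, 1.0))  # apical only   -> 0.0
print(pyramidal_response(1.0, 1.0))  # both          -> 3.0 (amplified)
```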
Fig. 3
Circuit model architecture for a network node. a Three processing elements constitute a node that represents the computation of a cortical column at an abstract level. In the first stage (I), filter responses of input cells generate the driving input to the node (two ellipses symbolize the subfield components of an exemplary filter). These units are laterally coupled in a recurrent field of nodes over a spatial neighborhood (each of those nodes receiving its own filter input). The resulting responses define the driving input to the node. This activation is modulated by reentrant signals in the second stage (II). The table at the bottom characterizes the response modulation of the feeding input b by reentrant signals a. It implements a simplified mechanism of feedforward and feedback integration as shown in Fig. 2b and Eqs. 1 and 2. The third stage (III) performs a normalization of activation by a pool of neurons. b Model architecture in which the three stages are condensed into an E-I circuit. Filtered input feeds an E-node, which interacts laterally and is modulated by integrated contextual information. The pool is represented by the I-node. Input lines denote driving signals with excitatory and inhibitory influence (transfer functions are omitted). Each cell's excitatory activation is enhanced by modulatory FB signals. Spatially arranged cortical columns are shown as laterally connected E-nodes. Each E-cell may incorporate self-excitation (resembling E-E connectivity) and each I-cell self-inhibition (resembling I-I connectivity), shown as dotted lines (sketches adapted from Brosch and Neumann (2014a)).
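A compact reading of the three-stage node (filtering, modulatory re-entry, pool normalization) can be sketched as follows. This is a hypothetical illustration under stated assumptions; `node_update`, the modulation gain `lam`, the stabilizing constant `eps`, and the use of SciPy's convolve2d are editorial choices, not the paper's Eqs. 1 and 2.

```python
import numpy as np
from scipy.signal import convolve2d

def node_update(x, fb, kernel, pool_weights, lam=2.0, eps=0.01):
    """Illustrative sketch of the three-stage node in Fig. 3a.

    x            : 2D input (image or feature map)
    fb           : reentrant (feedback/contextual) signal, same shape as x
    kernel       : linear filter defining the driving input (stage I)
    pool_weights : weights of the normalization pool (stage III)
    """
    # Stage I: filter responses provide the driving (feeding) input b
    b = np.maximum(convolve2d(x, kernel, mode="same"), 0.0)

    # Stage II: reentrant signal a modulates b multiplicatively; a alone
    # cannot create activity (cf. the table at the bottom of Fig. 3a)
    modulated = b * (1.0 + lam * np.maximum(fb, 0.0))

    # Stage III: divisive normalization by the activity of a spatial pool
    pool = convolve2d(modulated, pool_weights, mode="same")
    return modulated / (eps + pool)

# Example call with illustrative inputs and a uniform normalization pool:
x = np.random.rand(32, 32)
fb = np.zeros_like(x)                      # no contextual signal
kernel = np.array([[-1.0, 0.0, 1.0]])      # simple oriented subfield filter
pool_weights = np.ones((5, 5)) / 25.0
y = node_update(x, fb, kernel, pool_weights)
```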
Fig. 4
Structural principles underlying feature binding mechanisms. Several mechanisms of feature detection and integration are realized, each based on a different structural principle of connectivity and information flow. Feature items are integrated along feedforward signal pathways with convergent information flow (left). Activities that are generated to build a representation in a specific layer communicate their feature-specific activations via lateral connections linking relatable features (middle). Activities of higher-level representations can be re-entered into the response calculations at earlier layers through top-down feedback (right). The cones of convergent bottom-up and divergent top-down information flow share properties with the selective tuning model of attention (Tsotsos et al. 1995). Here, we propose that the influence of the different signal flows is characterized by the different functions of driving and modulating signal interactions.
Fig. 5
Feature integration and disambiguation for base grouping. a A scene with 3D objects seen through circular apertures of different sizes (or scales; top). When the masking patches are removed (bottom), contextual information signifies the relatability of visual items and provides the basis for disambiguating the local feature characteristics. The circular receptive field shapes are shown to illustrate the position of the apertures in the top image (the photograph of the object scene is reproduced with permission from Peterhans and von der Heydt (1991)). b A moving shape is shown as an overlay of two temporal snapshots (contour denoted by a solid line at the first time point and a dashed line at the second time point). Top: One part is visible through a single aperture at the bottom of the image. The ambiguity of a single aperture view leads to the perceived normal flow orthogonal to the horizontal boundary (continuous to dashed line). Bottom: If a second aperture at the top right reveals local feature motion that can be combined with the aperture normal flow at the bottom, the shape appears to move coherently in a direction upward and to the right.
Fig. 6
Context-dependent perceptual binding for contour grouping. Neural responses in the early visual cortex depend on the context provided as input to neighboring neurons. a Different stimulus configurations are presented (top), while responses of a neuron in a monkey's early visual cortex are recorded (bottom). A target cell is excited by an optimally oriented bar (configuration 1) that is placed in the receptive field of that cell (RF, square region in the center). The response can be enhanced by additional flanking items placed collinearly outside the RF (configurations 2 and 3), indicating the presence of an extended contour configuration. However, if the central bar item is part of a random texture composed of bars, then the response of the target cell at the center is reduced (configuration 5). If the central cell is driven by an optimally oriented bar item, which is again supported by co-aligned flanking bars that form a fragmented continuous contour, then the suppressive effect of the random texture is compensated. The response can even exceed that of the initial oriented input. Configurations 6 to 8 indicate the presence of a perceptual boundary item amidst a cluttered scene (figure reproduced with permission from Kapadia et al. (1995)). b Neural model simulations reproducing the main effects of neural contour grouping (left) based on inputs that replicate the experimental conditions in (a). The neuron model incorporates computational principles of recurrent interactions by modulatory feedback and pool normalization over a space-feature neighborhood (see main text for details). The response of a target cell that is driven by a single bar item is taken as the reference level (anchored at 0). The suppressive effects of embedding the bar as part of a texture and the counter-balancing excitation effects of additional collinear boundary grouping are shown in the bar diagram. Black and gray bars denote two different model parametrizations (figures adapted and reproduced with permission from Neumann and Sepp (1999)).
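The qualitative pattern in (a) and (b) — facilitation by collinear flankers, suppression by a surrounding texture, and their counter-balance — follows from combining modulatory enhancement with divisive pool normalization. The toy calculation below uses purely illustrative numbers and a hypothetical helper; it is not the model parametrization of Neumann and Sepp (1999).

```python
# Toy account of the contour-grouping effects in Fig. 6, combining modulatory
# facilitation with divisive pool normalization (all values are illustrative).
def target_response(drive, collinear_support, texture_pool, lam=2.0, eps=0.01):
    modulated = drive * (1.0 + lam * collinear_support)  # collinear facilitation
    return modulated / (eps + texture_pool)              # suppression by the pool

reference      = target_response(1.0, 0.0, 1.0)  # single bar in the RF (~0.99)
with_flankers  = target_response(1.0, 1.5, 1.0)  # collinear flankers: enhanced
in_texture     = target_response(1.0, 0.0, 3.0)  # bar embedded in texture: suppressed
flankers_in_tx = target_response(1.0, 1.5, 3.0)  # flankers compensate, exceed reference
print(reference, with_flankers, in_texture, flankers_in_tx)
```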
Fig. 7
Texture boundary detection and segregation of figure from ground. Panels with texture patterns composed of oriented bars contain a rectangular figure that is oriented either horizontally or vertically. While the homogeneity of background and figure is defined by an orientation gradient of the bar items, separation of figure from ground can be established if texture boundaries can be detected from changes in orientation contrast. In a modelling study, a neural network that utilizes recurrent interactions by modulatory feedback and pool normalization over a space-feature neighborhood has been shown to be capable of such texture boundary detection and segregation (see main text for details). a Two example displays with different orientation changes within homogeneous regions (BN, background noise) and between homogeneous regions (OC, orientation contrast). OC and BN values determine the orientation changes between neighboring bars (in degrees). Dashed red rectangles indicate the outline of the figure shape (shown for illustration purposes only). b Response of the highest layer of the three-layer network (referred to as cortical area V4) for a stimulus defined by (OC=30, BN=0) (top) and visualization of the activity components used to build a metric for determining the model's segregation performance (bottom). c Model performance (bottom) across different OC values for a given value of BN=20 for the intact network and different model ablations (top, depicted by missing arrows between network layers). The intact model (full recurrent), as well as a variant with missing feedback (FB) from the second to the first layer (w/o V2→V1 FB), reach high performance values once the OC value surpasses the BN value. In contrast, ablating FB from the third to the second layer (w/o V4→V2 FB) or removing all FB connections (pure feedforward) merely results in enhanced activity of figure versus surround. This demonstrates the relevance of higher-level contextual feedback information for texture boundary detection and segregation in cases of noisy, cluttered input (figures reproduced with permission from Thielscher and Neumann (2003, 2005)).
Fig. 8
Motion binding and perceptual disambiguation for planar motions. The proposed neural architecture incorporates recurrent interactions by modulatory feedback and pool normalization over a space-feature neighborhood (see main text for details). The architecture successfully infers a wide range of planar motion patterns. a Translatory motion of elongated shapes is processed by neurons having only small receptive fields, such that only the motion component normal to the outline can be encoded (aperture problem; top (Adelson and Bergen 1985; Watson and Ahumada 1985)). Iterative feedforward-feedback interaction initiates spreading of unambiguous feature motion signals from line ends along shape outlines. This disambiguates local motion estimates, solving the aperture problem to build a representation of a coherently moving bar (shown is a sequence of four timesteps for a bar moving from bottom left to top right, Löhr et al. (2019); bottom). Brighter values encode a smaller average angular error of the motion direction estimate at each pixel. b The gradual binding and disambiguation also helps to build a representation of rotational object motion. A windmill stimulus with counter-clockwise rotation is represented by different velocities, with tangential speed increasing with the radial position on the arms of the windmill (top). The model framework encodes the motion for different directions (color wheel) and in different speed-sensitive channels (Löhr et al. 2019). Integrating neural responses from a population of neurons with similar direction but different speed selectivities (blue) leads to inference of the shape motion (red). The reference shows the true tangential velocities with their speed gradient as ground truth (green; bottom) (figures reprinted from Löhr et al. (2019) with permission).
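The spreading of unambiguous line-end motion along a contour, as described for panel (a), can be conveyed with a minimal one-dimensional sketch of confidence-weighted feedforward integration and feedback re-weighting. All names, parameter values, and the update rule below are illustrative assumptions; this is not the Löhr et al. (2019) model.

```python
import numpy as np

def disambiguate_motion(v_local, confidence, n_iter=100, lam=2.0):
    """Toy sketch of iterative feedforward-feedback motion disambiguation.

    v_local    : local velocity estimates along a contour, shape (n, 2)
    confidence : reliability of each estimate; line ends are unambiguous
    """
    v = v_local.astype(float)
    c = confidence.astype(float)
    anchors = c >= 0.99                      # unambiguous feature signals
    for _ in range(n_iter):
        # feedforward: confidence-weighted integration over a neighborhood
        v_sum = np.zeros_like(v)
        c_sum = np.zeros_like(c)
        for shift in (-1, 0, 1):
            v_sum += np.roll(v * c[:, None], shift, axis=0)
            c_sum += np.roll(c, shift, axis=0)
        v_int = v_sum / np.maximum(c_sum[:, None], 1e-6)
        # feedback: pull ambiguous estimates toward the integrated signal
        w = c_sum / (c_sum + lam)
        v_new = (1 - w[:, None]) * v + w[:, None] * v_int
        v[~anchors] = v_new[~anchors]
        c[~anchors] = np.maximum(c, w)[~anchors]
    return v

# Example: a bar translating up and to the right. Interior contour points
# only signal the normal flow (rightward); the line ends carry the true motion.
n = 9
v0 = np.tile([1.0, 0.0], (n, 1))      # ambiguous normal flow along the contour
v0[0] = v0[-1] = [1.0, 1.0]           # unambiguous motion at the line ends
c0 = np.full(n, 0.1)
c0[0] = c0[-1] = 1.0
print(disambiguate_motion(v0, c0)[n // 2])  # y-component grows toward 1
```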
Fig. 9
The role of feedback in motion detection and integration. Feedback is shown to play a vital role in neural motion detection and integration. We selectively lesioned feedback connections in a 3-layer model architecture (V1-MT-MSTl) that has been used to process the input shown in Fig. 8. The elimination of signal pathways severely degrades performance. a We consider the complete model, with all feedforward (FF), lateral, and feedback (FB) connections intact, as reference. The graph shows the effects of removing FB connections, namely removing FB to the second layer (No FB into MT), removing FB from the second to the first layer (No FB into V1), or removing both FB connections (No FB to V1 & MT). The different graphs show how these manipulations severely impair network performance through significant increases of the direction error (Löhr et al. 2019) (figure reproduced with permission). b In displays of the Curveball illusion, the component motion of a grating producing a horizontally traveling wave is overlaid by a Gaussian envelope that moves vertically. Together, this leads to a perceived diagonally moving object, with the orientation depending on the relative velocities of both components (left). The neural model described above predicts similar perceived motion trajectories: the representation at the higher stage (MT) showed a displaced estimated motion trajectory akin to the perceptual phenomenon, while the lower stage (V1) more closely followed the true envelope motion of the Curveball object (solid lines, right). If FB from MT to V1 was removed (dashed lines), the deviation of the motion prediction became large in MT, while the effect in V1 was minute (Schmid et al. 2019). c In a random-dot kinematogram (RDK), increasingly more dots of the overall display population change their motion from an initial direction assigned to all dots (i.e., left or right) until all dots move in the opposite direction. Observers watching such motion sequences tend to perceive the same motion direction until an overwhelming number of elements signals the opposite (inset arrows "a", "b"). Such a hysteresis effect is demonstrated in a two-layer motion model composed of areas V1 and MT (MSTl has been omitted here). The decision is based on the stronger of the two integrated responses for the opposite motion directions, integrated over the display panel. The model tracks the motion with a certain momentum before rapidly updating the belief state to detect the opposite motion direction (top). If MT→V1 FB is lesioned, the hysteresis effect is extinguished: only the relative percentage of motion contribution to either direction is signaled, as in a linear filter (bottom) (panels reprinted with permission from Bayerl and Neumann (2004)).
Fig. 10
Event-based motion detection and integration. The core mechanisms of the neural model architecture for motion detection (i) allow the processing of camera sensor input that generates address-event representations and (ii) make the model a candidate for implementation on an energy-efficient neuromorphic hardware platform (Brosch et al. 2015a). A simplified two-layer (V1, MT) model implementation has again been probed with different types of image motion. For rotatory planar motion, as in the windmill stimulus of Fig. 8, neurons code for and integrate motion information only sparsely, at pixels with a spatiotemporal intensity gradient in the input. Initial motion estimates in V1 (left) are propagated and integrated on a coarser spatial scale in MT to generate coherent motion estimates (center). These responses are then propagated back to V1 via feedback. In V1 they reduce spurious responses and generate a sparsified but coherent representation of spatiotemporal movements along the windmill arms, while the background remains silent (panels reproduced with permission from Brosch et al. (2015a)).
Fig. 11
Motion adaptation as a function of input scene statistics. a Model architecture with two layers of motion-sensitive cells (V1 and MT) and a final decision layer. The motion-sensitive model areas implement the canonical computations of feedforward filtering, feedback modulation, and pool normalization. The focus of the investigation was on the principle of response adaptation, utilizing dynamic synapses along the feedforward and feedback pathways, respectively (red circles). The strength of the connection weights adapted to the input statistics by synaptic vesicle depletion, which reduces their transmission efficacy over time. The temporal dynamics of this depletion are shown in the small graphs at the top and bottom. b The effect of adaptation to down-skewed (DSK) and up-skewed (USK) image sequences was assessed by probing the response to random dot kinematograms (RDKs). Psychophysical response curves show the ratio of upward responses of human participants and different model variants for different average motion directions θ. Models in which only feedforward synapses were adaptive (models 1 and 2) or in which feedforward and feedback synapses had the same time scale of adaptation (models 3 and 4) did not capture the experimental data well. A model that combined fast feedforward and slow feedback adaptation matched the experimental results best (model 5; figures adapted from Habtegiorgis et al. (2019), with permission).
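The dynamic-synapse mechanism (vesicle depletion reducing transmission efficacy under sustained drive) can be sketched with a standard rate-based depression model in the spirit of Abbott et al. (1997). The function below is an editorial illustration; the parameter values, and the choice of expressing the fast/slow distinction via the recovery time constant, are assumptions rather than the implementation of Habtegiorgis et al. (2019).

```python
import numpy as np

def depressing_synapse(rate, tau=0.5, u=0.2, dt=0.01):
    """Minimal sketch of a rate-based depressing synapse.

    rate : presynaptic firing rate over time (1D array)
    tau  : recovery time constant of the synaptic resource (s)
    u    : fraction of the resource used per unit presynaptic activity
    Returns the effective transmitted signal rate * resource.
    """
    r = 1.0                                   # available synaptic resource
    out = np.zeros_like(rate, dtype=float)
    for t, x in enumerate(rate):
        drdt = (1.0 - r) / tau - u * x * r    # recovery minus depletion
        r = np.clip(r + dt * drdt, 0.0, 1.0)
        out[t] = x * r                        # transmitted, adapted signal
    return out

# Sustained input adapts over time; one way to sketch fast feedforward versus
# slow feedback adaptation is to assign different recovery time constants.
stim = np.concatenate([np.zeros(100), np.full(300, 10.0)])
ff = depressing_synapse(stim, tau=0.2)   # faster adaptation dynamics
fb = depressing_synapse(stim, tau=2.0)   # slower adaptation dynamics
```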

References

    1. Abbott L, Varela J, Sen K, et al. Synaptic depression and cortical gain control. Science. 1997;275:220–224. doi: 10.1126/science.275.5297.221.
    2. Adelson E, Bergen J. Spatiotemporal energy models for the perception of motion. J Opt Soc Am A. 1985;2(2):284–299. doi: 10.1364/JOSAA.2.000284.
    3. Adelson E, Movshon J. Phenomenal coherence of moving visual patterns. Nature. 1982;300:523–525. doi: 10.1038/300523a0.
    4. Anderson P. More is different. Science. 1972;177:393–396. doi: 10.1126/science.177.4047.393.
    5. Anstis S. Imperceptible intersections: the chopstick illusion. In: Blake A, Troscianko T, editors. AI and the Eye. New Jersey: Wiley; 1990. pp. 105–117.
