Active Vision in Binocular Depth Estimation: A Top-Down Perspective

Matteo Priorelli et al. Biomimetics (Basel). 2023 Sep 21;8(5):445. doi: 10.3390/biomimetics8050445.

Abstract

Depth estimation is an ill-posed problem: objects of different shapes or sizes at different distances may project to the same image on the retina. Our brain uses several cues for depth estimation, including monocular cues such as motion parallax and binocular cues such as diplopia. However, it remains unclear how the computations required for depth estimation are implemented in biologically plausible ways. State-of-the-art approaches to depth estimation based on deep neural networks implicitly describe the brain as a hierarchical feature detector. Instead, in this paper we propose an alternative approach that casts depth estimation as a problem of active inference. We show that depth can be inferred by inverting a hierarchical generative model that simultaneously predicts the eyes' projections from a 2D belief over an object. Model inversion consists of a series of biologically plausible homogeneous transformations based on Predictive Coding principles. Under the plausible assumption of a nonuniform foveal resolution, depth estimation favors an active vision strategy that fixates the object with the eyes, rendering the depth belief more accurate. This strategy is not realized by first fixating on a target and then estimating the depth; instead, it combines the two processes through action-perception cycles, with a mechanism similar to that of saccades during object recognition. The proposed approach requires only local (top-down and bottom-up) message passing, which can be implemented in biologically plausible neural circuits.
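The core loop can be sketched in a few lines (a hypothetical toy, not the authors' implementation): a scalar belief over depth generates predicted projections for both eyes, and the belief descends the gradient of the resulting prediction error. All names and values below are illustrative assumptions.

    import numpy as np

    # Two pinhole eyes on the x axis, separated by a baseline b and
    # looking along z with focal length f; the target sits on the
    # midline at depth true_depth. (Illustrative toy values.)
    f, b = 1.0, 1.0
    true_depth = 2.0

    def project(depth):
        """Horizontal image coordinate of the target in each eye."""
        return np.array([f * (b / 2) / depth, f * (-b / 2) / depth])

    obs = project(true_depth)      # visual observation s_v
    mu_d = 1.0                     # initial belief over depth
    lr = 5.0
    for _ in range(200):
        pred = project(mu_d)       # top-down prediction
        eps = obs - pred           # prediction error
        dpred = -pred / mu_d       # d(prediction)/d(depth)
        mu_d += lr * eps @ dpred   # gradient descent on the error
    print(mu_d)                    # ~2.0, the true depth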

Keywords: action-perception cycles; active inference; active vision; depth perception; predictive coding.


Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Figures

Figure 1
Information processing in neural networks (left) and Predictive Coding (right). In a neural network, the visual observation sv travels through the cortical hierarchy in a bottom-up way, detecting increasingly complex features x(i,j) and eventually estimating the depth of an object d. The descending projections are considered here as feedback signals that convey backpropagation errors. In contrast, in a Predictive Coding Network the depth d is a high-level belief generating a visual prediction that is compared with the observation. This process leads to a cascade of prediction errors ε(i,j) associated with each intermediate prediction x(i,j) that are minimized throughout the hierarchy, eventually inferring the correct belief (for details, see Section 2).
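A rough rendering of the right-hand scheme, under simplified linear assumptions (the weights W1, W2 and the sizes are invented for illustration): each level holds a belief, sends a top-down prediction, and is updated only by its local bottom-up and top-down prediction errors.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))       # level-1 generative weights
    W2 = rng.normal(size=(3, 2))       # level-2 generative weights
    s_v = rng.normal(size=4)           # visual observation

    mu1, mu2 = np.zeros(3), np.zeros(2)
    lr = 0.02
    for _ in range(2000):
        eps1 = s_v - W1 @ mu1          # error at the lowest level
        eps2 = mu1 - W2 @ mu2          # error between levels 1 and 2
        # Each belief balances the error it explains from below
        # against the error it produces toward its own prior.
        mu1 += lr * (W1.T @ eps1 - eps2)
        mu2 += lr * W2.T @ eps2
    print(np.linalg.norm(eps1), np.linalg.norm(eps2))  # jointly minimized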
Figure 2
(A) An example of a portion of a kinematic plant. (B) Factor graph of a single level j of the hierarchical kinematic model, composed of intrinsic μi(j) and extrinsic μe(j) beliefs. These beliefs generate proprioceptive and visual predictions pp(j) and pv(j) through generative models gp and gv, respectively. Furthermore, the beliefs predict trajectories (here, only the velocities μi′(j) and μe′(j)) through the dynamics functions fi(j) and fe(j). Note that the extrinsic belief of level j−1 acts as a prior for level j through a kinematic generative model ge. See [21] for more details.
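A minimal planar reading of this hierarchy (an invented toy; see [21] for the actual model): each level's extrinsic belief is generated from the parent's extrinsic belief and the local intrinsic belief through the kinematic generative model ge.

    import numpy as np

    def g_e(parent, joint_angle, link_length):
        """Parent extrinsic pose (x, y, phi) plus a local intrinsic
        joint angle gives the child's extrinsic pose."""
        x, y, phi = parent
        phi_new = phi + joint_angle
        return np.array([x + link_length * np.cos(phi_new),
                         y + link_length * np.sin(phi_new),
                         phi_new])

    mu_i = [0.3, -0.5, 0.2]            # intrinsic beliefs (joint angles)
    lengths = [1.0, 0.8, 0.5]          # assumed link lengths
    mu_e = np.zeros(3)                 # root reference frame
    for angle, length in zip(mu_i, lengths):
        mu_e = g_e(mu_e, angle, length)  # level j-1 is a prior for level j
    print(mu_e)                        # deepest extrinsic belief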
Figure 3
Projection of a 3D point in the camera plane (only two dimensions are shown). The y coordinates of the real point ry and projected point py are related by the ratio between the focal length f and the real point depth rz.
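The similar-triangles relation in the caption reduces to one line (standard pinhole geometry, not code from the paper):

    def project_point(r_y: float, r_z: float, f: float = 1.0) -> float:
        """Pinhole projection: p_y / f = r_y / r_z, so p_y = f * r_y / r_z."""
        return f * r_y / r_z

    assert project_point(r_y=2.0, r_z=4.0, f=1.0) == 0.5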
Figure 4
Representation of the hierarchical relationships of a generalized model with homogeneous transformations. The belief over a reference frame μr of level i is passed to a function gt encoding a homogeneous transformation, along with a belief over a particular transform μt (e.g., the angle of a rotation or the length of a translation), generating the reference frame of level i+1.
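In two dimensions, this composition can be written with 3x3 homogeneous matrices (an illustrative sketch; here the belief μt is taken to be a rotation angle followed by a translation length):

    import numpy as np

    def g_t(frame, angle, length):
        """Rotate by `angle`, then translate by `length` along the
        rotated x axis (one homogeneous transformation)."""
        c, s = np.cos(angle), np.sin(angle)
        T = np.array([[c, -s, length * c],
                      [s,  c, length * s],
                      [0., 0., 1.]])
        return frame @ T

    frame = np.eye(3)                           # reference frame of level i
    frame = g_t(frame, angle=0.4, length=1.0)   # level i+1
    frame = g_t(frame, angle=-0.2, length=0.8)  # level i+2
    print(frame[:2, 2])                         # position of the deepest frame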
Figure 5
Neural-level implementation of a hierarchical generative model to estimate the depth of a point through Active Inference. The small squares indicate inhibitory connections. Unlike in a neural network, depth is estimated by first generating two predictions pr(i) of the point relative to each eye from a point in absolute coordinates μa and the vergence-accommodation angles μθ. Each new belief is in turn used to compute a projection pc(i) and finally a visual prediction pv(i). The predictions are then compared with the visual observations, generating prediction errors throughout the hierarchy and eventually driving the beliefs at the top toward the correct values. Note that eye movements are directly triggered to suppress the proprioceptive prediction error εp. Intentional eye movements (e.g., for target fixation) can instead be achieved by setting a prior in the dynamics function fc of the belief over the projected point μc (for better readability, the figure only shows the dynamics function fc).
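One consequence of this architecture is that once both eyes fixate the target, depth is fixed by the vergence geometry alone. As a hedged aside (standard triangulation with an assumed interocular baseline b and eye angles measured from straight ahead; not the paper's code):

    import numpy as np

    def depth_from_vergence(theta_l, theta_r, b=0.065):
        """Depth of a fixated midline target from the two eye angles
        (radians); tan(theta) = (b / 2) / depth for each eye."""
        return (b / 2) / np.tan((abs(theta_l) + abs(theta_r)) / 2)

    print(depth_from_vergence(np.deg2rad(1.86), np.deg2rad(1.86)))  # ~1 m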
Figure 6
Sequence of time frames of a depth estimation task with simultaneous target fixation. The agent uses alternating action–perception phases to avoid becoming stuck during the minimization process. Each frame is composed of three images: a third-person perspective of the overall task (top) and a first-person perspective consisting of the projections of the target onto the respective camera planes of each eye (bottom left and bottom right). In the top panel, the eyes are represented by blue circles, and the real and estimated target positions are shown in red and orange, respectively. The fixation trajectory (when vergence occurs) is represented in cyan, and the thin blue lines are the fixation angles of the eyes. In the bottom panels, the real and estimated target positions are again shown in red and orange; the abscissa and ordinate represent the target depth and its projection, respectively.
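The alternation described here can be caricatured as follows (a hypothetical scheduling sketch with a single eye in 1D, not the authors' code): perception refines the belief while the eye is still, then action rotates the eye toward the inferred target, and the cycle repeats.

    # One eye in 1D: the target sits at angle true_theta, and the
    # retinal observation is the target angle relative to the eye.
    true_theta = 0.6
    mu, eye = 0.0, 0.0                     # belief and eye angle
    for t in range(200):
        obs = true_theta - eye             # projection on the retina
        if (t // 20) % 2 == 0:             # perception phase
            mu += 0.1 * ((obs + eye) - mu) # infer the target angle
        else:                              # action phase
            eye += 0.1 * (mu - eye)        # verge toward the belief
    print(mu, eye)                         # both approach true_theta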
Figure 7
Simulation results. Performance of the depth estimation task with nonuniform (top) and uniform (bottom) foveal resolution during inference with the eyes parallel and fixed (infer parallel), inference with the eyes fixating on the target (infer vergence), and simultaneous inference and target fixation (active vision). The accuracy (left panel) measures the number of trials in which the agent successfully predicts the 2D position of the target, the mean error (middle panel) measures the distance between the real and estimated target positions at the end of every trial, and the time (right panel) measures the number of steps needed to correctly estimate the target.


References

    1. Qian N. Binocular disparity and the perception of depth. Neuron. 1997;18:359–368. doi: 10.1016/S0896-6273(00)81238-6. - DOI - PubMed
    2. Parker A.J. Binocular depth perception and the cerebral cortex. Nat. Rev. Neurosci. 2007;8:379–391. doi: 10.1038/nrn2131. - DOI - PubMed
    3. Durand J.B., Nelissen K., Joly O., Wardak C., Todd J.T., Norman J.F., Janssen P., Vanduffel W., Orban G.A. Anterior Regions of Monkey Parietal Cortex Process Visual 3D Shape. Neuron. 2007;55:493–505. doi: 10.1016/j.neuron.2007.06.040. - DOI - PMC - PubMed
    4. Welchman A.E., Deubelius A., Conrad V., Bülthoff H.H., Kourtzi Z. 3D shape perception from combined depth cues in human visual cortex. Nat. Neurosci. 2005;8:820–827. doi: 10.1038/nn1461. - DOI - PubMed
    5. Wismeijer D.A., Van Ee R., Erkelens C.J. Depth cues, rather than perceived depth, govern vergence. Exp. Brain Res. 2008;184:61–70. doi: 10.1007/s00221-007-1081-2. - DOI - PMC - PubMed
