Active Vision in Binocular Depth Estimation: A Top-Down Perspective

Matteo Priorelli et al. Biomimetics (Basel). 2023 Sep 21;8(5):445. doi: 10.3390/biomimetics8050445.

Abstract

Depth estimation is an ill-posed problem: objects of different shapes or sizes at different distances may project to the same image on the retina. Our brain uses several cues for depth estimation, including monocular cues such as motion parallax and binocular cues such as diplopia. However, it remains unclear how the computations required for depth estimation are implemented in biologically plausible ways. State-of-the-art approaches to depth estimation based on deep neural networks implicitly describe the brain as a hierarchical feature detector. Instead, in this paper we propose an alternative approach that casts depth estimation as a problem of active inference. We show that depth can be inferred by inverting a hierarchical generative model that simultaneously predicts the eyes' projections from a 2D belief over an object. Model inversion consists of a series of biologically plausible homogeneous transformations based on Predictive Coding principles. Under the plausible assumption of a nonuniform foveal resolution, depth estimation favors an active vision strategy that fixates the object with the eyes, rendering the depth belief more accurate. This strategy is not realized by first fixating on a target and then estimating the depth; instead, it combines the two processes through action-perception cycles, with a mechanism similar to that of saccades during object recognition. The proposed approach requires only local (top-down and bottom-up) message passing, which can be implemented in biologically plausible neural circuits.
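The core loop can be sketched in a few lines (a hypothetical toy, not the authors' implementation): a scalar belief over depth generates predicted projections for both eyes, and the belief descends the gradient of the resulting prediction error. All names and values below are illustrative assumptions.

    import numpy as np

    # Two pinhole eyes on the x axis, separated by a baseline b and
    # looking along z with focal length f; the target sits on the
    # midline at depth true_depth. (Illustrative toy values.)
    f, b = 1.0, 1.0
    true_depth = 2.0

    def project(depth):
        """Horizontal image coordinate of the target in each eye."""
        return np.array([f * (b / 2) / depth, f * (-b / 2) / depth])

    obs = project(true_depth)      # visual observation s_v
    mu_d = 1.0                     # initial belief over depth
    lr = 5.0
    for _ in range(200):
        pred = project(mu_d)       # top-down prediction
        eps = obs - pred           # prediction error
        dpred = -pred / mu_d       # d(prediction)/d(depth)
        mu_d += lr * eps @ dpred   # gradient descent on the error
    print(mu_d)                    # ~2.0, the true depth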

Keywords: action-perception cycles; active inference; active vision; depth perception; predictive coding.


Conflict of interest statement

The authors declare no conflict of interest. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Figures

Figure 1
Information processing in neural networks (left) and Predictive Coding (right). In a neural network, the visual observation sv travels through the cortical hierarchy in a bottom-up way, detecting increasingly complex features x(i,j) and eventually estimating the depth of an object d. The descending projections are considered here as feedback signals that convey backpropagation errors. In contrast, in a Predictive Coding Network the depth d is a high-level belief generating a visual prediction that is compared with the observation. This process leads to a cascade of prediction errors ε(i,j) associated with each intermediate prediction x(i,j) that are minimized throughout the hierarchy, eventually inferring the correct belief (for details, see Section 2).
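A rough rendering of the right-hand scheme, under simplified linear assumptions (the weights W1, W2 and the sizes are invented for illustration): each level holds a belief, sends a top-down prediction, and is updated only by its local bottom-up and top-down prediction errors.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))       # level-1 generative weights
    W2 = rng.normal(size=(3, 2))       # level-2 generative weights
    s_v = rng.normal(size=4)           # visual observation

    mu1, mu2 = np.zeros(3), np.zeros(2)
    lr = 0.02
    for _ in range(2000):
        eps1 = s_v - W1 @ mu1          # error at the lowest level
        eps2 = mu1 - W2 @ mu2          # error between levels 1 and 2
        # Each belief balances the error it explains from below
        # against the error it produces toward its own prior.
        mu1 += lr * (W1.T @ eps1 - eps2)
        mu2 += lr * W2.T @ eps2
    print(np.linalg.norm(eps1), np.linalg.norm(eps2))  # jointly minimized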
Figure 2
(A) An example of a portion of a kinematic plant. (B) Factor graph of a single level j of the hierarchical kinematic model, composed of intrinsic μi(j) and extrinsic μe(j) beliefs. These beliefs generate proprioceptive and visual predictions pp(j) and pv(j) through generative models gp and gv, respectively. Furthermore, the beliefs predict trajectories (here, only the velocities μi′(j) and μe′(j)) through the dynamics functions fi(j) and fe(j). Note that the extrinsic belief of level j−1 acts as a prior for level j through a kinematic generative model ge. See [21] for more details.
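A minimal planar reading of this hierarchy (an invented toy; see [21] for the actual model): each level's extrinsic belief is generated from the parent's extrinsic belief and the local intrinsic belief through the kinematic generative model ge.

    import numpy as np

    def g_e(parent, joint_angle, link_length):
        """Parent extrinsic pose (x, y, phi) plus a local intrinsic
        joint angle gives the child's extrinsic pose."""
        x, y, phi = parent
        phi_new = phi + joint_angle
        return np.array([x + link_length * np.cos(phi_new),
                         y + link_length * np.sin(phi_new),
                         phi_new])

    mu_i = [0.3, -0.5, 0.2]            # intrinsic beliefs (joint angles)
    lengths = [1.0, 0.8, 0.5]          # assumed link lengths
    mu_e = np.zeros(3)                 # root reference frame
    for angle, length in zip(mu_i, lengths):
        mu_e = g_e(mu_e, angle, length)  # level j-1 is a prior for level j
    print(mu_e)                        # deepest extrinsic belief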
Figure 3
Projection of a 3D point in the camera plane (only two dimensions are shown). The y coordinates of the real point ry and projected point py are related by the ratio between the focal length f and the real point depth rz.
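The similar-triangles relation in the caption reduces to one line (standard pinhole geometry, not code from the paper):

    def project_point(r_y: float, r_z: float, f: float = 1.0) -> float:
        """Pinhole projection: p_y / f = r_y / r_z, so p_y = f * r_y / r_z."""
        return f * r_y / r_z

    assert project_point(r_y=2.0, r_z=4.0, f=1.0) == 0.5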
Figure 4
Representation of the hierarchical relationships of a generalized model with homogeneous transformations. The belief over a reference frame μr of level i is passed to a function gt encoding a homogeneous transformation, along with a belief over a particular transform μt (e.g., the angle of a rotation or the length of a translation), generating the reference frame of level i+1.
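In two dimensions, this composition can be written with 3x3 homogeneous matrices (an illustrative sketch; here the belief μt is taken to be a rotation angle followed by a translation length):

    import numpy as np

    def g_t(frame, angle, length):
        """Rotate by `angle`, then translate by `length` along the
        rotated x axis (one homogeneous transformation)."""
        c, s = np.cos(angle), np.sin(angle)
        T = np.array([[c, -s, length * c],
                      [s,  c, length * s],
                      [0., 0., 1.]])
        return frame @ T

    frame = np.eye(3)                           # reference frame of level i
    frame = g_t(frame, angle=0.4, length=1.0)   # level i+1
    frame = g_t(frame, angle=-0.2, length=0.8)  # level i+2
    print(frame[:2, 2])                         # position of the deepest frame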
Figure 5
Neural-level implementation of a hierarchical generative model to estimate the depth of a point through Active Inference. The small squares indicate inhibitory connections. Unlike in a neural network, depth is estimated by first generating two predictions pr(i) of the point relative to each eye from a point in absolute coordinates μa and the vergence-accommodation angles μθ. Each new belief is in turn used to compute a projection pc(i) and finally a visual prediction pv(i). The predictions are then compared with the visual observations, generating prediction errors throughout the hierarchy and eventually driving the beliefs at the top toward the correct values. Note that eye movements are directly triggered to suppress the proprioceptive prediction error εp. Intentional eye movements (e.g., for target fixation) can instead be achieved by setting a prior in the dynamics function fc of the belief over the projected point μc (for better readability, the figure only shows the dynamics function fc).
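One consequence of this architecture is that once both eyes fixate the target, depth is fixed by the vergence geometry alone. As a hedged aside (standard triangulation with an assumed interocular baseline b and eye angles measured from straight ahead; not the paper's code):

    import numpy as np

    def depth_from_vergence(theta_l, theta_r, b=0.065):
        """Depth of a fixated midline target from the two eye angles
        (radians); tan(theta) = (b / 2) / depth for each eye."""
        return (b / 2) / np.tan((abs(theta_l) + abs(theta_r)) / 2)

    print(depth_from_vergence(np.deg2rad(1.86), np.deg2rad(1.86)))  # ~1 m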
Figure 6
Sequence of time frames of a depth estimation task with simultaneous target fixation. The agent uses alternating action–perception phases to avoid becoming stuck during the minimization process. Each frame is composed of three images: a third-person perspective of the overall task (top) and a first-person perspective consisting of the projections of the target onto the respective camera planes of each eye (bottom left and bottom right). In the top panel, the eyes are represented by blue circles, and the real and estimated target positions are shown in red and orange, respectively. The fixation trajectory (when vergence occurs) is represented in cyan, and the thin blue lines are the fixation angles of the eyes. In the bottom panels, the real and estimated target positions are again shown in red and orange; the abscissa and ordinate represent the target depth and its projection, respectively.
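The alternation described here can be caricatured as follows (a hypothetical scheduling sketch with a single eye in 1D, not the authors' code): perception refines the belief while the eye is still, then action rotates the eye toward the inferred target, and the cycle repeats.

    # One eye in 1D: the target sits at angle true_theta, and the
    # retinal observation is the target angle relative to the eye.
    true_theta = 0.6
    mu, eye = 0.0, 0.0                     # belief and eye angle
    for t in range(200):
        obs = true_theta - eye             # projection on the retina
        if (t // 20) % 2 == 0:             # perception phase
            mu += 0.1 * ((obs + eye) - mu) # infer the target angle
        else:                              # action phase
            eye += 0.1 * (mu - eye)        # verge toward the belief
    print(mu, eye)                         # both approach true_theta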
Figure 7
Simulation results. Performance of the depth estimation task with nonuniform (top) and uniform (bottom) foveal resolution during inference with the eyes parallel and fixed (infer parallel), inference with the eyes fixating on the target (infer vergence), and simultaneous inference and target fixation (active vision). The accuracy (left panel) measures the number of trials in which the agent successfully predicts the 2D position of the target, the mean error (middle panel) measures the distance between the real and estimated target positions at the end of every trial, and the time (right panel) measures the number of steps needed to correctly estimate the target.


References

    1. Qian N. Binocular disparity and the perception of depth. Neuron. 1997;18:359–368. doi: 10.1016/S0896-6273(00)81238-6. - DOI - PubMed
    2. Parker A.J. Binocular depth perception and the cerebral cortex. Nat. Rev. Neurosci. 2007;8:379–391. doi: 10.1038/nrn2131. - DOI - PubMed
    3. Durand J.B., Nelissen K., Joly O., Wardak C., Todd J.T., Norman J.F., Janssen P., Vanduffel W., Orban G.A. Anterior Regions of Monkey Parietal Cortex Process Visual 3D Shape. Neuron. 2007;55:493–505. doi: 10.1016/j.neuron.2007.06.040. - DOI - PMC - PubMed
    4. Welchman A.E., Deubelius A., Conrad V., Bülthoff H.H., Kourtzi Z. 3D shape perception from combined depth cues in human visual cortex. Nat. Neurosci. 2005;8:820–827. doi: 10.1038/nn1461. - DOI - PubMed
    5. Wismeijer D.A., Van Ee R., Erkelens C.J. Depth cues, rather than perceived depth, govern vergence. Exp. Brain Res. 2008;184:61–70. doi: 10.1007/s00221-007-1081-2. - DOI - PMC - PubMed
