Proc Natl Acad Sci U S A. 2020 Sep 29;117(39):24581-24589.
doi: 10.1073/pnas.2008961117. Epub 2020 Sep 16.

Hierarchical structure is employed by humans during visual motion perception

Johannes Bill et al. Proc Natl Acad Sci U S A. 2020.

Abstract

In the real world, complex dynamic scenes often arise from the composition of simpler parts. The visual system exploits this structure by hierarchically decomposing dynamic scenes: When we see a person walking on a train or an animal running in a herd, we recognize the individual's movement as nested within a reference frame that is, itself, moving. Despite its ubiquity, surprisingly little is understood about the computations underlying hierarchical motion perception. To address this gap, we developed a class of stimuli that grant tight control over statistical relations among object velocities in dynamic scenes. We first demonstrate that structured motion stimuli benefit human multiple object tracking performance. Computational analysis revealed that the performance gain is best explained by human participants making use of motion relations during tracking. A second experiment, using a motion prediction task, reinforced this conclusion and provided fine-grained information about how the visual system flexibly exploits motion structure.

Keywords: Bayesian inference; generative models; hierarchical structure; motion perception; multiple object tracking.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Modular representation of hierarchical motion structure. (A) Observed velocity components of a running human. The hands inherit motion from the arms, which inherit motion from the torso. (B) Corresponding nested hierarchy of motion relations. Observed velocity is the sum of local motion components. (C) Motion graph describing global motion with a strong (strength $\lambda_{\mathrm{glo}}$) shared motion source (gray node) and weaker (strength $\lambda_{\mathrm{ind}}$) individual motion sources (orange nodes). Here, the global motion source is not directly observed (i.e., latent) but introduces correlations in the motion of observable objects (orange). (D) Two motion clusters (Top) can be embedded into a deep hierarchy by adding another latent motion source at the tree’s root (Bottom). (E) Illustration of a stimulus of stochastically rotating dots with global motion structure. This is the class of stimuli used in the experiments.
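The additive composition described in panels B and C can be illustrated numerically. The sketch below is not the authors' stimulus code: it assumes that object velocities are Gaussian motion sources mixed by a hypothetical matrix `L`, whose first column is the shared latent source (weight `lam_glo`) and whose remaining columns are per-object sources (weight `lam_ind`). All names and numeric values are assumptions chosen for illustration, and the rotational dynamics of the actual stimuli are ignored.

```python
import numpy as np

def global_motion_matrix(n_objects, lam_glo, lam_ind):
    """Hypothetical mixing matrix: column 0 is the shared (latent) source,
    the remaining columns hold one individual source per object."""
    shared = lam_glo * np.ones((n_objects, 1))
    individual = lam_ind * np.eye(n_objects)
    return np.hstack([shared, individual])

def sample_velocities(L, n_samples, rng):
    """Observed velocity = sum of unit-variance motion sources weighted by L."""
    sources = rng.standard_normal((L.shape[1], n_samples))
    return L @ sources

rng = np.random.default_rng(0)
L = global_motion_matrix(n_objects=3, lam_glo=2.0, lam_ind=0.5)
v = sample_velocities(L, n_samples=100_000, rng=rng)

# Under these assumptions the covariance between object velocities is L @ L.T:
# the shared source contributes lam_glo**2 to every entry, and each individual
# source adds lam_ind**2 on the diagonal only.
cov_theory = L @ L.T
cov_sample = np.cov(v)
print(np.round(cov_theory, 2))
print(np.round(cov_sample, 2))
```

The off-diagonal entries of the covariance come entirely from the unobserved shared source, which is the sense in which a latent node introduces correlations among the observable objects.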
Fig. 2.
Use of motion structure knowledge during MOT. (A) Tested motion conditions included IND, GLO, CNT, and CDH motion; symbols mark targets. Two different target sets were tested for CDH motion. (B) Average performance (number of correctly identified targets) on different motion conditions by human participants, the Bayesian computational observer model using the correct motion structure, and the Bayesian computational observer model disregarding motion relations (IND prior). Using motion structure during inference is required to explain human performance gains on motion-structured stimuli. (C) The observer model consists of a Kalman filter with motion structure prior L (Left) and a mental assignment of dot identities (Center). Perceptual and neural noise can lead to ambiguous assignments and, ultimately, errors in the reported target set (Right). (D) Fraction of trials with zero, one, two, and three dots correct (red, orange, light green, and dark green, respectively) for human participants, the observer model with the correct motion prior, and the model with an IND prior.
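To make the contrast between a structured prior and an IND prior concrete, the following is a heavily simplified sketch, not the observer model from the paper. It assumes velocities follow a linear-Gaussian random walk with process covariance `L @ L.T`, observed with unit-variance noise, and it omits the circular geometry, the identity assignment stage, and any fitted parameters. The function name `kalman_mse` and all numeric values are hypothetical.

```python
import numpy as np

def kalman_mse(y, v_true, Q, R):
    """Run a Kalman filter that assumes random-walk dynamics with process
    noise Q and observation noise R; return the mean squared tracking error."""
    n = y.shape[1]
    mean = np.zeros(n)
    cov = np.eye(n)
    sq_err = 0.0
    for t in range(y.shape[0]):
        # Predict: the random walk keeps the mean and adds process noise.
        cov_pred = cov + Q
        # Update: blend prediction and observation via the Kalman gain.
        gain = cov_pred @ np.linalg.inv(cov_pred + R)
        mean = mean + gain @ (y[t] - mean)
        cov = (np.eye(n) - gain) @ cov_pred
        sq_err += np.mean((mean - v_true[t]) ** 2)
    return sq_err / y.shape[0]

rng = np.random.default_rng(1)
n, T = 3, 5000
# Assumed global-motion structure: one strong shared source, weak individual ones.
L = np.hstack([2.0 * np.ones((n, 1)), 0.5 * np.eye(n)])
Q_true = L @ L.T
R = np.eye(n)

steps = (L @ rng.standard_normal((L.shape[1], T))).T  # shape (T, n)
v_true = np.cumsum(steps, axis=0)
y = v_true + rng.standard_normal((T, n))

mse_structured = kalman_mse(y, v_true, Q_true, R)
# IND prior: same per-object variances, but correlations are discarded.
mse_ind = kalman_mse(y, v_true, np.diag(np.diag(Q_true)), R)
print(mse_structured, mse_ind)
```

In this toy setting the filter that knows about the shared source can pool observations across objects, so its tracking error is lower than that of the filter using only the diagonal of the covariance.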
Fig. 3.
Functional components underlying human MOT. (A) Alternative observer models with components added to or removed from the computational observer. The “momentum-free observer” lacks the concept of inertia. The “Weber’s law observer” adds two known psychophysical constraints: velocity-dependent observation noise and stochastic decision noise. (B) Average MOT performance on different motion conditions for the alternative observer models when either the correct motion structure or an IND prior is assumed (like Fig. 2B). Employing trajectory extrapolation is computationally indispensable on the circle, as highlighted by the momentum-free observer’s performance dropping to chance in the IND stimulus condition. Adding psychophysical constraints, in contrast, leads to an improved match with human performance. (C) Bayesian model comparison of employed motion structure priors based on the exact per-trial choice sets. Shown are log-likelihood ratios of human choice sets under different putative motion priors L relative to the true structure underlying the stimulus. Negative values indicate that the participant’s behavior was better explained by the correct motion prior. Human participants make use of motion structure knowledge during tracking, presumably employing approximately correctly structured priors. Yet, the limited information provided by the discrete response sets prevented insight into the exact structural features used. Each dot represents one participant per stimulus condition and putative motion prior. Horizontal lines show mean log-likelihood ratios across participants (values in parentheses). Asterisks indicate significance of paired t tests ($p < 0.05$, $0.01$, $10^{-3}$, and $10^{-4}$ for one, two, three, and four asterisks, respectively; ns, not significant).
Fig. 4.
Revealing human motion priors in a multiple object prediction task. (A) Illustration of the stimuli. The highlighted green and red dots disappeared after 5 s. Participants had to predict their location at the end of the trial. Dots were color-coded to indicate their role in the motion structure. Here, a GLO stimulus condition is illustrated. (B) Mean-squared prediction error of the green and red dots for all participants in the GLO condition. Due to task-inherent uncertainty, even perfect inference, as given by the mean values of a Kalman filter with the correct motion structure prior, will exhibit nonzero prediction errors (labeled “Bayes opt.”). Humans do not reach this optimal accuracy, but perform better than chance. (C) Human responses (dots) relative to the predictions of an observer model with correct GLO prior, in all 100 trials for the participant highlighted in B and E. The fitted observer model (ellipses) predicts human responses well (ellipses indicate 1, 2, 3 SD). (D) Same as C, but for an observer model assuming an IND motion prior. Neither the predicted locations nor the covariance in human prediction errors is captured by the model. (E) (Left) Tested motion conditions included GLO, CLU, and CDH motion. (Top) Putative motion priors tested for explaining human responses via a Bayesian observer model, ordered by their complexity. (Main Panel) Each cell shows per-participant log-likelihood model fit ratio for a particular motion prior, compared to the correct prior underlying the stimulus (indicated by gray background). Negative values indicate that the participant’s behavior was better explained by the correct motion prior. Humans flexibly employed correctly structured motion priors. Each dot represents one participant per stimulus condition and putative motion prior (comparison between C and D highlighted in orange). Horizontal lines show mean log-likelihood ratios across participants (values in parentheses). 
Asterisks indicate significance of paired t tests ($p < 10^{-3}$ and $p < 10^{-4}$, respectively).
Fig. 5.
Bias–variance decomposition for the prediction task. (A) Human prediction errors are assumed to be the sum of a systematic (bias) and a stochastic (noise) component. The relative contributions of each component can be estimated from repetitions of the same trial, leading to responses φ(1) and φ(2), and associated errors Δ(1) and Δ(2). (B) Noise factors for the green and the red dots, one marker per participant (color and filling) and motion condition (shape). A combination of systematic and stochastic errors underlies human suboptimality. Black cross and ellipsoids indicate mean and iso-density curves (1 to 3 SDs) of a bivariate Gaussian fitted to all noise factors.
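The idea in panel A, separating systematic from stochastic error using two repetitions of the same trial, can be sketched as follows. This is an illustrative estimator under stated assumptions, not necessarily the one used in the paper: each error is modeled as a per-trial bias shared by both repetitions plus independent zero-mean noise, and errors are treated as plain scalars (angular wrap-around is ignored). The function name and the `noise_fraction` quantity are hypothetical and should not be read as the paper's definition of the noise factor.

```python
import numpy as np

def decompose_errors(delta1, delta2):
    """Split the mean squared error into systematic and stochastic parts.

    Assumes delta_k = bias + noise_k, where the bias is identical across the
    two repetitions of a trial and the noise terms are independent, zero mean.
    """
    delta1 = np.asarray(delta1, dtype=float)
    delta2 = np.asarray(delta2, dtype=float)
    # The shared bias cancels in the difference, leaving twice the noise variance.
    noise_var = 0.5 * np.mean((delta1 - delta2) ** 2)
    # Independent noise averages out of the cross product, leaving bias squared.
    bias_sq = np.mean(delta1 * delta2)
    total = 0.5 * np.mean(delta1 ** 2 + delta2 ** 2)
    return bias_sq, noise_var, total

# Synthetic check with known ground truth (values are arbitrary assumptions).
rng = np.random.default_rng(2)
n_trials = 50_000
bias = 0.3 * rng.standard_normal(n_trials)          # systematic error per trial
delta1 = bias + 0.4 * rng.standard_normal(n_trials)  # repetition 1
delta2 = bias + 0.4 * rng.standard_normal(n_trials)  # repetition 2

bias_sq, noise_var, total = decompose_errors(delta1, delta2)
noise_fraction = noise_var / total
print(bias_sq, noise_var, noise_fraction)
```

By construction the two estimated parts sum exactly to the total mean squared error, so the noise fraction lies between 0 (purely systematic errors) and 1 (purely stochastic errors) up to sampling variability.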

