PLoS Comput Biol. 2017 Oct 9;13(10):e1005743.
doi: 10.1371/journal.pcbi.1005743. eCollection 2017 Oct.

Object detection through search with a foveated visual system

Emre Akbas et al. PLoS Comput Biol. 2017.

Abstract

Humans and many other species sense visual information with varying spatial resolution across the visual field (foveated vision) and deploy eye movements to actively sample regions of interest in scenes. The advantage of such a varying-resolution architecture is a reduced computational, and hence metabolic, cost. But what are the performance costs of such a processing strategy relative to a scheme that processes the visual field at high spatial resolution? Here we first focus on visual search and combine object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We develop a foveated object detector that processes the entire scene with varying resolution, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image and integrates observations across multiple fixations. We compared the foveated object detector against a non-foveated version of the same object detector, which processes the entire image at homogeneous high spatial resolution. We evaluated the accuracy of the foveated and non-foveated object detectors in identifying 20 different object classes in scenes from a standard computer vision data set (the PASCAL VOC 2007 dataset). We show that the foveated object detector can approximate the performance of the object detector with homogeneous high spatial resolution processing while bringing significant computational cost savings. Additionally, we assessed the impact of foveation on the computation of bottom-up saliency. An implementation of a simple foveated bottom-up saliency model with eye movements showed agreement in the selection of top salient regions of scenes with those selected by a non-foveated high resolution saliency model.
Together, our results might help explain the evolution of foveated visual systems with eye movements as a solution that preserves perceptual performance in visual search while resulting in computational and metabolic savings to the brain.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. The foveated visual field of the proposed object detector.
Square blue boxes with white borders at the center are foveal pooling regions. Around them are peripheral pooling regions, which are radially elongated. The sizes of peripheral regions increase with distance from the fixation point, which is at the center of the fovea. The color within the peripheral regions represents pooling weights.
Fig 2. Flowchart of the non-foveated sliding window (SW) model and the foveated object detector (FOD).
The feature extraction step is common to both models. First, the image is filtered with simple edge detection filters with different orientations, and gradient magnitude and orientation are estimated at each pixel. Then, the image is divided into small square boxes on a regular grid. Within each box, total gradient magnitude per orientation is computed, which results in a histogram. The output is a collection of feature maps for x, y locations and orientations. For simplicity, only one feature map (H) is shown as input to both models. Right side: Foveated Object Detector. The FOD has an initial fixation position that determines the pooling regions of the underlying histogram of gradient features. FOD’s templates are learned through training and are specific to each retinotopic location. The scores reflecting probability of target presence are used to guide saccades to the most likely target location. The object probability scores for each location are integrated across saccades and used for the final perceptual decision.
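The fixation loop described for the FOD can be sketched as follows. This is a toy illustration, not the authors' implementation: the caption does not say how scores are integrated, so summing per-location evidence across fixations is an assumption, and `search`, `toy_observe`, the 1-D location grid, the acuity falloff and the distractor are all hypothetical.

```python
TARGET, DISTRACTOR = 7, 2  # hypothetical positions on a 1-D "image"

def toy_observe(fixation, loc):
    """Hypothetical evidence for 'target at loc' when fixating `fixation`.

    Evidence gets weaker with distance from the fovea. The distractor
    looks target-like in the periphery but is rejected when seen up close.
    """
    d = abs(fixation - loc)
    acuity = 1.0 / (1.0 + d)
    if loc == TARGET:
        return acuity
    if loc == DISTRACTOR and d >= 2:
        return 0.3
    return -acuity

def search(observe, locations, start, n_fixations):
    """Accumulate evidence over fixations; saccade to the current best location."""
    total = {loc: 0.0 for loc in locations}
    fixation = start
    path = [fixation]
    for i in range(n_fixations):
        # Score every location from the current fixation and integrate (assumed: sum).
        for loc in locations:
            total[loc] += observe(fixation, loc)
        # MAP-style rule: move the fovea to the most likely target location.
        if i < n_fixations - 1:
            fixation = max(locations, key=lambda loc: total[loc])
            path.append(fixation)
    decision = max(locations, key=lambda loc: total[loc])
    return decision, path, total

decision, path, _ = search(toy_observe, range(10), start=0, n_fixations=3)
print(decision, path)  # → 7 [0, 2, 7]
```

In this toy run the first saccade goes to the distractor, which is rejected once foveated, and the accumulated evidence then leads to the target.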
Fig 3. Histogram of oriented gradients (HoG) of a sample image.
Left: input image; right: HoG result. First, the input image is convolved with two 1-D filters, namely [+1 0 −1] and its transpose. The gradient magnitude and orientation at each pixel are estimated from the convolution results. Then, the image is divided into small, square bins. In each bin, an orientation histogram is computed, which shows the (relative) total gradient magnitude per orientation. Finally, the histogram in each bin is normalized by the total “energy” (e.g. L2 norm) of a 2x2 block containing the bin, akin to divisive local contrast normalization. This final step is known as block normalization. On the right, each HoG bin is represented with short, oriented line segments where brightness encodes the magnitude of the associated orientation. Due to the block normalization, in homogeneous areas (e.g. top-right) all orientations have high and similar magnitudes. (Image source statement: the original picture on the left was taken by the first author.)
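The steps in this caption can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the paper's feature code: the function name `hog`, the bin size of 4 pixels, the 9 unsigned orientation channels, the replicated image borders, and the choice to normalize each bin by the single 2x2 block that starts at it are all assumptions made for illustration.

```python
import math

def hog(image, cell=4, n_bins=9, eps=1e-6):
    """image: list of rows of floats. Returns out[cy][cx][orientation]."""
    h, w = len(image), len(image[0])
    n_cy, n_cx = h // cell, w // cell
    hist = [[[0.0] * n_bins for _ in range(n_cx)] for _ in range(n_cy)]
    for y in range(n_cy * cell):
        for x in range(n_cx * cell):
            # Centered differences, i.e. the [+1 0 -1] filter and its transpose
            # (borders replicated; the sign does not matter for unsigned bins).
            gx = image[y][min(x + 1, w - 1)] - image[y][max(x - 1, 0)]
            gy = image[min(y + 1, h - 1)][x] - image[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            theta = math.atan2(gy, gx) % math.pi  # unsigned orientation
            b = min(int(theta / math.pi * n_bins), n_bins - 1)
            # Accumulate gradient magnitude per orientation within the bin.
            hist[y // cell][x // cell][b] += mag
    out = [[None] * n_cx for _ in range(n_cy)]
    for cy in range(n_cy):
        for cx in range(n_cx):
            # Block normalization: divide by the L2 energy of a 2x2 block
            # of bins (clipped at the image border).
            block = [hist[by][bx]
                     for by in range(cy, min(cy + 2, n_cy))
                     for bx in range(cx, min(cx + 2, n_cx))]
            energy = math.sqrt(sum(v * v for hb in block for v in hb))
            out[cy][cx] = [v / (energy + eps) for v in hist[cy][cx]]
    return out

# A vertical step edge: all gradient energy falls into orientation channel 0.
edge = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
features = hog(edge)
```

Full HoG implementations typically normalize each bin against several overlapping blocks; the single-block version here only keeps the sketch short.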
Fig 4. Ratio of mean average precision (AP) scores of FOD systems relative to that of the non-foveated SW system.
The graph shows two eye movement algorithms, maximum a posteriori probability (MAP) and random (RAND), and two starting points (C: center of the image; E: left or right edge of the image).
Fig 5. Area under the precision-recall curve (AP scores) achieved by the non-foveated (SW) model and the foveated object detector with a maximum a posteriori eye movement strategy and a starting fixation point to the side of the image (MAP-E).
Symbols represent each object class type. Identity (diagonal) line corresponds to equal performance across models.
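For readers unfamiliar with AP scores, the sketch below computes one common variant, the 11-point interpolated average precision associated with the PASCAL VOC 2007 evaluation. Whether the paper used exactly this variant is an assumption; the function name and the toy detections are hypothetical.

```python
def average_precision_11pt(scores, is_true_positive, n_positives):
    """11-point interpolated AP for one object class.

    scores: detection confidences; is_true_positive: 1/0 per detection;
    n_positives: number of ground-truth objects of this class.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = 0
    recalls, precisions = [], []
    for rank, i in enumerate(order, start=1):
        tp += is_true_positive[i]
        recalls.append(tp / n_positives)
        precisions.append(tp / rank)
    ap = 0.0
    for k in range(11):
        t = k / 10
        # Interpolated precision: best precision at any recall >= t.
        ap += max((p for r, p in zip(recalls, precisions) if r >= t),
                  default=0.0)
    return ap / 11

# Toy detections: two true objects, four detections ranked by score.
ap = average_precision_11pt([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], n_positives=2)
print(round(ap, 4))  # → 0.8485
```

A point on the identity line of the figure corresponds to a class for which both models obtain the same value of this score.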
Fig 6. FOD-DPM’s performance (mean AP over all 20 classes) as a function of number of fixations.
FOD-DPM achieves SW-DPM’s performance at 11 fixations and exceeds it with more fixations.
Fig 7. Per class AP scores achieved by FOD-DPM and non-foveated SW-DPM.
Fig 8. Fixation locations and bounding box predictions of the FOD for three different object classes (person, car and bicycle) in the same image and from the same initial point of fixation.
Top-left: original image (source: https://www.flickr.com/photos/kristoffer-trolle/27882648666/, Creative Commons license); top-right: person detection; bottom-left: car detections; bottom-right: bicycle detection. Yellow dots show fixation points, yellow numbers indicate the sequence of fixations, and the bounding boxes are the final detections.
Fig 9. Performance comparison of the foveated saliency model versus the non-foveated saliency model.
We ran both models on the simple task of identifying the topmost salient location in 100 natural images randomly selected from the PASCAL VOC 2007 dataset. The blue curve plots the average distance (in degrees) between the topmost salient locations, S1 and S2, found by the foveated and the non-foveated model, respectively, on the same image. Note that this location is unique and fixed for the non-foveated model, while it changes for the foveated model as the model explores the image, i.e. makes more and more fixations. The red curve plots the average number of iso-orientation suppression operations of the foveated model relative to that of the non-foveated model. Again, the number of such operations is fixed for the non-foveated model, but it changes for the foveated model with the number of fixations. The foveated model finds the same topmost salient location as the non-foveated model after 16 fixations. Notably, after 8 fixations, the distance between S1 and S2 becomes less than 1 degree. The foveated model achieves this level of accuracy with 42% fewer iso-orientation suppression operations than the non-foveated model.
Fig 10. Illustration of the visual field of the model.
(a) The model is fixating at the red cross mark on the image (see Fig 8’s caption for the source of the image). (b) Visual field (Fig 1) overlaid on the image, centered at the fixation location. White lines delineate the borders of pooling regions. Nearby pooling regions do overlap. The weights (Fig 1) of a pooling region decrease sharply outside of its shown borders. White borders are actually iso-weight contours for neighboring regions. Colored bounding boxes show the templates of three components on the visual field: red, a template within the fovea; blue and green, two peripheral templates at 2.8 and 7 degrees in the periphery, respectively. (c, d, e) Zoomed-in versions of the red (foveal), blue (peripheral) and green (peripheral) templates. The weights of a template, w_i, are defined on the gray shaded pooling regions.
Fig 11. Two bounding boxes (A, B) are shown on the visual field.
While box A covers a large portion of the pooling regions that it intersects, box B’s coverage is not as good. Box B is discarded because it does not meet the overlap criteria (see text); therefore, a component for B is not created in the model.
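The idea of a coverage test can be sketched as follows. The actual overlap criteria are given in the paper's text and are not reproduced here, so everything in this sketch is an assumption made for illustration: pooling regions are simplified to non-overlapping axis-aligned rectangles `(x0, y0, x1, y1)`, coverage is defined as the fraction of the intersected regions' area that lies inside the box, and the 0.5 threshold is hypothetical.

```python
def area(r):
    """Area of rectangle (x0, y0, x1, y1); zero if degenerate."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def intersection(a, b):
    """Overlap rectangle of a and b (may be degenerate)."""
    return (max(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), min(a[3], b[3]))

def coverage(box, regions):
    """Fraction of the area of the regions touched by `box` that lies inside it."""
    touched = [r for r in regions if area(intersection(box, r)) > 0]
    if not touched:
        return 0.0
    covered = sum(area(intersection(box, r)) for r in touched)
    return covered / sum(area(r) for r in touched)

# Hypothetical layout: four square pooling regions of side 4.
regions = [(0, 0, 4, 4), (4, 0, 8, 4), (0, 4, 4, 8), (4, 4, 8, 8)]
box_a = (0, 0, 8, 4)   # aligned with two regions
box_b = (3, 3, 5, 5)   # clips a corner of all four regions
THRESHOLD = 0.5        # hypothetical cut-off
for name, box in (("A", box_a), ("B", box_b)):
    c = coverage(box, regions)
    print(name, c, "kept" if c >= THRESHOLD else "discarded")
# → A 1.0 kept
# → B 0.0625 discarded
```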
