The contribution of object identity and configuration to scene representation in convolutional neural networks

Kevin Tang et al. PLoS One. 2022 Jun 28;17(6):e0270667.
doi: 10.1371/journal.pone.0270667. eCollection 2022.
Abstract

Scene perception involves extracting the identities of the objects comprising a scene in conjunction with their configuration (the spatial layout of the objects in the scene). How object identity and configuration information are weighted during scene processing, however, and how this weighting evolves over the course of processing, is not fully understood. Recent work with convolutional neural networks (CNNs) has demonstrated their aptitude at scene processing tasks and identified correlations between processing in CNNs and in the human brain. Here we examined four CNN architectures (AlexNet, ResNet18, ResNet50, DenseNet161) and their sensitivity to changes in object and configuration information over the course of scene processing. Despite architectural differences, all four CNNs showed a common pattern in their responses to object identity and configuration changes: greater sensitivity to configuration changes in early stages of processing and stronger sensitivity to object identity changes in later stages. This pattern persisted regardless of the spatial structure present in the image background, the accuracy of the CNN in classifying the scene, and even the task used to train the CNN. Importantly, CNNs' sensitivity to a configuration change is not the same as their sensitivity to any type of position change, such as that induced by a uniform translation of the objects without a configuration change. These results provide one of the first documentations of how object identity and configuration information are weighted in CNNs during scene processing.
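The article does not include code; the sketch below illustrates, under stated assumptions, how per-layer responses to an image pair could be extracted and compared. It uses PyTorch forward hooks on torchvision's AlexNet (ImageNet weights standing in for the trained networks analyzed in the paper); the layer sampling, preprocessing, and file names are illustrative assumptions, not the authors' pipeline.

    # Minimal sketch (not the authors' code): record per-layer activations for an
    # image with forward hooks, then compute the Euclidean distance between two
    # images' activation patterns at each sampled layer.
    # Assumptions: torchvision AlexNet with ImageNet weights; conv/linear layers
    # sampled; standard ImageNet preprocessing; hypothetical file names.
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

    activations = {}
    def make_hook(name):
        def hook(module, inputs, output):
            activations[name] = output.detach().flatten()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            module.register_forward_hook(make_hook(name))

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def layer_responses(image_path):
        """Return {layer_name: flattened activation vector} for one image."""
        activations.clear()
        img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            model(img)
        return dict(activations)

    # Euclidean distance at each sampled layer between a pair of images,
    # e.g. two scenes differing only in object configuration.
    resp_a = layer_responses("scene_config1.png")  # hypothetical file names
    resp_b = layer_responses("scene_config2.png")
    distances = {layer: torch.dist(resp_a[layer], resp_b[layer]).item()
                 for layer in resp_a}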


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Image collections with different background manipulations.
(A) and (B) Two example image sets from the “Fixed Background” image collection. All image sets in this collection had an identical room background. Each image set consisted of four unique images, constructed from two sets of objects and two configurations. Two images within a set could express the same configuration but with different objects (see the two images in the red box), or have the same objects but different configurations (see the two images in the green box). (C) Two example images from the “No Background” image collection. All image sets in this collection were shown without a room background. (D) Two example images from the “Variable Background” image collection. Each image set in this collection had a unique room background.
Fig 2
Fig 2. An example image set showing object configuration change and spatial shift within the same set of images.
These images were modified from the “No Background” image set shown in Fig 1. Each image set contained eight different images varying in configuration, objects, and spatial shift. (A) The grouping of the images based on the direction of the spatial shift (left or right). In Left Shifted Set, all four images had the same amount of leftward spatial shift but varied in object identity and configuration (e.g., the two images in the red box varied in object identity and the two in the green box varied in configuration). The same applied to the images in Right Shifted Set. (B) The grouping of the images based on configuration. In Configuration Set 1, all four images had the same configuration but varied in object identity and spatial shift (e.g., the two images in the red box varied in object identity and the two in the blue box varied in spatial shift). The same applied to the images in Configuration Set 2.
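For illustration, the eight-image sets described above follow a 2 (object set) × 2 (configuration) × 2 (shift direction) factorial design. Below is a small sketch of how such a set could be enumerated and grouped into the pairings in the caption; the labels are hypothetical, not the authors' stimulus naming.

    # Hypothetical enumeration of one eight-image set from Fig 2:
    # 2 object sets x 2 configurations x 2 shift directions.
    from itertools import product

    images = [
        {"objects": o, "configuration": c, "shift": s}
        for o, c, s in product(("objects1", "objects2"),
                               ("config1", "config2"),
                               ("left", "right"))
    ]

    def pairs_differing_only_in(factor):
        """Image pairs that differ in `factor` but match on the other two factors."""
        others = [k for k in ("objects", "configuration", "shift") if k != factor]
        return [(a, b) for i, a in enumerate(images) for b in images[i + 1:]
                if a[factor] != b[factor] and all(a[k] == b[k] for k in others)]

    identity_pairs = pairs_differing_only_in("objects")             # e.g. red boxes
    configuration_pairs = pairs_differing_only_in("configuration")  # e.g. green boxes
    shift_pairs = pairs_differing_only_in("shift")                  # e.g. blue boxes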
Fig 3
Fig 3. Euclidean distance and index measures quantifying a scene-trained CNN’s absolute and relative sensitivity to object identity and configuration changes for the different image collections.
The Euclidean distance for “Configuration” was calculated as the distance between two images sharing the same objects but different configurations. The Euclidean distance for “Object Identity” was calculated as the distance between two images sharing the same configuration but different objects. The object dominance index measured the relative sensitivity to object identity and configuration changes, with negative values indicating greater sensitivity to configuration than identity changes and positive values the reverse. (A) Euclidean distances for the “Fixed Background” image collection. (B) Euclidean distances for the “No Background” image collection. (C) Euclidean distances for the “Variable Background” image collection. (D) The object dominance indices for all three image collections. (E) The object dominance indices for the “Variable Background” image collection separated by scene classification accuracy. High/Low/Full Group, indices for the top-half/bottom-half/full image sets. Error bars indicate 95% confidence intervals of the means. In (A), (B), and (C), asterisks indicate significance from pairwise comparisons between the Euclidean distance measures using two-tailed t tests (all corrected for multiple comparisons). All values were significantly non-zero (ts > 10, ps < .001). In (D), the asterisks indicate the significance levels of the differences of each measure at each sampled CNN layer against zero using two-tailed t tests (all corrected for multiple comparisons). * p < .05, ** p < .01, *** p < .001. The plotting lines are slightly shifted horizontally with respect to each other to minimize overlap.
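The caption defines the two Euclidean distances but does not spell out the formula for the object dominance index. A normalized-difference form consistent with the stated sign convention (negative when configuration changes dominate, positive when object identity changes dominate) is one plausible, assumed formulation:

    # Sketch of a per-layer object dominance index. The exact formula is not given
    # in the caption; a normalized difference consistent with the stated sign
    # convention is assumed here.
    def object_dominance_index(d_object, d_configuration):
        """d_object: distance for pairs differing only in object identity.
        d_configuration: distance for pairs differing only in configuration."""
        return (d_object - d_configuration) / (d_object + d_configuration)

    # Example: a layer more sensitive to a configuration change than an identity change.
    print(object_dominance_index(d_object=4.0, d_configuration=6.0))  # -0.2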
Fig 4
Fig 4. Euclidean distance and index measures examining the effect of training, with these two types of measures quantifying a CNN’s absolute and relative sensitivity to object identity and configuration changes, respectively.
The Euclidean distance for “Configuration” was calculated as the distance between two images sharing the same objects but different configurations. The Euclidean distance for “Object Identity” was calculated as the distance between two images sharing the same configuration but different objects. The object dominance index measured the relative sensitivity to object identity and configuration changes, with negative values indicating greater sensitivity to configuration than identity changes and positive values the reverse. (A) Euclidean distances for the scene-trained and object-trained CNNs. (B) Euclidean distances for the scene-trained and the untrained CNNs. (C) The object dominance indices for the different training regimes. (D) Pairwise correlations of the object dominance index curves across CNN layers for the three training regimes. The asterisks indicate the significance levels of the pairwise comparisons at each sampled CNN layer using two-tailed t tests (all corrected for multiple comparisons). In (A) and (B), the top row is from comparing “Configuration” across the two types of training, and the bottom row is from comparing “Object Identity” across the two types of training. All values were significantly non-zero (ts > 10, ps < .001). In (C), the asterisks from the top to bottom rows indicate, respectively, differences between scene-trained and object-trained CNNs, between object-trained and untrained CNNs, between scene-trained and untrained CNNs, between scene-trained CNNs and zero, between object-trained CNNs and zero, and between untrained CNNs and zero. Error bars indicate within-image sets 95% confidence intervals of the means. * p < .05, ** p < .01, *** p < .001. The plotting lines are slightly shifted horizontally with respect to each other to minimize overlap.
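Panel (D) correlates the object dominance index curves across layers between training regimes. Below is a sketch of that comparison, assuming torchvision's ImageNet-pretrained ResNet18 as the object-trained network and a randomly initialized copy as the untrained control (scene-trained weights are not bundled with torchvision); the index values shown are placeholders, not results from the paper.

    # Sketch: correlate object dominance index curves (one value per sampled layer)
    # between two training regimes. The curves would be computed from each model
    # using the distance and index sketches above; the values below are
    # placeholders for illustration only.
    import numpy as np
    from scipy.stats import pearsonr
    import torchvision.models as models

    object_trained = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    untrained = models.resnet18(weights=None).eval()  # random initialization

    object_trained_curve = np.array([-0.30, -0.18, -0.05, 0.10, 0.25])  # placeholder
    untrained_curve = np.array([-0.28, -0.20, -0.10, 0.02, 0.12])       # placeholder

    r, p = pearsonr(object_trained_curve, untrained_curve)
    print(f"cross-regime correlation of index curves: r = {r:.2f}, p = {p:.3f}")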
Fig 5
Fig 5. Euclidean distance and index measures comparing the effect of configuration change and translation.
These two measures were obtained from scene-trained networks and quantify a CNN’s absolute and relative sensitivity to object identity and configuration changes as well as left/right translation. (A) Euclidean distances for changes in configuration, translation, and object identity. The Euclidean distance for “Configuration” was calculated as the distance between two images sharing the same objects and translation but different configurations. The Euclidean distance for “Translation” was calculated as the distance between two images sharing the same objects and configuration but different translation. The Euclidean distance for “Object Identity” was calculated as the distance between two images sharing the same configuration and translation but different objects. (B) The object dominance indices for configuration change and translation. The object-over-configuration indices measured the relative sensitivity to object identity and configuration changes, with negative values indicating greater sensitivity to configuration than identity changes and positive values the reverse. The object-over-translation indices measured the relative sensitivity to object identity and translation changes, with negative values indicating greater sensitivity to translation than identity changes and positive values the reverse. The asterisks indicate the significance levels of the pairwise comparisons made at each sampled layer using two-tailed t tests (all corrected for multiple comparisons). In (A), the asterisks from the top to bottom rows indicate, respectively, differences between “Configuration” and “Translation”, between “Configuration” and “Object Identity”, and between “Translation” and “Object Identity.” In (B), the asterisks from the top to bottom rows indicate, respectively, differences between the two index measures, differences between the object-over-configuration indices and zero, and differences between the object-over-translation indices and zero. Error bars indicate within-image sets 95% confidence intervals of the means. * p < .05, ** p < .01, *** p < .001. The plotting lines are slightly shifted horizontally to minimize overlap.
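The captions report two-tailed t tests at each sampled layer, corrected for multiple comparisons, without specifying the correction here; the sketch below uses paired t tests with a Bonferroni correction over the number of sampled layers as an illustrative choice, applied to placeholder distances.

    # Sketch: paired two-tailed t test across image sets at one layer, with a
    # Bonferroni correction over the sampled layers. The correction method and the
    # distances below are illustrative assumptions, not the authors' exact analysis.
    import numpy as np
    from scipy.stats import ttest_rel

    def compare_measures(dists_a, dists_b, n_layers_tested):
        """dists_a, dists_b: per-image-set Euclidean distances for two change types."""
        t, p = ttest_rel(dists_a, dists_b)       # two-tailed by default
        return t, min(p * n_layers_tested, 1.0)  # Bonferroni-corrected p value

    rng = np.random.default_rng(0)
    config_dists = rng.normal(6.0, 1.0, size=40)        # placeholder distances
    translation_dists = rng.normal(4.5, 1.0, size=40)   # placeholder distances
    print(compare_measures(config_dists, translation_dists, n_layers_tested=8))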

