A deep-learning framework for human perception of abstract art composition

Pierre Lelièvre et al. J Vis. 2021 May 3;21(5):9. doi: 10.1167/jov.21.5.9.
Abstract

Artistic composition (the structural organization of pictorial elements) is often characterized by some basic rules and heuristics, but art history does not offer quantitative tools for segmenting individual elements, measuring their interactions and related operations. To discover whether a metric description of this kind is even possible, we exploit a deep-learning algorithm that attempts to capture the perceptual mechanism underlying composition in humans. We rely on a robust behavioral marker with known relevance to higher-level vision: orientation judgements, that is, telling whether a painting is hung "right-side up." Humans can perform this task, even for abstract paintings. To account for this finding, existing models rely on "meaningful" content or specific image statistics, often in accordance with explicit rules from art theory. Our approach does not commit to any such assumptions/schemes, yet it outperforms previous models and for a larger database, encompassing a wide range of painting styles. Moreover, our model correctly reproduces human performance across several measurements from a new web-based experiment designed to test whole paintings, as well as painting fragments matched to the receptive-field size of different depths in the model. By exploiting this approach, we show that our deep learning model captures relevant characteristics of human orientation perception across styles and granularities. Interestingly, the more abstract the painting, the more our model relies on extended spatial integration of cues, a property supported by deeper layers.


Figures

Figure 1.
Gallery of genres and styles mentioned throughout the paper. Ordering is chronological. (Mona Lisa by Leonardo da Vinci (1503-1519), Still-Life with Drinking-Horn by Willem Kalf (1653), The Meeting (Bonjour Monsieur Courbet) by Gustave Courbet (1854), Argenteuil seen from the small arm of the Seine by Claude Monet (1872), Young Girls on the Edge of the Sea by Pierre Puvis de Chavannes (1879), The Scream by Edvard Munch (1893), Seated man with his arms crossed by Pablo Picasso (1915), Komposition VII by Wassily Kandinsky (1913), A Naturalist's Study by Pierre Roy (1928)).
Figure 2.
Schematic architecture of the multilevel orientation classification model employed in this study. Each of five convolutional blocks is associated with a classifier (indicated by classifier-n with n = 1 to 5). The output dimensionality of each classifier is indicated by (x, x, 4), where x is the number of samples across each spatial dimension (see density of circle array within insets overlaying local filters onto painting), and 4 is the number of orientation labels {up,90,180,270}. The four values within [ ] show one example of the categorical distribution generated by the network for Komposition VIII by Wassily Kandinsky (1923). In the legend, k/s stand for kernel/stride size.
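For readers who want a concrete picture of this multilevel design, the following is a minimal PyTorch-style sketch, not the authors' exact implementation: the channel counts and k/s settings are illustrative assumptions. Each of five convolutional blocks feeds a 1x1-convolution classifier head that emits the (x, x, 4) grid of orientation logits described in the caption.

    # Minimal sketch of a five-level orientation classifier (PyTorch).
    # Channel counts and kernel/stride settings are illustrative assumptions.
    import torch
    import torch.nn as nn

    class MultilevelOrientationNet(nn.Module):
        def __init__(self, channels=(64, 128, 256, 512, 512)):
            super().__init__()
            self.blocks = nn.ModuleList()
            self.heads = nn.ModuleList()
            in_ch = 3
            for out_ch in channels:
                # Convolutional block: 3x3 conv (k/s = 3/1) + ReLU, then 2x2 max-pool (k/s = 2/2).
                self.blocks.append(nn.Sequential(
                    nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                    nn.ReLU(),
                    nn.MaxPool2d(kernel_size=2, stride=2)))
                # classifier-n: 1x1 conv mapping each spatial sample to 4 orientation
                # logits, i.e., the (x, x, 4) output described in the caption.
                self.heads.append(nn.Conv2d(out_ch, 4, kernel_size=1))
                in_ch = out_ch

        def forward(self, x):
            outputs = []
            for block, head in zip(self.blocks, self.heads):
                x = block(x)
                outputs.append(head(x))  # (batch, 4, x, x) logits for {up, 90, 180, 270}
            return outputs

    # Five categorical maps, one per classifier depth; spatial density x shrinks
    # with depth, as in the circle-array insets of the figure.
    maps = MultilevelOrientationNet()(torch.randn(1, 3, 256, 256))
    probs = [m.softmax(dim=1) for m in maps]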
Figure 3.
Effect of median filtering on network attention, visualized through guided error back-propagation. The error map is inverted and thresholded for legibility. Light gray indicates pixels where attention reaches at least 1% of its maximum (moderate attention); dark gray indicates pixels where it exceeds 10% (high attention). (a) Original images used for training. (b) Directed attention without median filtering applied to the borders; (c) with median filtering. Two examples by Paul Klee are shown: The Place of the Twins (1929) and After Annealing (1940).
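The border filtering and two-level thresholding described here can be sketched as follows; the margin width, filter size, and gray levels are assumptions for illustration, not the values used in the paper.

    # Sketch of (1) median filtering of painting borders, which suppresses
    # framing cues, and (2) the two-level attention threshold used for
    # visualization. Works on 2-D grayscale arrays; parameters are assumed.
    import numpy as np
    from scipy.ndimage import median_filter

    def filter_borders(img, margin=8, size=9):
        """Replace a band of `margin` pixels along each border with a median-filtered copy."""
        out = img.copy()
        filt = median_filter(img, size=size)
        out[:margin], out[-margin:] = filt[:margin], filt[-margin:]
        out[:, :margin], out[:, -margin:] = filt[:, :margin], filt[:, -margin:]
        return out

    def threshold_attention(attn, low=0.01, high=0.10):
        """Map a non-negative attention array to white / light gray / dark gray, as in the figure."""
        a = attn / attn.max()
        vis = np.full(a.shape, 255, dtype=np.uint8)  # white: negligible attention
        vis[a >= low] = 192                          # light gray: >= 1% of maximum
        vis[a > high] = 96                           # dark gray: > 10% of maximum
        return vis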
Figure 4.
Model performance on whole paintings grouped by genre (a) and style (b).
Figure 5.
Network attention through guided error back-propagation (see Methods). (a) Five examples of original inputs for validation (Komposition VII by Wassily Kandinsky (1913), Still-Life with Drinking-Horn by Willem Kalf (1653), Argenteuil seen from the small arm of the Seine by Claude Monet (1872), The Meeting (Bonjour Monsieur Courbet) by Gustave Courbet (1854), Mona Lisa by Leonardo da Vinci (1503-1519)). (b) Error maps with inverted and thresholded intensity. Light gray indicates pixels where attention reaches at least 1% of its maximum (moderate attention); dark gray indicates pixels where it exceeds 10% (high attention). Numeric values report light and dark pixel percentages over the entire painting surface. (c) Average surface ratio of high attention, plotted separately for different genres.
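Guided back-propagation of this kind can be approximated with backward hooks; the sketch below assumes a PyTorch model returning (batch, 4) orientation logits and non-inplace ReLUs, and is not the authors' exact procedure.

    # Minimal sketch of guided back-propagation: ReLU backward gradients are
    # clamped to be non-negative, so only positive evidence flows to the input.
    import torch
    import torch.nn as nn

    def add_guided_relu_hooks(model):
        hooks = []
        for m in model.modules():
            if isinstance(m, nn.ReLU):  # ReLUs must be non-inplace for full backward hooks
                hooks.append(m.register_full_backward_hook(
                    lambda mod, grad_in, grad_out: (torch.clamp(grad_in[0], min=0.0),)))
        return hooks

    def guided_attention(model, image, target):
        """image: (1, 3, H, W); target: orientation index in {0..3}. Returns a (1, H, W) map."""
        image = image.clone().requires_grad_(True)
        hooks = add_guided_relu_hooks(model)
        logits = model(image)               # assumes (batch, 4) logits
        logits[:, target].sum().backward()
        for h in hooks:
            h.remove()                      # restore normal back-propagation
        return image.grad.abs().amax(dim=1) # per-pixel attention, max over color channels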
Figure 6.
(a) Model performance across classifiers, grouped by style (as in Figure 4b) and displayed separately for the five distinct classifiers. (b) The values from (a) after rescaling between chance and the maximum value for a given style (corresponding to the performance of classifier-5).
Figure 7.
Predicted orientations from individual receptive field units within each classifier. Different classifiers (1–5) are plotted from left to right. Relative size of the four wedges within each circle reflects prediction strength across the four different orientations. Examples are shown for three paintings (dates given when known): Argenteuil seen from the small arm of the Seine by Claude Monet (1872), The Waterfall of Amida behind the Kiso Road by Katsushika Hokusai, After Annealing by Paul Klee (1940).
Figure 8.
Redundancy between adjacent classifiers, grouped by style. This metric corresponds to a rescaled cross-entropy between the classifier distributions at level n and those at level n+1 (see Methods). Values are averaged across fragments. Along the x axis, c.p. stands for ceiling performance.
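As a rough illustration of such a metric (the exact rescaling is defined in the paper's Methods, so treat this as an assumption), one can rescale the cross-entropy between two 4-way orientation distributions so that 1 means the distributions coincide and 0 means the comparison distribution is uninformative (uniform):

    # Illustrative redundancy score between adjacent classifiers.
    import numpy as np

    def cross_entropy(p, q, eps=1e-12):
        return -np.sum(p * np.log(q + eps))

    def redundancy(p_n, p_next):
        """p_n, p_next: 4-way orientation distributions from classifiers n and n+1."""
        h_cross = cross_entropy(p_n, p_next)             # H(p_n, p_{n+1})
        h_self = cross_entropy(p_n, p_n)                 # H(p_n): lower bound
        h_chance = cross_entropy(p_n, np.full(4, 0.25))  # uniform (chance) reference
        return (h_chance - h_cross) / (h_chance - h_self + 1e-12)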
Figure 9.
Human versus model performance for whole paintings and fragments. In (a), model performance from classifier-5 is plotted alongside human performance on whole paintings (dark versus light bars, respectively), grouped by style. In (b-c), model performance from different classifiers (1–5) is plotted alongside human performance on image fragments, separately for abstract (b) and figurative styles (c).
Figure 10.
Normalized frequency of incorrectly predicted orientations: across classifiers for the model, over all styles (a) and abstract styles only (b); across fragment sizes for humans, abstract styles only (c). eq. stands for equi-frequency. Examples are shown for three paintings: Argenteuil seen from the small arm of the Seine by Claude Monet (1872), Komposition VII by Wassily Kandinsky (1913), Komposition VIII by Wassily Kandinsky (1923).
Figure 11.
Density distribution of joint orientation choices generated by model and humans for individual abstract paintings, computed separately for different fragment-size/classifier combinations from small/early (a) to large/late (e). Diagonal values correspond to matching responses (humans and model generate the same response); the diagonal sum (indicated by large white digits) is therefore termed “mutual agreement.” Its value is z-scored against the null hypothesis that human and model choices are independent (see main text for clarification). Intensity of the white digits and thickness of the diagonal orange line scale with the corresponding z score. The bottom-left value reports agreement on the target orientation.
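One way to reproduce the z-scoring described here, under the stated null hypothesis that human and model choices are independent, is a permutation test over trials; this is a sketch of the approach, not necessarily the paper's exact computation.

    # Mutual agreement (fraction of trials where human and model pick the same
    # orientation) and its z score under an independence null, approximated by
    # permuting the model's choices across trials.
    import numpy as np

    def mutual_agreement(human, model):
        return np.mean(np.asarray(human) == np.asarray(model))

    def agreement_zscore(human, model, n_perm=10_000, seed=0):
        rng = np.random.default_rng(seed)
        observed = mutual_agreement(human, model)
        null = np.array([mutual_agreement(human, rng.permutation(model))
                         for _ in range(n_perm)])
        return (observed - null.mean()) / null.std()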
Figure 12.
Comparison between our model and the results reported by Mather (2012). (a) Average human and model performance. The original article reports mean human performance per painting; this quantity is not directly comparable to the model's top-1 accuracy, because the latter does not reflect the level of uncertainty for each painting. We therefore plot the raw prediction value for the correct orientation as the model metric against human performance (b).
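The distinction between the two model metrics can be made concrete with a toy example (the numbers are invented for illustration):

    # Top-1 accuracy collapses the categorical output to 0/1 per painting,
    # whereas the raw prediction for the correct orientation keeps uncertainty.
    import numpy as np

    probs = np.array([0.55, 0.20, 0.15, 0.10])  # model output for one painting
    target = 0                                  # index of the correct orientation
    top1 = float(np.argmax(probs) == target)    # 1.0 -- hides how confident the model was
    raw = probs[target]                         # 0.55 -- graded, comparable to mean human performance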
Figure 13.
Painting-by-painting human agreement with the network model (top), with the artists who painted the images used in our study (middle), and with other humans from our sample of participants (bottom). This analysis was restricted to abstract material.
