Sci Rep. 2024 Nov 2;14(1):26382. doi: 10.1038/s41598-024-76968-9.

Decoding viewer emotions in video ads


Alexey Antonov et al.

Abstract

Understanding and predicting viewers' emotional responses to videos has emerged as a pivotal challenge due to its multifaceted applications in video indexing, summarization, personalized content recommendation, and effective advertisement design. A major roadblock in this domain has been the lack of expansive datasets with videos paired with viewer-reported emotional annotations. We address this challenge by employing a deep learning methodology trained on a dataset derived from the application of System1's proprietary methodologies on over 30,000 real video advertisements, each annotated by an average of 75 viewers. This equates to over 2.3 million emotional annotations across eight distinct categories: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutral, coupled with the temporal onset of these emotions. Leveraging 5-second video clips, our approach aims to capture pronounced emotional responses. Our convolutional neural network, which integrates both video and audio data, predicts salient 5-second emotional clips with an average balanced accuracy of 43.6%, and shows particularly high performance for detecting happiness (55.8%) and sadness (60.2%). When applied to full advertisements, our model achieves a strong average AUC of 75% in determining emotional undertones. To facilitate further research, our trained networks are freely available upon request for research purposes. This work not only overcomes previous data limitations but also provides an accurate deep learning solution for video emotion understanding.
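The two headline metrics above (per-emotion balanced accuracy on 5-second clips and AUC on full-length ads) can be reproduced from model outputs in a few lines. The sketch below is illustrative only, assuming a hypothetical layout of binary labels and scores per emotion; it is not the authors' evaluation code.

# Hedged sketch: per-emotion balanced accuracy and AUC, assuming binary
# ground-truth labels and model scores for each of the eight emotion classes.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

EMOTIONS = ["anger", "contempt", "disgust", "fear",
            "happiness", "sadness", "surprise", "neutral"]

def per_emotion_metrics(y_true, y_score, threshold=0.5):
    """y_true, y_score: dicts mapping emotion -> 1-D arrays (hypothetical format)."""
    results = {}
    for emo in EMOTIONS:
        pred = (y_score[emo] >= threshold).astype(int)
        results[emo] = {
            "balanced_accuracy": balanced_accuracy_score(y_true[emo], pred),
            "auc": roc_auc_score(y_true[emo], y_score[emo]),
        }
    return results

# Toy usage with random placeholder data (real inputs would come from the model):
rng = np.random.default_rng(0)
y_true = {e: rng.integers(0, 2, size=200) for e in EMOTIONS}
y_score = {e: rng.random(200) for e in EMOTIONS}
print(per_emotion_metrics(y_true, y_score)["happiness"])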

Keywords: Deep learning; Emotion prediction; Video analytics.


Conflict of interest statement

S.S.K., W.H., and O.W. are employees of the company that provided the dataset for this study. Their roles were primarily focused on facilitating data access and providing input on data interpretation. However, they were not directly involved in developing the models and did not influence the study's results.

Figures

Fig. 1. Facial expressions used by System1 Group PLC's FaceTrace method during the video annotation process. (Source: System1 Group PLC, reproduced with permission.)
Fig. 2. System1's Test Your Ad: at each time point throughout a video clip, we can measure the proportion of viewers in the panel (approximately n = 75) who self-reported experiencing one of the eight emotions. This example illustrates how emotional profiles change within a video.
Fig. 3. TSAM model: the multi-modal CNN architecture takes as input a predefined number of video frames (video segments) and audio converted into a mel-spectrogram (audio segment). A ResNet50 backbone extracts features from both the video and audio segments. Features from video segments are shifted between one another at different blocks of ResNet50, whereas the mel-spectrogram is processed by the same backbone without shifting. The extracted features are fused by averaging and mapped to the output classes by a fully connected layer.
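As a companion to this caption, the following is a minimal, simplified sketch of the described fusion scheme in PyTorch. It is not the authors' released TSAM code: the temporal shift here is applied once to per-frame backbone features rather than inside the ResNet50 blocks as the caption states, and all shapes and hyperparameters are illustrative assumptions.

# Hedged sketch of the multi-modal fusion described in Fig. 3 (simplified).
import torch
import torch.nn as nn
from torchvision.models import resnet50

NUM_CLASSES = 8  # anger, contempt, disgust, fear, happiness, sadness, surprise, neutral

def temporal_shift(feats, shift_frac=0.125):
    """Shift a fraction of channels across the time axis.
    feats: (batch, time, channels). Simplified stand-in for the TSM-style shift."""
    b, t, c = feats.shape
    n = int(c * shift_frac)
    out = feats.clone()
    out[:, 1:, :n] = feats[:, :-1, :n]            # part of the channels shifted forward in time
    out[:, :-1, n:2 * n] = feats[:, 1:, n:2 * n]  # another part shifted backward
    return out

class TSAMSketch(nn.Module):
    def __init__(self, num_frames=16, num_classes=NUM_CLASSES):
        super().__init__()
        backbone = resnet50(weights=None)  # weights omitted here; the paper's best model is pretrained on INET21K
        backbone.fc = nn.Identity()        # keep the 2048-d pooled features
        self.backbone = backbone
        self.num_frames = num_frames
        self.head = nn.Linear(2048, num_classes)

    def forward(self, frames, mel):
        # frames: (B, T, 3, H, W) video segment; mel: (B, 3, H, W) mel-spectrogram image
        b, t = frames.shape[:2]
        v = self.backbone(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        v = temporal_shift(v)                   # shift features between video frames
        a = self.backbone(mel).unsqueeze(1)     # audio features, no shifting
        fused = torch.cat([v, a], dim=1).mean(dim=1)  # fuse by averaging
        return self.head(fused)                 # map to the eight emotion classes

# Toy usage with random tensors:
model = TSAMSketch()
logits = model(torch.randn(2, 16, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 8])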
Fig. 4. System1's Test Your Ad data: average number of user clicks per emotion per 30 seconds of video, adjusted to account for different video lengths.
Fig. 5. System1's Test Your Ad dataset: distribution of 5-second clips by the percentage of viewers expressing each emotion. The x-axis shows response strength as the percentage of viewers feeling the emotion; the y-axis shows the percentage of clips evoking that response level. Clips in the top 0.5% of the distribution (highlighted in red) define the emotional jumps, which were labeled for classifier training and testing.
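The top-0.5% rule in this caption translates directly into a thresholding step. The snippet below is a hedged sketch assuming a flat array of per-clip viewer percentages for a single emotion, a hypothetical data layout rather than the paper's pipeline.

# Hedged sketch: label a 5-second clip as an "emotional jump" for one emotion
# if its share of viewers reporting that emotion lies in the top 0.5% of the
# distribution over all clips.
import numpy as np

def label_emotional_jumps(viewer_share, top_fraction=0.005):
    """viewer_share: (num_clips,) percentages of viewers reporting the emotion."""
    threshold = np.quantile(viewer_share, 1.0 - top_fraction)
    return viewer_share >= threshold

# Toy usage with a skewed placeholder distribution:
shares = np.random.default_rng(1).beta(1, 20, size=100_000) * 100
labels = label_emotional_jumps(shares)
print(labels.sum(), "clips labeled as emotional jumps")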
Fig. 6. System1's Test Your Ad data: distribution of the number of emotion clicks per user per video, not adjusted for video duration.
Fig. 7. ROC curves for predicting the presence of emotion jumps in full-length video ads using our best CNN model (16 frames, RGB + audio input, pretrained on INET21K).
