AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention

Ha Thi Phuong Thao et al.
Sensors (Basel). 2021 Dec 14;21(24):8356. doi: 10.3390/s21248356

Abstract

In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques to combine the extracted features for predicting affect, yet they typically ignore both the correlation between inputs from multiple modalities and the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, AttendAffectNet (AAN), that uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions, expressed in terms of valence and arousal. We analyze three variants of the proposed AAN: the Feature AAN, the Temporal AAN, and the Mixed AAN. The Feature AAN applies self-attention to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie, thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strengths of the Feature AAN and the Temporal AAN by applying self-attention first to the feature vectors obtained from the different modalities within each movie segment and then to the resulting representations of all subsequent (temporal) movie segments. We extensively trained and validated the proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments show that audio features play a more influential role than features extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. Models that use the visual, audio, and text features simultaneously as inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of treating the different features as context for one another when fusing them. The Feature AAN also outperformed the baseline models when predicting the valence dimension.

Keywords: COGNIMUSE; MediaEval 2016; affective computing; computer vision; emotion prediction; multimodal fusion; neural networks; self-attention.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Overview of the proposed AttendAffectNet (AAN). Feature vectors are extracted from video, audio, and movie subtitles. We reduce their dimensionality before feeding them to the self-attention-based models, which predict the affective responses of movie viewers.
Figure 2
Our proposed Feature AttendAffectNet. For dimension reduction, the set of feature vectors V is fed to fully connected layers with eight neurons each (to obtain a set of dimension-reduced feature vectors V̂) before being passed through N identical layers, each consisting of multi-head self-attention followed by a feed-forward layer. The output of this stack is a set of encoded feature vectors Ṽ, which is fed to an average pooling layer, dropout, and a fully connected layer (consisting of one neuron) to obtain the predicted arousal/valence values.
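For readers who think in code, the following minimal PyTorch sketch illustrates a Feature-AAN-style model as described above: per-modality fully connected layers for dimension reduction, a stack of N identical self-attention/feed-forward layers, average pooling, dropout, and a single-neuron output head. It is an illustrative reconstruction, not the authors' implementation; the class name FeatureAANSketch, the head and layer counts, and the example feature dimensionalities for video, audio, and text are assumptions.

    # Illustrative sketch only (not the authors' code); hyperparameters are assumed.
    import torch
    import torch.nn as nn

    class FeatureAANSketch(nn.Module):
        def __init__(self, modality_dims, d_model=8, n_heads=2, n_layers=2, dropout=0.1):
            super().__init__()
            # One fully connected layer per modality reduces each feature vector to d_model (=8) dimensions.
            self.reducers = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
                dropout=dropout, batch_first=True)
            # N identical layers, each with multi-head self-attention and a feed-forward sub-layer.
            self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
            self.dropout = nn.Dropout(dropout)
            self.head = nn.Linear(d_model, 1)  # single neuron -> predicted arousal or valence

        def forward(self, features):
            # features: list of per-modality tensors, each of shape (batch, dim_m)
            reduced = [fc(x) for fc, x in zip(self.reducers, features)]  # each (batch, d_model)
            tokens = torch.stack(reduced, dim=1)   # (batch, n_modalities, d_model)
            encoded = self.encoder(tokens)         # self-attention across modalities
            pooled = encoded.mean(dim=1)           # average pooling over modalities
            return self.head(self.dropout(pooled)).squeeze(-1)

    # Hypothetical feature dimensionalities for the video, audio, and subtitle features.
    model = FeatureAANSketch(modality_dims=[2048, 1582, 768])
    video, audio, text = torch.randn(4, 2048), torch.randn(4, 1582), torch.randn(4, 768)
    print(model([video, audio, text]).shape)  # torch.Size([4])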
Figure 3
Our proposed Temporal AttendAffectNet. Feature vectors extracted from each movie part are passed to fully connected layers for dimension reduction before being combined into a single representation vector for that movie part. A positional encoding vector is added to this representation vector, which is then passed through N identical layers (each consisting of masked multi-head attention and multi-head attention followed by a feed-forward layer), followed by dropout and a fully connected layer consisting of only one neuron. We also add positional encoding vectors to the previous outputs before using them as an additional input to the model for predicting the subsequent output.
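One possible reading of the Temporal AAN, again as a hedged PyTorch sketch: the per-part representations (with positional encoding) pass through decoder-style layers whose masked multi-head attention enforces the sequential dependency, while the previous arousal/valence outputs (also with positional encoding) enter as the additional input through the cross-attention sub-layer. This mapping of the "additional input" onto cross-attention, as well as every name and hyperparameter below, is an assumption rather than the authors' implementation.

    # Illustrative sketch only; the exact role of the previous outputs is assumed.
    import math
    import torch
    import torch.nn as nn

    def positional_encoding(length, d_model):
        # Standard sinusoidal positional encoding, shape (length, d_model).
        pos = torch.arange(length).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    class TemporalAANSketch(nn.Module):
        def __init__(self, modality_dims, d_model=8, n_heads=2, n_layers=2, dropout=0.1):
            super().__init__()
            # Per-modality dimension reduction; the reduced vectors are concatenated per movie part.
            self.reducers = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
            seg_dim = d_model * len(modality_dims)
            layer = nn.TransformerDecoderLayer(
                d_model=seg_dim, nhead=n_heads, dim_feedforward=4 * seg_dim,
                dropout=dropout, batch_first=True)
            # N identical layers, each with masked multi-head attention, multi-head attention,
            # and a feed-forward sub-layer.
            self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
            self.prev_proj = nn.Linear(1, seg_dim)  # embeds previous arousal/valence outputs
            self.dropout = nn.Dropout(dropout)
            self.head = nn.Linear(seg_dim, 1)       # one neuron per movie part

        def forward(self, features, prev_outputs):
            # features: list of per-modality tensors, each (batch, n_parts, dim_m)
            # prev_outputs: shifted previous values (teacher forcing), (batch, n_parts, 1)
            parts = torch.cat([fc(x) for fc, x in zip(self.reducers, features)], dim=-1)
            pe = positional_encoding(parts.size(1), parts.size(-1)).to(parts.device)
            tgt = parts + pe                            # part representations + positional encoding
            memory = self.prev_proj(prev_outputs) + pe  # previous outputs + positional encoding
            causal = torch.triu(torch.full((parts.size(1), parts.size(1)), float("-inf")), diagonal=1)
            out = self.decoder(tgt, memory, tgt_mask=causal.to(parts.device))
            return self.head(self.dropout(out)).squeeze(-1)  # (batch, n_parts)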
Figure 4
Our proposed Mixed AttendAffectNet. Feature vectors extracted from each movie part are first fed to fully connected layers for dimension reduction before being passed through N identical layers, each consisting of multi-head attention followed by a feed-forward layer. We apply average pooling to the outputs of these identical layers to obtain one representation vector per movie part. We add positional encodings to these representation vectors, which are then fed to another set of N identical layers. These layers are similar to the previous ones, except that each of them includes one additional masked multi-head attention layer. This second set of N identical layers is followed by dropout and a fully connected layer. The previous outputs, together with their corresponding positional encodings, are used as an additional input to this model.
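The Mixed AAN stacks the two ideas. The sketch below, under the same assumptions as the previous two, first applies self-attention across the modalities within each movie part and average-pools the result into one representation vector per part, and then runs the masked temporal stage over the sequence of parts, with the previous outputs again treated as the additional (cross-attended) input. Names, dimensions, and hyperparameters are illustrative only.

    # Illustrative sketch only; not the authors' implementation.
    import math
    import torch
    import torch.nn as nn

    def positional_encoding(length, d_model):
        pos = torch.arange(length).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    class MixedAANSketch(nn.Module):
        def __init__(self, modality_dims, d_model=8, n_heads=2, n_layers=2, dropout=0.1):
            super().__init__()
            self.reducers = nn.ModuleList([nn.Linear(d, d_model) for d in modality_dims])
            enc_layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
                dropout=dropout, batch_first=True)
            # Stage 1: N identical layers of multi-head attention + feed-forward across modalities.
            self.modality_encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
            dec_layer = nn.TransformerDecoderLayer(
                d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
                dropout=dropout, batch_first=True)
            # Stage 2: N identical layers with an extra masked multi-head attention, across movie parts.
            self.temporal_decoder = nn.TransformerDecoder(dec_layer, num_layers=n_layers)
            self.prev_proj = nn.Linear(1, d_model)  # embeds previous outputs (additional input)
            self.dropout = nn.Dropout(dropout)
            self.head = nn.Linear(d_model, 1)

        def forward(self, features, prev_outputs):
            # features: list of per-modality tensors, each (batch, n_parts, dim_m)
            # prev_outputs: shifted previous values, (batch, n_parts, 1)
            b, t = features[0].shape[:2]
            tokens = torch.stack([fc(x) for fc, x in zip(self.reducers, features)], dim=2)  # (b, t, M, d)
            tokens = tokens.flatten(0, 1)                          # one modality sequence per movie part
            part_repr = self.modality_encoder(tokens).mean(dim=1)  # average pooling over modalities
            part_repr = part_repr.reshape(b, t, -1)
            pe = positional_encoding(t, part_repr.size(-1)).to(part_repr.device)
            tgt = part_repr + pe
            memory = self.prev_proj(prev_outputs) + pe
            causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1).to(part_repr.device)
            out = self.temporal_decoder(tgt, memory, tgt_mask=causal)
            return self.head(self.dropout(out)).squeeze(-1)        # (batch, n_parts)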
Figure 5
Both the ground truth and the predicted outputs of the Feature AAN model for the “Shakespeare in Love” movie clip are visualized: (a) for arousal and (b) for valence. Each time segment in the graphs corresponds to 5 s, which is also the length of each movie part.
Figure 6
Both the ground truth and the predicted outputs of the Feature AAN model for the “Ratatouille” movie clip are visualized: (a) for arousal and (b) for valence. Each time segment in the graphs corresponds to 5 s, which is also the duration of each movie part.

