An Entanglement-driven Fusion Neural Network for Video Sentiment Analysis

Dimitris Gkoumas, Qiuchi Li, Yijun Yu, Dawei Song

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 1736-1742. https://doi.org/10.24963/ijcai.2021/239

Video data is multimodal by nature: an utterance can involve linguistic, visual, and acoustic information. A key challenge for video sentiment analysis is therefore how to effectively combine different modalities for sentiment recognition. The latest neural network approaches achieve state-of-the-art performance, but they largely neglect how humans understand and reason about sentiment states. By contrast, recent advances in quantum probabilistic neural models have achieved performance comparable to the state-of-the-art, yet with greater transparency and a higher level of interpretability. However, existing quantum-inspired models treat quantum states either as a classical mixture or as a separable tensor product across modalities, without allowing the modalities to interact in a way that makes them correlated or non-separable (i.e., entangled). This means that the current models have not fully exploited the expressive power of quantum probabilities. To fill this gap, we propose a transparent quantum probabilistic neural model. The model induces different modalities to interact in such a way that they may not be separable, encoding crossmodal information in the form of non-classical correlations. Comprehensive evaluation on two benchmark datasets for video sentiment analysis shows that the model achieves significant performance improvements. We also show that the degree of non-separability between modalities enhances post-hoc interpretability.
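As a concrete illustration of the separable-versus-entangled distinction the abstract draws, the following NumPy sketch contrasts a plain tensor-product fusion of two modality states with a Bell-like entangled fusion, using the von Neumann entropy of the reduced density matrix as a measure of non-separability. This is not the authors' model: the two-dimensional text and visual vectors and the specific entangled state are hypothetical assumptions chosen only to make the concept tangible.

```python
# Minimal sketch (assumed example, not the paper's implementation):
# separable vs. entangled fusion of two unimodal quantum-like states.
import numpy as np

def von_neumann_entropy(rho: np.ndarray) -> float:
    """S(rho) = -tr(rho log2 rho); zero iff the marginal is pure,
    i.e. the joint state is separable across the two modalities."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros
    return float(-np.sum(eigvals * np.log2(eigvals)))

def reduced_density(joint: np.ndarray, dim_a: int, dim_b: int) -> np.ndarray:
    """Trace out modality B from a joint pure state |psi> in C^(dim_a*dim_b)."""
    psi = joint.reshape(dim_a, dim_b)
    return psi @ psi.conj().T  # rho_A = tr_B |psi><psi|

# Two hypothetical unit-norm unimodal states (e.g., linguistic and visual).
text = np.array([1.0, 0.0])
visual = np.array([0.6, 0.8])

# Separable fusion: a plain tensor product. Entanglement entropy is 0.
separable = np.kron(text, visual)
print(von_neumann_entropy(reduced_density(separable, 2, 2)))  # -> ~0.0

# Entangled fusion: a Bell-like superposition of cross-modal products
# that cannot be factored as text (x) visual. Entropy is maximal (1 bit).
entangled = (np.kron([1, 0], [1, 0]) + np.kron([0, 1], [0, 1])) / np.sqrt(2)
print(von_neumann_entropy(reduced_density(entangled, 2, 2)))  # -> ~1.0
```

In this reading, a model restricted to classical mixtures or tensor products can only ever produce joint states with zero entanglement entropy; an entanglement-driven fusion can occupy the non-separable region as well, which is the extra expressive power the abstract refers to.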
Keywords:
Humans and AI: Cognitive Modeling
Natural Language Processing: Embeddings