Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space

Zhao, Sicheng; Li, Yaxian; Yao, Xingxu; Nie, Weizhi; Xu, Pengfei; Yang, Jufeng; Keutzer, Kurt

arXiv:2009.05103 (cs)

[Submitted on 22 Aug 2020]

Abstract: Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music that convey similar emotions might make emotion perception more vivid and stronger. Existing emotion-based image and music matching methods either employ limited categorical emotion states, which cannot adequately reflect the complexity and subtlety of emotions, or train the matching model using an impractical multi-stage pipeline. In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space that preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and the task regression in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. Extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching compared with state-of-the-art approaches.
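The idea of learning an embedding space that preserves a continuous cross-modal similarity can be illustrated with a small sketch. This is not the CDCML formulation from the paper, which the abstract does not spell out. The exponential similarity over VA distance, the cosine similarity between embeddings, the squared-error loss, and all function names (`va_similarity`, `cosine`, `continuous_metric_loss`) are assumptions chosen only to make the concept concrete.

```python
import math

def va_similarity(va_a, va_b, scale=1.0):
    """Hypothetical target score in (0, 1]: decays with Euclidean distance in VA space."""
    dist = math.dist(va_a, va_b)  # distance between two (valence, arousal) points
    return math.exp(-dist / scale)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def continuous_metric_loss(image_embs, music_embs, image_vas, music_vas):
    """Mean squared gap between embedding similarity and the VA-based target,
    averaged over every image-music pair (an assumed loss, not the paper's)."""
    total, count = 0.0, 0
    for e_img, va_img in zip(image_embs, image_vas):
        for e_mus, va_mus in zip(music_embs, music_vas):
            target = va_similarity(va_img, va_mus)
            total += (cosine(e_img, e_mus) - target) ** 2
            count += 1
    return total / count
```

Under these assumptions, the loss is zero when the embedding similarity of each pair equals its VA-based target, and it grows when an image and a music clip with nearby VA labels are embedded far apart. In an end-to-end setting, a term of this kind would be minimized jointly with VA regression losses for each modality, as the abstract describes at a high level.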

Comments:	Accepted by ACM Multimedia 2020
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2009.05103 [cs.CV]
	(or arXiv:2009.05103v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2009.05103

Submission history

From: Sicheng Zhao
[v1] Sat, 22 Aug 2020 20:12:23 UTC (6,347 KB)
