Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings

Chern, I-Chun; Hung, Kuo-Hsuan; Chen, Yi-Ting; Hussain, Tassadaq; Gogate, Mandar; Hussain, Amir; Tsao, Yu; Hou, Jen-Cheng

arXiv:2210.17456 (eess)

[Submitted on 31 Oct 2022 (v1), last revised 1 Jun 2023 (this version, v3)]

Abstract: AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that multi-modal self-supervised embeddings can provide useful audio-visual speech representations. Nevertheless, it is unclear whether such representations generalize to real-world audio-visual (AV) regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we use the pre-trained AV-HuBERT model followed by a speech enhancement (SE) module for AVSE and AVSS. Comparative experimental results demonstrate that the proposed model outperforms both state-of-the-art AVSE models and traditional audio-only SE models. Our results also confirm the effectiveness of the proposed model for the AVSS task when proper fine-tuning strategies are applied, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT generalize to audio-visual regression tasks.

Comments:	ICASSP AMHAT 2023
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2210.17456 [eess.AS]
	(or arXiv:2210.17456v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2210.17456

Submission history

From: I-Chun Chern
[v1] Mon, 31 Oct 2022 16:30:10 UTC (2,840 KB)
[v2] Sat, 27 May 2023 01:53:54 UTC (6,213 KB)
[v3] Thu, 1 Jun 2023 03:43:34 UTC (6,214 KB)
