Cross-modal Supervision for Learning Active Speaker Detection in Video

Chakravarty, Punarjay; Tuytelaars, Tinne

arXiv:1603.08907 (cs)

[Submitted on 29 Mar 2016]

Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: the facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.
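
The cross-modal idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the energy-threshold VAD, the majority-vote smoothing used as a stand-in for temporal continuity, the logistic-regression classifier, the one-dimensional "motion" feature, and every function name below are assumptions made purely for illustration. The sketch also assumes a single person on screen, so that the audio label can be copied directly onto the visual sample.

```python
import math
import random


def energy_vad(frames, threshold):
    """Weak audio labels: 1 if a frame's mean energy exceeds threshold, else 0."""
    return [1 if sum(s * s for s in f) / len(f) > threshold else 0 for f in frames]


def smooth_labels(labels, window=5):
    """Majority vote over a sliding window (a crude form of temporal continuity)."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        seg = labels[max(0, i - half): i + half + 1]
        out.append(1 if 2 * sum(seg) > len(seg) else 0)
    return out


def train_logistic(features, labels, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent on (feature, label) pairs."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        gw = [0.0] * dim
        gb = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(w, b, x):
    """Return 1 (speaking) if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in data (hypothetical): the person speaks during steps 60..139.
    truth = [1 if 60 <= t < 140 else 0 for t in range(200)]
    audio = [[random.gauss(0, 1.0 if s else 0.1) for _ in range(16)] for s in truth]
    motion = [[(0.8 if s else 0.2) + random.gauss(0, 0.15)] for s in truth]

    # Audio supervises vision: VAD output is the only label source for training.
    weak = smooth_labels(energy_vad(audio, threshold=0.25))
    w, b = train_logistic(motion, weak)

    # At test time the classifier uses the visual feature alone.
    preds = smooth_labels([predict(w, b, x) for x in motion])
    acc = sum(p == t for p, t in zip(preds, truth)) / len(truth)
    print(f"agreement with synthetic ground truth: {acc:.2f}")
```

The point of the sketch is only the data flow: no manual annotation enters training, and the labels that do enter are noisy ones derived from the audio track.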

Comments:	16 pages
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1603.08907 [cs.CV]
	(or arXiv:1603.08907v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1603.08907

Submission history

From: Punarjay Chakravarty
[v1] Tue, 29 Mar 2016 19:47:46 UTC (5,420 KB)
