VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Fan, Yue; Ma, Xiaojian; Wu, Rujie; Du, Yuntao; Li, Jiaqi; Gao, Zhi; Li, Qing

arXiv:2403.11481 (cs)

[Submitted on 18 Mar 2024 (v1), last revised 15 Jul 2024 (this version, v2)]

Abstract: We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging problem of video understanding, in particular capturing long-term temporal relations in lengthy videos. The proposed multimodal agent, VideoAgent, 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video; and 2) given an input task query, employs tools, including video segment localization and object memory querying, along with other visual foundation models to solve the task interactively, using the zero-shot tool-use ability of LLMs. VideoAgent achieves strong performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
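
To make the two-part memory in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. Every name in it (`Segment`, `ObjectTrack`, `VideoMemory`, `localize_segments`, `query_objects`) is hypothetical, and plain word overlap stands in for whatever learned similarity a real system would use for segment localization.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """Temporal memory entry: a caption for one short clip (hypothetical schema)."""
    start_s: float
    end_s: float
    caption: str


@dataclass
class ObjectTrack:
    """Object memory entry: a tracked object and the segments it appears in."""
    object_id: int
    category: str
    segment_ids: List[int] = field(default_factory=list)


@dataclass
class VideoMemory:
    """Holds both memories; the two methods play the role of agent tools."""
    segments: List[Segment] = field(default_factory=list)
    objects: List[ObjectTrack] = field(default_factory=list)

    def localize_segments(self, query: str, top_k: int = 1) -> List[int]:
        # Assumption: word overlap replaces a learned text/video similarity.
        words = set(query.lower().split())
        scored = [
            (len(words & set(seg.caption.lower().split())), idx)
            for idx, seg in enumerate(self.segments)
        ]
        # Highest overlap first; ties broken by earlier segment index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [idx for _, idx in scored[:top_k]]

    def query_objects(self, category: str) -> List[ObjectTrack]:
        # Return every tracked object of the requested category.
        return [obj for obj in self.objects if obj.category == category]


# Toy memory with two captioned segments and two tracked objects.
memory = VideoMemory(
    segments=[
        Segment(0.0, 2.0, "a person opens the fridge"),
        Segment(2.0, 4.0, "a person pours milk into a cup"),
    ],
    objects=[
        ObjectTrack(0, "cup", [1]),
        ObjectTrack(1, "fridge", [0]),
    ],
)

print(memory.localize_segments("who pours milk"))
print([obj.object_id for obj in memory.query_objects("cup")])
```

In the agent the abstract describes, an LLM would decide which tool to call for a given query; here the tools are called directly to keep the sketch self-contained.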

Comments:	ECCV-24; Project page: this http URL; First two authors contributed equally
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2403.11481 [cs.CV]
	(or arXiv:2403.11481v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.11481

Submission history

From: Xiaojian Ma
[v1] Mon, 18 Mar 2024 05:07:59 UTC (6,993 KB)
[v2] Mon, 15 Jul 2024 09:54:30 UTC (6,994 KB)
