BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

Chen, Jin; Ma, Kaijing; Huang, Haojian; Shen, Jiayu; Fang, Han; Zang, Xianghao; Ban, Chao; He, Zhongjiang; Sun, Hao; Kang, Yanmei

arXiv:2410.02768 (cs)

[Submitted on 17 Sep 2024]

Abstract: The development of multi-modal models has been advancing rapidly, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive, and the available annotations are insufficient. Taking video question answering (VideoQA) as an example, human-annotated questions and answers often cover only part of the video, and similar semantics can be expressed through different textual forms, so the video is underutilized. To address this, we propose BoViLA, a self-training framework that augments question samples during training through LLM-based self-questioning and answering, which helps the model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. To filter out bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods, demonstrating its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available at this https URL.
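The abstract does not spell out how the EDL-based filter is computed, so the following is only a minimal sketch of the commonly used Dirichlet-based EDL formulation, not the authors' implementation. It assumes that answer logits are mapped to non-negative evidence with a softplus, that uncertainty is the number of classes divided by the total Dirichlet strength, and that a self-generated question is kept when its uncertainty falls below a threshold. The function names `edl_uncertainty` and `keep_question` and the threshold value are hypothetical.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def edl_uncertainty(logits):
    """Return (expected class probabilities, uncertainty) for one sample.

    Assumed formulation (standard EDL, not confirmed by the abstract):
    evidence_k = softplus(logit_k), alpha_k = evidence_k + 1,
    S = sum_k alpha_k, p_k = alpha_k / S, u = K / S.
    """
    evidence = [softplus(z) for z in logits]
    alpha = [e + 1.0 for e in evidence]   # Dirichlet parameters
    strength = sum(alpha)                 # total Dirichlet strength S
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength   # lies in (0, 1]
    return probs, uncertainty


def keep_question(logits, threshold: float = 0.5) -> bool:
    # Hypothetical filter rule: keep a self-generated question only if the
    # model's answer distribution is sufficiently certain.
    _, u = edl_uncertainty(logits)
    return u <= threshold


if __name__ == "__main__":
    confident = [10.0, -10.0, -10.0, -10.0]  # strong evidence for one answer
    ambiguous = [0.0, 0.0, 0.0, 0.0]         # little evidence for any answer
    print(edl_uncertainty(confident)[1], keep_question(confident))
    print(edl_uncertainty(ambiguous)[1], keep_question(ambiguous))
```

In this sketch, a sample with strong evidence for a single answer receives a lower uncertainty than one with near-zero evidence everywhere, so only the former passes the filter at the assumed threshold.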

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2410.02768 [cs.CV]
	(or arXiv:2410.02768v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.02768

Submission history

From: Jin Chen
[v1] Tue, 17 Sep 2024 05:17:37 UTC (706 KB)
