ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing

Zeng, Zequn; Zhang, Hao; Wang, Zhengjue; Lu, Ruiying; Wang, Dongsheng; Chen, Bo

Computer Science > Computer Vision and Pattern Recognition

arXiv:2303.02437 (cs)

[Submitted on 4 Mar 2023 (v1), last revised 9 Mar 2023 (this version, v2)]

Abstract: Zero-shot capability is regarded as a new revolution in deep learning, letting machines work on tasks without curated training data. As a good start and the only existing work on zero-shot image captioning (IC), ZeroCap abandons supervised training and sequentially searches for every word in the caption using the knowledge of large-scale pretrained models. Though effective, its autoregressive generation and gradient-directed search mechanism limit the diversity of captions and the inference speed, respectively. Moreover, ZeroCap does not address the controllability of zero-shot IC. To move forward, we propose a framework for Controllable Zero-shot IC, named ConZIC. The core of ConZIC is a novel sampling-based non-autoregressive language model named GibbsBERT, which can generate and continuously polish every word. Extensive quantitative and qualitative results demonstrate the superior performance of ConZIC for both zero-shot IC and controllable zero-shot IC. In particular, ConZIC achieves about 5x faster generation than ZeroCap and about 1.5x higher diversity scores, with accurate generation given different control signals.
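
The idea of generating and then repeatedly polishing every word can be illustrated with a toy Gibbs-style sampler. The sketch below is not the authors' GibbsBERT implementation. The vocabulary, the `BIGRAM` table (a hypothetical stand-in for a masked language model), the `IMAGE_WORDS` set (a hypothetical stand-in for an image-text matching score), and the `alpha` weight are all assumptions made for illustration. It only shows the general pattern: start from a random sequence, then resample each position from a distribution conditioned on all the other positions.

```python
import math
import random

# Hypothetical toy vocabulary (assumption, not from the paper).
VOCAB = ["a", "dog", "cat", "runs", "sleeps", "on", "grass", "sofa"]

# Hypothetical bigram compatibility scores standing in for a language model.
BIGRAM = {
    ("a", "dog"): 2.0, ("a", "cat"): 2.0,
    ("dog", "runs"): 2.0, ("cat", "sleeps"): 2.0,
    ("runs", "on"): 2.0, ("sleeps", "on"): 2.0,
    ("on", "grass"): 2.0, ("on", "sofa"): 2.0,
}

# Hypothetical set of words that "match the image" (stand-in for a matching score).
IMAGE_WORDS = {"dog", "runs", "grass"}


def conditional(tokens, i, alpha=2.0):
    """Return p(w | all other tokens) over VOCAB for position i."""
    left = tokens[i - 1] if i > 0 else None
    right = tokens[i + 1] if i < len(tokens) - 1 else None
    scores = []
    for w in VOCAB:
        s = 0.0
        if left is not None:
            s += BIGRAM.get((left, w), 0.0)   # fit with the left neighbour
        if right is not None:
            s += BIGRAM.get((w, right), 0.0)  # fit with the right neighbour
        if w in IMAGE_WORDS:
            s += alpha                        # bonus for matching the "image"
        scores.append(s)
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def gibbs_polish(length=5, sweeps=10, seed=0):
    """Start from random tokens, then resample every position in repeated sweeps."""
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB) for _ in range(length)]
    for _ in range(sweeps):
        for i in range(length):
            probs = conditional(tokens, i)
            tokens[i] = rng.choices(VOCAB, weights=probs, k=1)[0]
    return tokens


if __name__ == "__main__":
    print(" ".join(gibbs_polish()))
```

Because every position is conditioned on both its left and right neighbours, any word can be revised after it is first produced, and different seeds yield different sequences. In this toy setting, those two properties correspond to the polishing and diversity described in the abstract.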

Comments:	Accepted by CVPR2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2303.02437 [cs.CV]
	(or arXiv:2303.02437v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2303.02437

Submission history

From: Zequn Zeng
[v1] Sat, 4 Mar 2023 14:59:25 UTC (15,151 KB)
[v2] Thu, 9 Mar 2023 13:03:37 UTC (15,152 KB)
