CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog

Kottur, Satwik; Moura, José M. F.; Parikh, Devi; Batra, Dhruv; Rohrbach, Marcus

Computer Science > Computer Vision and Pattern Recognition

arXiv:1903.03166 (cs)

[Submitted on 7 Mar 2019 (v1), last revised 18 Sep 2019 (this version, v2)]

Abstract: Visual Dialog is a multimodal task of answering a sequence of questions grounded in an image, using the conversation history as context. It entails challenges in vision, language, reasoning, and grounding. However, studying these subtasks in isolation on large, real datasets is infeasible, as it requires prohibitively expensive complete annotation of the 'state' of all images and dialogs.
We develop CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs.
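
To make the construction concrete, here is a minimal Python sketch of how a dialog grammar grounded in a scene graph can yield fully annotated dialogs. It is not the authors' grammar: the `Obj` class, the `generate_dialog` function, and its caption and question templates are hypothetical simplifications. Because every answer is read directly off the scene graph, each turn can also record which object it refers to. The last lines check the dataset size stated above: 5 dialogs × 10 rounds × 85k images = 4.25M question-answer pairs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical CLEVR-style scene-graph node; real CLEVR objects also carry
# material, 3D coordinates, and spatial relations.
@dataclass(frozen=True)
class Obj:
    color: str
    shape: str
    size: str

# Attributes this toy grammar asks about after the caption (an assumption).
QUESTION_ATTRS = ("shape", "size")

def generate_dialog(scene: List[Obj], target: int,
                    rounds: int) -> Tuple[str, List[Dict[str, object]]]:
    """Toy grammar: the caption introduces scene[target] by its color, and every
    later round asks about one of its attributes through the pronoun 'it'."""
    obj = scene[target]
    caption = f"There is a {obj.color} thing."
    turns = []
    for r in range(rounds):
        attr = QUESTION_ATTRS[r % len(QUESTION_ATTRS)]
        turns.append({
            "round": r + 1,                  # round 0 is the caption
            "question": f"What {attr} is it?",
            "answer": getattr(obj, attr),    # ground truth read off the scene graph
            "referent": target,              # annotated coreference target
        })
    return caption, turns

# Usage on a two-object toy scene.
scene = [Obj("red", "cube", "large"), Obj("blue", "sphere", "small")]
caption, turns = generate_dialog(scene, target=0, rounds=3)
print(caption)  # There is a red thing.
for t in turns:
    print(t["question"], t["answer"])

# Dataset size arithmetic from the abstract ("about 85k" images).
DIALOGS_PER_IMAGE = 5
ROUNDS_PER_DIALOG = 10
NUM_IMAGES = 85_000
total_qa_pairs = DIALOGS_PER_IMAGE * ROUNDS_PER_DIALOG * NUM_IMAGES  # 4,250,000
```

Because the referent of every pronoun is known by construction, no human annotation is needed to label coreference.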
We use CLEVR-Dialog to benchmark the performance of standard visual dialog models, in particular on visual coreference resolution as a function of the coreference distance. This is the first analysis of its kind for visual dialog models, and it was not possible without this dataset. We hope the findings from CLEVR-Dialog will help inform the development of future models for visual dialog. Our dataset and code are publicly available.
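
The coreference analysis can be illustrated with a small Python sketch that buckets per-question correctness by coreference distance. Here we assume coreference distance is the number of rounds between a question and the most recent earlier mention of its referent. The function name `accuracy_by_coref_distance` and the `(question_round, last_mention_round, is_correct)` record format are hypothetical, and the sample records are invented toy values, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_coref_distance(
        records: Iterable[Tuple[int, int, bool]]) -> Dict[int, float]:
    """Group predictions by coreference distance and return accuracy per bucket.

    Each record is (question_round, last_mention_round, is_correct); the distance
    is assumed to be question_round - last_mention_round.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for question_round, last_mention_round, is_correct in records:
        distance = question_round - last_mention_round
        totals[distance] += 1
        hits[distance] += int(is_correct)
    # Sort buckets so the result reads from short to long distances.
    return {d: hits[d] / totals[d] for d in sorted(totals)}

# Toy records (invented for illustration, not model outputs).
toy_records = [(1, 0, True), (2, 1, True), (3, 0, True), (4, 1, False)]
curve = accuracy_by_coref_distance(toy_records)
print(curve)  # {1: 1.0, 3: 0.5}
```

Plotting such a curve shows whether a model's accuracy drops when the referent was last mentioned many rounds earlier.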

Comments:	13 pages, 11 figures, 3 tables, accepted as a short paper at NAACL 2019
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:1903.03166 [cs.CV]
	(or arXiv:1903.03166v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1903.03166

Submission history

From: Satwik Kottur
[v1] Thu, 7 Mar 2019 20:18:39 UTC (8,386 KB)
[v2] Wed, 18 Sep 2019 18:04:43 UTC (6,758 KB)
