Point and Ask: Incorporating Pointing into Visual Question Answering

Mani, Arjun; Yoo, Nobline; Hinthorn, Will; Russakovsky, Olga

arXiv:2011.13681 (cs)

[Submitted on 27 Nov 2020 (v1), last revised 18 Feb 2022 (this version, v4)]

Abstract: Visual Question Answering (VQA) has become one of the key benchmarks of visual recognition progress. Multiple VQA extensions have been explored to better simulate real-world settings: different question formulations, changing training and test distributions, conversational consistency in dialogues, and explanation-based answering. In this work, we further expand this space by considering visual questions that include a spatial point of reference. Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region.
Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges. There are two key distinctions from prior work. First, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the spatial reference. Second, we explicitly explore the more realistic point spatial input rather than the standard but unnatural bounding box input. Through our exploration we uncover and address several visual recognition challenges, including the ability to infer human intent, reason both locally and globally about the image, and effectively combine visual, language and spatial inputs. Code is available at: this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2011.13681 [cs.CV]
	(or arXiv:2011.13681v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2011.13681

Submission history

From: Arjun Mani
[v1] Fri, 27 Nov 2020 11:43:45 UTC (17,310 KB)
[v2] Wed, 16 Jun 2021 16:54:24 UTC (19,615 KB)
[v3] Thu, 17 Jun 2021 06:33:25 UTC (19,615 KB)
[v4] Fri, 18 Feb 2022 05:50:50 UTC (19,616 KB)
