Lenna: Language Enhanced Reasoning Detection Assistant

Wei, Fei; Zhang, Xinyu; Zhang, Ailing; Zhang, Bo; Chu, Xiangxiang

Computer Science > Computer Vision and Pattern Recognition

arXiv:2312.02433 (cs)

[Submitted on 5 Dec 2023]

Abstract: With the fast-paced development of multimodal large language models (MLLMs), we can now converse with AI systems in natural language to understand images. However, the reasoning power and world knowledge embedded in the large language models have been much less investigated and exploited for image perception tasks. In this paper, we propose Lenna, a language-enhanced reasoning detection assistant, which utilizes the robust multimodal feature representation of MLLMs, while preserving location information for detection. This is achieved by incorporating an additional <DET> token in the MLLM vocabulary that is free of explicit semantic context but serves as a prompt for the detector to identify the corresponding position. To evaluate the reasoning capability of Lenna, we construct a ReasonDet dataset to measure its performance on reasoning-based detection. Remarkably, Lenna demonstrates outstanding performance on ReasonDet and comes with significantly low training costs. It also incurs minimal transferring overhead when extended to other tasks. Our code and model will be available at this https URL.
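
The idea of adding a <DET> token to the vocabulary can be illustrated with a toy sketch. This is not Lenna's implementation: the vocabulary, the embedding size, the helper names (add_special_token, det_positions), and the choice to read the vector at the <DET> position and pass it on as a detector prompt are all assumptions made for illustration. The abstract only states that the token is added to the MLLM vocabulary and serves as a prompt for the detector.

```python
import random

# Toy vocabulary standing in for an MLLM tokenizer (hypothetical, not Lenna's).
vocab = {"<pad>": 0, "find": 1, "the": 2, "dog": 3}
EMBED_DIM = 4  # illustrative size only

random.seed(0)
# One embedding row per token id.
embeddings = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab]


def add_special_token(token, vocab, embeddings):
    """Register a new token and append a freshly initialised embedding row."""
    if token in vocab:
        return vocab[token]
    token_id = len(vocab)
    vocab[token] = token_id
    embeddings.append([random.uniform(-1, 1) for _ in range(EMBED_DIM)])
    return token_id


def det_positions(token_ids, det_id):
    """Return the sequence positions that hold the <DET> token."""
    return [i for i, t in enumerate(token_ids) if t == det_id]


det_id = add_special_token("<DET>", vocab, embeddings)

# A toy output sequence: "the dog <DET>".
output_ids = [vocab["the"], vocab["dog"], det_id]

# Stand-in for the model's hidden states (here simply the embedding rows).
hidden_states = [embeddings[t] for t in output_ids]

# Assumption: the vector at each <DET> position is what a detector would
# receive as its prompt for locating the referred object.
det_prompts = [hidden_states[i] for i in det_positions(output_ids, det_id)]
```

In this sketch the new token carries no prior meaning (its embedding is randomly initialised), which loosely mirrors the abstract's description of a token "free of explicit semantic context".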

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.02433 [cs.CV]
	(or arXiv:2312.02433v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.02433

Submission history

From: Bo Zhang
[v1] Tue, 5 Dec 2023 02:19:35 UTC (4,149 KB)
