LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Sun, Siqi; Chen, Yen-Chun; Li, Linjie; Wang, Shuohang; Fang, Yuwei; Liu, Jingjing

arXiv:2103.08784 (cs)

[Submitted on 16 Mar 2021 (v1), last revised 11 Apr 2021 (this version, v2)]

Abstract: Multimodal pre-training has propelled great advancement in vision-and-language (V+L) research. These large-scale pre-trained models, although successful, inevitably suffer from slow inference due to the enormous computation cost, which comes mainly from cross-modal attention in the Transformer architecture. In real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which was widely studied even before the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that speeds up ITR inference by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up the retrieval process. In fact, LightningDOT achieves a new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume 1000x the computational hours. Code and pre-training checkpoints are available at this https URL.
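
The retrieval scheme the abstract describes (embeddings cached offline, dot-product matching, then re-ranking of a short candidate list) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional embeddings, and the `toy_rerank` scores are all hypothetical, and the re-ranker here is only a stand-in for whatever slower, more accurate model would be used in practice.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(
    query: Vector,
    index: List[Vector],
    k: int,
    rerank_score: Callable[[int], float],
    m: int,
) -> List[int]:
    """Return the ids of the top-k items for `query`.

    Stage 1: score every cached item embedding with a dot product and keep
             the m best candidates (cheap, no cross-modal attention).
    Stage 2: re-score only those m candidates with `rerank_score`
             (a stand-in for a slower model) and return the best k.
    """
    coarse = sorted(
        range(len(index)),
        key=lambda i: dot(query, index[i]),
        reverse=True,
    )[:m]
    fine = sorted(coarse, key=rerank_score, reverse=True)
    return fine[:k]


# Toy data (assumed): three image embeddings computed "offline"
# and one text query embedding.
image_index = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
text_query = [0.9, 0.1]

# Hypothetical re-ranker scores keyed by image id, made up for illustration.
toy_rerank = {0: 0.2, 1: 0.7, 2: 0.1}

top = retrieve(
    text_query, image_index, k=1, rerank_score=lambda i: toy_rerank[i], m=2
)
```

In this toy run the dot product shortlists images 0 and 1, and the re-ranker then promotes image 1 to the top. The point of the two-stage design is that the expensive scorer only ever sees `m` candidates rather than the whole index.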

Comments:	NAACL 2021
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2103.08784 [cs.CL]
	(or arXiv:2103.08784v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2103.08784

Submission history

From: Yen-Chun Chen
[v1] Tue, 16 Mar 2021 00:35:28 UTC (30,716 KB)
[v2] Sun, 11 Apr 2021 21:53:08 UTC (30,840 KB)
