Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Daxberger, Erik; Weers, Floris; Zhang, Bowen; Gunter, Tom; Pang, Ruoming; Eichner, Marcin; Emmersberger, Michael; Yang, Yinfei; Toshev, Alexander; Du, Xianzhi

arXiv:2309.04354 (cs)

[Submitted on 8 Sep 2023]

Abstract: Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by activating only a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in tremendous successes across domains such as natural language processing and computer vision. In this work, we instead explore the use of sparse MoEs to scale down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications. To this end, we propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts. We also propose a stable MoE training procedure that uses super-class information to guide the router. We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs. For example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only 54M FLOPs inference cost, our MoE achieves an improvement of 4.66%.
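
To make the per-image routing idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the router scores experts from the mean of the patch tokens, that each expert is a two-layer ReLU MLP, and that each super-class maps to exactly one expert for the auxiliary router loss. All names (per_image_moe, router_loss, router_w, expert_w1, expert_w2) are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_image_moe(tokens, router_w, expert_w1, expert_w2, k=1):
    """Route each whole image (not each patch) to its top-k experts.

    tokens:    (batch, num_patches, dim)
    router_w:  (dim, num_experts)
    expert_w1: (num_experts, dim, hidden)
    expert_w2: (num_experts, hidden, dim)
    """
    # Assumption: the router sees one pooled vector per image.
    pooled = tokens.mean(axis=1)                 # (batch, dim)
    gates = softmax(pooled @ router_w)           # (batch, num_experts)
    topk = np.argsort(-gates, axis=-1)[:, :k]    # (batch, k)

    out = np.zeros_like(tokens)
    for b in range(tokens.shape[0]):
        # Every patch of image b goes through the same k experts.
        for e in topk[b]:
            hidden = np.maximum(tokens[b] @ expert_w1[e], 0.0)
            out[b] += gates[b, e] * (hidden @ expert_w2[e])
    return out, topk, gates


def router_loss(gates, superclass_ids, eps=1e-9):
    """Cross-entropy pushing the router toward the expert of each super-class.

    Assumption: super-class i is assigned to expert i.
    """
    picked = gates[np.arange(gates.shape[0]), superclass_ids]
    return float(-np.log(picked + eps).mean())
```

Because the routing decision is made once per image, only k expert MLPs run for that image regardless of the number of patches, which is the source of the inference savings that sparse activation is meant to deliver in this sketch.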

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2309.04354 [cs.CV]
	(or arXiv:2309.04354v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.04354

Submission history

From: Erik Daxberger
[v1] Fri, 8 Sep 2023 14:24:10 UTC (1,222 KB)
