Are Large Kernels Better Teachers than Transformers for ConvNets?

Huang, Tianjin; Yin, Lu; Zhang, Zhenyu; Shen, Li; Fang, Meng; Pechenizkiy, Mykola; Wang, Zhangyang; Liu, Shiwei

Abstract:This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets. While Transformers have led state-of-the-art (SOTA) performance in various fields with ever-larger models and labeled data, small-kernel ConvNets are considered more suitable for resource-limited applications due to the efficient convolution operation and compact weight sharing. KD is widely used to boost the performance of small-kernel ConvNets. However, previous research shows that it is not quite effective to distill knowledge (e.g., global information) from Transformers to small-kernel ConvNets, presumably due to their disparate architectures. We hereby carry out a first-of-its-kind study unveiling that modern large-kernel ConvNets, a compelling competitor to Vision Transformers, are remarkably more effective teachers for small-kernel ConvNets, due to more similar architectures. Our findings are backed up by extensive experiments on both logit-level and feature-level KD ``out of the box", with no dedicated architectural nor training recipe modifications. Notably, we obtain the \textbf{best-ever pure ConvNet} under 30M parameters with \textbf{83.1\%} top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. We also find that beneficial characteristics of large-kernel ConvNets, e.g., larger effective receptive fields, can be seamlessly transferred to students through this large-to-small kernel distillation. Code is available at: \url{this https URL}.

Comments:	Accepted by ICML 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2305.19412 [cs.CV]
	(or arXiv:2305.19412v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.19412
Journal reference:	ICML 2023

Computer Science > Computer Vision and Pattern Recognition

Title:Are Large Kernels Better Teachers than Transformers for ConvNets?

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators