Abstract
Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds, or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated from coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network that uses the reference and aligned images to obtain sub-pixel correspondences, which in turn yield the final geometric transformation. Extensive evaluation shows our method to be significantly better than state-of-the-art methods and baselines, with good generalization ability and visually plausible results even on unseen real data.
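To make the coarse-to-fine pipeline concrete, below is a minimal illustrative sketch using classical OpenCV tools rather than the paper's learned modules. Here, `coarse_src` and `coarse_dst` stand in for the coarse correspondences that the transformer stage would produce, and `refine_correspondences` is a hypothetical placeholder for the learned sub-pixel refinement network.

```python
# Illustrative sketch only (not the authors' network): estimate an initial
# homography from coarse matches, pre-align the template, refine matches on
# the aligned pair, then fit the final homography.
import cv2
import numpy as np

def coarse_to_fine_homography(template, image, coarse_src, coarse_dst,
                              refine_correspondences):
    # Initial warp from coarse correspondences; RANSAC rejects outliers.
    H_init, _ = cv2.findHomography(coarse_src, coarse_dst, cv2.RANSAC, 5.0)

    # Pre-align the template so the refinement stage sees a roughly
    # registered pair and only needs to recover small residual offsets.
    h, w = image.shape[:2]
    aligned = cv2.warpPerspective(template, H_init, (w, h))

    # Hypothetical stand-in for the learned refinement network: returns
    # sub-pixel matches between the aligned template and the image.
    fine_src, fine_dst = refine_correspondences(aligned, image)

    # Map refined points from aligned-template coordinates back to the
    # original template via the inverse initial warp, then fit the final
    # homography on the refined matches.
    fine_src_tpl = cv2.perspectiveTransform(
        fine_src.reshape(-1, 1, 2).astype(np.float64),
        np.linalg.inv(H_init)).reshape(-1, 2)
    H_final, _ = cv2.findHomography(fine_src_tpl, fine_dst, cv2.RANSAC, 3.0)
    return H_final
```

The design point this sketch captures is that refinement operates on a pre-aligned pair, so it only needs to recover small residual displacements; the refined matches are then mapped back through the initial warp before the final fit.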
Availability of data and materials
The well-known COCO dataset is available from https://cocodataset.org/. Our two industrial datasets can be freely downloaded from https://drive.google.com/drive/folders/1Mu9QdnM5WsLccFp0Ygf7ES7mLV-64wRL?usp=sharing. Our code and video demos are available at https://github.com/zhirui-gao/Deep-Template-Matching.
Acknowledgements
We thank Lintao Zheng and Jun Li for their help with dataset preparation and for helpful discussions.
Funding
This work was supported in part by the National Key R&D Program of China (2018AAA0102200) and the National Natural Science Foundation of China (62002375, 62002376, 62325221, 62132021).
Author information
Contributions
Zhirui Gao: Methodology, Writing Draft, Visualization, Results Analysis; Renjiao Yi: Methodology, Supervision, Writing Draft, Results Analysis; Zheng Qin: Supervision, Results Analysis; Yunfan Ye: Supervision, Results Analysis; Chenyang Zhu: Methodology, Supervision; Kai Xu: Methodology, Supervision.
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article. The author Kai Xu is the Area Executive Editor of this journal.
Additional information
Zhirui Gao received his B.E. degree in computer science and technology from the China University of Geosciences, Wuhan, in 2021. He is now a master's student at the National University of Defense Technology (NUDT). His research interests include image matching and 3D vision.
Renjiao Yi is an assistant professor in the School of Computing, NUDT. She is interested in 3D vision problems such as inverse rendering and image-based relighting.
Zheng Qin received his B.E. and M.E. degrees in computer science and technology from NUDT in 2016 and 2018, respectively, where he is currently pursuing a Ph.D. degree. His research interests focus on 3D vision, including point cloud registration, pose estimation, and 3D representation learning.
Yunfan Ye is a Ph.D. candidate in the School of Computing, NUDT. His research interests include computer vision and graphics.
Chenyang Zhu is an assistant professor in the School of Computing, NUDT. His current directions of interest include data-driven shape analysis and modeling, 3D vision, robot perception, and robot navigation.
Kai Xu is a professor in the School of Computing, NUDT, where he received his Ph.D. degree in 2011. He serves on the editorial boards of ACM Transactions on Graphics, Computer Graphics Forum, and Computers & Graphics, among others.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gao, Z., Yi, R., Qin, Z. et al. Learning accurate template matching with differentiable coarse-to-fine correspondence refinement. Comp. Visual Media 10, 309–330 (2024). https://doi.org/10.1007/s41095-023-0333-9