MOWA: Multiple-in-One Image Warping Model

Kang Liao, Zongsheng Yue, Zhonghua Wu, and Chen Change Loy

K. Liao, Z. Yue, Z. Wu, and C. C. Loy are with the S-Lab, Nanyang Technological University (NTU), Singapore (e-mail: [email protected], [email protected], [email protected], and [email protected]). C. C. Loy is the corresponding author.

Abstract

While recent image warping approaches have achieved remarkable success on existing benchmarks, they still require training separate models for each specific task and cannot generalize well to different camera models or customized manipulations. To address diverse types of warping in practice, we propose a Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To further enable dynamic task-aware image warping, we introduce a lightweight point-based classifier that predicts the task type, serving as prompts to modulate the feature maps for more accurate estimation. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model. Extensive experiments demonstrate that our MOWA, which is trained on six tasks for multiple-in-one image warping, outperforms state-of-the-art task-specific models across most tasks. Moreover, MOWA also exhibits promising potential to generalize to unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code and more visual results can be found on the project page: https://kangliao929.github.io/projects/mowa/.

Index Terms:
Image Warping, Multiple-in-One Model, Prompt Learning.
Figure 1: MOWA is devised to address a variety of practical image warping tasks within a single framework, particularly in computational photography, where six distinct types of distortions are considered in this study. It also demonstrates an ability to generalize to novel scenarios, as evidenced in both cross-domain (unfamiliar domains) and zero-shot (unseen tasks) evaluations. The approach notably identifies and uses region-level and pixel-level fields, highlighted by red boxes, to accurately warp input images.

I Introduction

Image warping is essential in computational imaging and computer vision, serving as the foundation for numerous applications, including image rectification [1, 2, 3, 4], image rectangling [5, 6, 7, 8], camera calibration [9, 10, 11, 12, 13], and 3D reconstruction [14, 15, 16, 17]. By enabling the manipulation of image data through operations such as scaling, rotation, and shearing, it allows for the seamless integration of diverse visual elements and the correction of optical imperfections. Moreover, image warping is indispensable in developing augmented reality (AR) and virtual reality (VR) applications [18, 19, 20], where it helps create immersive and realistic environments by accurately mapping textures and images onto 3D models.

Because inputs are derived from different camera models or manipulation spaces, recent works integrate specific prior knowledge into their models to address the corresponding image warping tasks [6, 5, 21, 4, 22, 23, 24, 25, 26]. While these single-task approaches achieve significant progress, we find that they suffer from two main limitations: (i) a lack of generalization and flexibility, which restricts their real-world applications since users must manually identify each input type and apply the appropriate single-task model, a process that is time-consuming and difficult for non-professional users; and (ii) substantial storage requirements for multiple task-specific models, which are impractical for some resource-limited platforms. Thus, it is crucial to develop a holistic framework capable of efficiently warping images from various camera models or manipulation spaces. Furthermore, many image warping tasks involve shared processes, such as motion estimation and content-aware perception. This indicates the possibility of developing a unified framework that incorporates these common techniques.

In this work, we propose a Multiple-in-One image WArping model (named MOWA) to address various tasks in practice, as shown in Fig. 1. Specifically, we consider six representative types in the field of computational photography, namely stitched images, rectified wide-angle images, unrolling shutter images, rotated images, fisheye images, and portrait photos, covering the mainstream practical image warping tasks.

Because learning different motion structures in one model is non-trivial and motion representations differ significantly across tasks, we propose to disentangle the motion estimation at both the region level and pixel level. In this hierarchical architecture, we first estimate the control points of the thin-plate spline (TPS) model [27] with a progressively increasing number of points, during which the feature maps are progressively warped and rectified. Such a representation excels in approximating complex motions at the region level and offers high flexibility for various motion structures. Subsequently, the warped feature maps are fed into the decoder to predict the residual pixel-level displacement, which further improves the warping results for each task, especially at image boundaries and in fine details.

To enable MOWA to explicitly discriminate diverse input types, a lightweight point-based classifier is devised. Adding an extra classification network based on the image features is a straightforward solution but brings high computation and storage costs. Noticing that the motion structures in different warping tasks follow their own specific distributions, we leverage an intermediate product of the image warping framework, i.e., the region-level control points, to directly learn the task type. It achieves comparable performance while allowing a significant parameter reduction compared to the image-based classifier, since only a few 2D points are needed. Then, the task label predicted by this point-based classifier is used to modulate the feature maps in the decoder, dynamically boosting task-aware image warping through a prompt learning module. Prompts are a set of learnable parameters that encapsulate essential discriminative information about different types of input, which empowers a single model to efficiently traverse and harness its vast parameter space to accommodate various warping requirements.

In the experiments, we train MOWA on six typical tasks for multiple-in-one image warping. Experimental results demonstrate that it outperforms state-of-the-art (SotA) task-specific models in most tasks, even with a comparable number of network parameters. In addition, MOWA can generalize to unseen scenes, as evidenced by cross-domain evaluation (unfamiliar domains) and zero-shot evaluation (unseen tasks), indicating its robustness and adaptability across various scenarios. Our contributions can be summarized as follows:

  • We propose MOWA, which is the first practical multiple-in-one image warping framework. Despite its affordable model size, the proposed model outperforms most SotA methods.

  • We propose to mitigate the difficulty of multi-task learning by decoupling the motion estimation at both the region level and pixel level. Moreover, a prompt learning module, guided by a lightweight point-based classifier, is designed to facilitate task-aware image warping.

  • We show that through multi-task learning, our framework develops a robust generalized warping strategy that gains improved performance across various tasks and even generalizes to unseen tasks.

The remainder of the paper is organized as follows: Section II reviews the related works. We then present the proposed multiple-in-one image warping framework in Section III. The experiments are provided in Section IV. Section V concludes this paper.

II Related Work

Image warping is the process of manipulating an image to change its shape or alignment. This transformation is achieved by applying a spatial mapping function to the coordinates of the original image, resulting in a new image with altered geometry. In computational photography, image warping is a key technique for enhancing and manipulating images beyond traditional photography limits. This technique enables the creation of panoramic images [28, 29, 30], the correction of lens distortions [31, 32, 9], and the synthesis of novel views [33, 34, 35]. In the past few decades, warping techniques have significantly contributed to the development of advanced imaging applications beyond those mentioned above, offering greater flexibility and creativity. However, such manipulations can twist the image boundaries, leading to visually unpleasant layouts and negative effects on downstream vision tasks. In practical scenarios, most users favor rectangular boundaries due to their compatibility with standard display formats, facilitating ease of sharing, printing, and publication [5, 24]. Therefore, researchers have developed diverse image rectangling methods to warp the image boundaries to be straight [5, 6, 21, 7, 8, 36]. Most of them follow the principle of content-aware image warping to avoid large distortion of the original content distribution when rectangling the image. Besides, different motion representations, such as the mesh [5, 6] and control points [21], are also exploited to formulate the warping process.

Apart from customized manipulations, some special camera models can introduce geometric distortion into the captured images, e.g., radial distortion, rolling shutter distortion, and perspective distortion. Due to these distortions, the semantic content of the images deviates significantly from real-world geometry. To address this issue, distortion correction approaches [31, 32, 37, 3, 38, 39, 40, 25, 41, 26, 4] have been explored to warp the distorted input into a geometrically reasonable one. In particular, regression-based methods [32, 40] learn the camera and distortion parameters from the input image and correct the distortion by simulating the imaging process of a predefined camera model. In contrast, reconstruction-based methods [31, 37, 3, 38, 25, 42] directly learn the pixel-wise displacement between the distorted image and its ground truth, facilitating model-free correction and enabling end-to-end training.

The above works achieve remarkable progress on various tasks by studying well-designed network architectures and tailored motion representations. However, they need to train an individual model for each specific warping type and require prior knowledge of the camera model or customized manipulations. In this work, we propose a multiple-in-one framework that covers these typical and practical image warping tasks. We address the challenge of learning different motion structures within a single model by employing a coarse-to-fine approach, progressively adding more TPS points to accurately fit the expected geometric distribution. To compensate for the limited degrees of freedom of TPS, our method further learns a residual flow based on the warped feature map, allowing for fine-tuning of image boundaries and details in the final results.

Figure 2: Overview of the proposed multiple-in-one image warping model (MOWA). It begins by taking an image and a mask as input to estimate the TPS control points with progressively refined precision. During such a region-level motion estimation, feature maps are incrementally warped and rectified. These warped features are then passed to the decoder to predict residual pixel-level motion. To ensure task awareness and expandability, a lightweight point-based classifier and a prompt learning module are designed. During inference, MOWA supports image warping for any resolution by scaling the predicted TPS control points and residual flow.

III Multiple-in-One Warping Model

III-A Problem Definition

In this study, we consider six representative and practical image types in the field of computational photography, including stitched images, rectified wide-angle images, unrolling shutter images, rotated images, fisheye images, and portrait photos, covering the mainstream practical image warping tasks. These types are further classified into two groups. The first four types (stitched images, rectified wide-angle images, unrolling shutter images, and rotated images) suffer from irregular boundaries because the original images are manipulated by customized operations, such as image stitching, distortion correction, and rotation. Therefore, image rectangling is used to reshape these irregular boundaries while keeping the distribution of content unchanged. The last two types (fisheye images and portrait photos) exhibit inherent geometric distortions introduced by special camera models, such as radial distortion in fisheye images and perspective distortion in portrait photos. Correcting these distortions is crucial to scene understanding and aesthetic appreciation.

Figure 2 shows the overall framework of the proposed MOWA. It takes the image and mask as input and estimates the TPS control points with a progressively increasing number of points. During this region-level motion estimation, the feature maps are progressively warped and rectified. Subsequently, the warped features are fed into the decoder to predict a residual pixel-level motion. To enable task-aware and expandable capabilities, a lightweight point-based classifier and a prompt learning module are designed. We elaborate on the details of each module in the proposed multiple-in-one image warping framework as follows.

Figure 3: Motion structures in different tasks possess their specific distribution, which potentially exists in a 2D point space. Discriminating these motion structures as a classification task can also help the image warping performance as exhibited in visual comparisons.

III-B Motion Estimation Module

Learning multiple warping types in one model is challenging since the network needs to balance the different complexities of multiple motion types during motion estimation. Furthermore, the model's scalability would be restricted if the motion representation were hand-crafted for specific tasks. Hence, we propose a flexible and hierarchical architecture for general image warping in MOWA. As shown in Fig. 2, the motion estimation is disentangled at both the region level, where the number of TPS control points progressively increases, and the pixel level, where a residual map is predicted to further compensate for the estimated TPS flow.

Region-Level Motion Estimation. The TPS transformation [27] stands out for its remarkable ability to model complex motions [43, 44, 45]. It performs image warping based on two sets of region-level control points, namely $\bm{Q}=[\bm{q}_{1},\bm{q}_{2},\cdots,\bm{q}_{N}]^{T}\in\mathbb{R}^{N\times 2}$ for the source image and $\bm{Q}^{\prime}=[\bm{q}^{\prime}_{1},\bm{q}^{\prime}_{2},\cdots,\bm{q}^{\prime}_{N}]^{T}\in\mathbb{R}^{N\times 2}$ for the target image. To minimize the distortion between the source and target images, an energy term is introduced to penalize the Euclidean distance between the transformed source points $\mathcal{T}(\bm{q}_{i})$ and the target points $\bm{q}^{\prime}_{i}$, i.e., $\sum_{i=1}^{N}\lVert\mathcal{T}(\bm{q}_{i})-\bm{q}^{\prime}_{i}\rVert_{2}^{2}$. This penalty results in a spatial deformation function parameterized by the control points, effectively capturing the intricate deformations across the image and maintaining the overall structural integrity. Specifically, the derived spatial deformation function can be expressed as follows:

$$\mathcal{T}(\bm{q})=\bm{A}\begin{bmatrix}\bm{q}\\ 1\end{bmatrix}+\sum_{i=1}^{N}U\left(\left\lVert\bm{q}^{\prime}_{i}-\bm{q}\right\rVert_{2}\right)\bm{w}_{i}, \tag{1}$$

where $\bm{q}$ represents a point located in the source image, $\bm{A}\in\mathbb{R}^{2\times 3}$ and $\bm{w}_{i}\in\mathbb{R}^{2}$ are the transformation parameters, and $U(\cdot)$ is a radial basis function that quantifies the influence of each control point; more details can be found in [27]. Notably, this function plays a key role in determining the deformation induced by each control point, thereby shaping the overall transformation.
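As a rough illustration of Eq. (1), the following NumPy sketch fits and evaluates a TPS mapping. It is not the authors' implementation: the kernel U(r) = r^2 log r^2 is a common choice assumed here (the text defers the definition to [27]), and the names `tps_kernel`, `fit_tps`, and `apply_tps` are hypothetical. The sketch places the kernels at a generic set of `centers` and solves for the affine part A and the weights w_i so that each center is mapped exactly onto its paired point, with the usual side conditions that keep the weights orthogonal to affine functions.

```python
import numpy as np

def tps_kernel(r):
    """Assumed radial basis U(r) = r^2 * log(r^2), with U(0) = 0."""
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz] ** 2)
    return out

def fit_tps(centers, values):
    """Solve for A (2x3) and w (Nx2) such that T(centers[i]) == values[i]."""
    n = centers.shape[0]
    K = tps_kernel(np.linalg.norm(centers[:, None] - centers[None], axis=-1))
    P = np.hstack([centers, np.ones((n, 1))])           # rows are [x, y, 1]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    rhs = np.vstack([values, np.zeros((3, 2))])         # side conditions P^T w = 0
    sol = np.linalg.solve(L, rhs)
    return sol[n:].T, sol[:n]                           # A, w

def apply_tps(points, centers, A, w):
    """Evaluate T(q) = A [q; 1] + sum_i U(||c_i - q||) w_i for each row of points."""
    U = tps_kernel(np.linalg.norm(points[:, None] - centers[None], axis=-1))
    P = np.hstack([points, np.ones((len(points), 1))])
    return P @ A.T + U @ w

# Usage: a 3x3 grid of control points with small random displacements.
g = np.linspace(-1.0, 1.0, 3)
centers = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
values = centers + 0.05 * np.random.default_rng(0).standard_normal(centers.shape)
A, w = fit_tps(centers, values)
warped = apply_tps(centers, centers, A, w)              # reproduces `values`
```

When the paired points differ from the centers only by an affine map, the weights vanish and the affine term alone explains the motion, which is why the radial term can be read as the non-rigid part of the deformation.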

Motion estimation acts as the fundamental stage in image warping, presenting particular challenges in the context of multiple-in-one task learning. To enhance the capability of our model in motion estimation, we design a progressive motion estimation module. More specifically, this module cascades a sequence of TPS transformation heads that gradually increase the number of control points. The control points predicted by the preceding head are upsampled and integrated into the prediction of the next head. Subsequently, these control points are arranged to generate a mesh. We then adopt the TPS transformation to warp this mesh, aiming to align it with the regular mesh defined on the ground truth image. In the implementation, since cascading fully connected layers would introduce significant computation and storage costs, we use one or two convolution layers to predict the control points after each TPS transformation head. The pipeline of cascaded TPS transformation heads can be expressed by:

$$\bm{q}^{(t)}=h^{(t)}\left[\mathcal{R}\left(\mathcal{F}^{(t-1)},\bm{q}^{(t-1)}\right)\right]+\text{UP}\left[\bm{q}^{(t-1)}\right], \tag{2}$$

where $h^{(t)}$ is the $t$-th TPS transformation head, and $\mathcal{F}^{(t-1)}$ and $\bm{q}^{(t-1)}$ are the feature map and control points of the $(t-1)$-th head, respectively. $\mathcal{R}(\cdot,\cdot)$ represents the warping operation for feature maps given control points, and $\text{UP}[\cdot]$ is a customized upsampling layer for control points.
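The coarse-to-fine update in Eq. (2) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: control points are stored as an (n, n, 2) grid, the customized upsampling layer UP is approximated by plain bilinear resampling, and each TPS head is replaced by a hypothetical callable `head(q_prev, m)` that returns a residual grid (in the real model the head would consume the warped feature map). The grid sizes used below are arbitrary.

```python
import numpy as np

def regular_grid(n):
    """Regular (n, n, 2) control-point grid on [-1, 1]^2."""
    g = np.linspace(-1.0, 1.0, n)
    return np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)

def upsample_points(q, m):
    """Bilinearly resample an (n, n, 2) grid of points to (m, m, 2); needs n >= 2."""
    n = q.shape[0]
    src = np.linspace(0.0, n - 1.0, m)                  # fractional source indices
    i0 = np.clip(np.floor(src).astype(int), 0, n - 2)
    f = src - i0
    rows = q[i0] * (1 - f)[:, None, None] + q[i0 + 1] * f[:, None, None]
    return rows[:, i0] * (1 - f)[None, :, None] + rows[:, i0 + 1] * f[None, :, None]

def cascade(heads, sizes, q0):
    """q_t = head_t(...) + UP[q_{t-1}], applied once per head."""
    q = q0
    for head, m in zip(heads, sizes):
        q = head(q, m) + upsample_points(q, m)
    return q

# Usage with stand-in heads that predict a zero residual.
zero_head = lambda q_prev, m: np.zeros((m, m, 2))
q_final = cascade([zero_head, zero_head], sizes=[3, 5], q0=regular_grid(2))
```

With zero residuals, the cascade simply refines the initial regular grid, so each head only has to learn a correction on top of the upsampled estimate from the previous stage.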

Pixel-Level Motion Estimation. While the TPS transformation is flexible and adaptable to various tasks, it is limited in its ability to describe detailed motions due to its restricted degrees of freedom. To alleviate this limitation, we further complement the region-level motion representation with a pixel-level residual flow. Specifically, we first rectify the feature map $\mathcal{F}^{(T)}$ using the corresponding control points $\bm{q}^{(T)}$ in the last transformation head, and then feed the rectified feature map into a decoder network $\mathcal{D}[\cdot]$ to predict the residual flow. As in common U-Net architectures, the shallow features in the encoder are passed to the decoder through skip connections. To avoid the blur caused by repeated warping (each warp involves an interpolation operation), we densify the TPS control points to the pixel level and combine them with the residual flow to directly warp the input image $\mathcal{I}$. The final warping result $\mathcal{I}^{\prime}$ can be obtained by:

$$\mathcal{I}^{\prime}=\mathcal{W}\left(\mathcal{D}\left[\mathcal{R}\left(\mathcal{F}^{(T)},\bm{q}^{(T)}\right)\right]+\text{DE}\left[\bm{q}^{(T)}\right],\mathcal{I}\right), \tag{3}$$

where $\mathcal{W}(\cdot,\cdot)$ denotes the warping operation given the flow map and input image, and $\text{DE}[\cdot]$ densifies the sparse control points to a dense flow map, which can be regarded as a special case of the TPS upsampling layer $\text{UP}[\cdot]$. Unlike previous works tailored for specific warping tasks, our method unifies motion representation across various tasks at both the region and pixel levels. The experimental results are presented in Section IV-C.
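To illustrate why Eq. (3) warps the input only once, the sketch below sums a dense TPS flow and a residual flow and applies a single bilinear resampling. It assumes a backward flow of (dx, dy) pixel offsets and clamps samples to the image border; these conventions, and the function names `warp_image` and `final_warp`, are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def warp_image(image, flow):
    """Backward-warp an (H, W, C) image with an (H, W, 2) flow of (dx, dy) offsets."""
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)            # sampling positions
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx = (x - x0)[..., None]
    fy = (y - y0)[..., None]
    top = image[y0, x0] * (1 - fx) + image[y0, x0 + 1] * fx
    bottom = image[y0 + 1, x0] * (1 - fx) + image[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def final_warp(image, tps_flow, residual_flow):
    """Combine region-level and pixel-level motion, then resample once."""
    return warp_image(image, tps_flow + residual_flow)

# Usage: two half-pixel horizontal flows combine into one exact one-pixel shift.
image = np.random.default_rng(0).random((6, 7, 3))
half = np.zeros((6, 7, 2))
half[..., 0] = 0.5
shifted = final_warp(image, half, half)
```

Applying the two half-pixel flows in sequence would interpolate twice and smooth the image, whereas summing them first yields a single resampling step.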

III-C Point-based Task Classifier

When learning various image warping tasks simultaneously, a task classifier is crucial for efficiently routing inputs to their respective task-specific components, optimizing resource use, and enhancing model performance. It is straightforward to design a task classifier that operates on the input image. However, such a design inevitably brings high computational complexity due to the redundant image features. Instead, we propose a lightweight task classifier based on the TPS points predicted by the motion estimation module. Our motivation stems from the fact that the motion structures in different tasks follow their own specific distributions, which potentially exist in a point space, as shown in Fig. 3. To this end, we design a PointNet-like network [46, 47] to predict the task type. Specifically, as shown in Fig. 2, it takes as input the local motion coordinates concatenated point-wise, along the last dimension, with the global image features (after max pooling), and outputs the soft task label $\bm{\Phi}$. We can formulate this point-based task classifier as follows:

$$\mathcal{F}_{g}=f\left[\text{MaxPool}(\mathcal{F})\right], \tag{4}$$
$$\bm{\Phi}=\text{Softmax}\left(f^{\prime}\left[\bm{q}^{(T)}\oplus\text{R}\left(\mathcal{F}_{g},[1,H\times W]\right)\right]\right), \tag{5}$$

where $f$ and $f^{\prime}$ are fully connected layers that decrease the feature dimensions and learn abstract concepts, $\text{R}$ denotes replicating $\mathcal{F}_{g}$ to the same shape as the predicted motion coordinates $\bm{q}^{(T)}$, and $\oplus$ is the point-wise concatenation. Experiments demonstrate that our point-based task classifier achieves comparable results with roughly $50\times$ fewer parameters than the image-based classifier.
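A minimal NumPy sketch of Eqs. (4) and (5) is given below to make the tensor shapes concrete. Several details are assumptions that the text does not specify: the weights are random placeholders, $f$ and $f^{\prime}$ are reduced to single linear layers with ReLU, and a max over points is inserted before the final linear layer (in the spirit of PointNet) so that one label is produced per image. The name `point_classifier` and all layer sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def point_classifier(feat, q, W_f, W_pt, W_cls):
    """feat: (C, H, W) image features; q: (N, 2) control points.

    W_f: (d, C), W_pt: (2 + d, k), W_cls: (k, num_tasks).
    """
    g = feat.max(axis=(1, 2))                            # global max pooling, (C,)
    F_g = np.maximum(W_f @ g, 0.0)                       # Eq. (4): f[MaxPool(F)], (d,)
    tiled = np.tile(F_g, (len(q), 1))                    # R: replicate per point
    x = np.concatenate([q, tiled], axis=1)               # point-wise concat, (N, 2 + d)
    per_point = np.maximum(x @ W_pt, 0.0)                # shared per-point layer
    pooled = per_point.max(axis=0)                       # assumed pooling over points
    return softmax(pooled @ W_cls)                       # Eq. (5): soft task label

# Usage with random placeholder weights and six tasks.
rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))
q = rng.standard_normal((64, 2))
W_f = rng.standard_normal((8, 16))
W_pt = rng.standard_normal((10, 32))
W_cls = rng.standard_normal((32, 6))
phi = point_classifier(feat, q, W_f, W_pt, W_cls)
```

Because the only spatially dense input is pooled to a single vector and the rest of the network acts on a handful of 2D points, the parameter count stays small compared with a classifier that processes full image feature maps.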

In addition to the task classification function, our point-based task classifier can further improve image warping performance. This improvement is due to the high-level guidance provided by the task classifier to the motion estimation module through gradient back-propagation. Figure 3 (right) depicts a typical example, in which the fisheye rectification result shows a less distorted shape, and the predicted control points of the stitched image are more tightly aligned to the image boundary. Compared to the vanilla baseline, the proposed point-based task classifier achieves an average improvement of $+0.35$ in PSNR across various image warping tasks. More quantitative results are presented in Section IV-C.

III-D Prompt Learning Module

Once the inputs are classified by the proposed point-based network, we leverage the predicted task label to modulate the feature maps in the network. In particular, a prompt learning block is inserted into each layer in the decoder as a plug-and-play module. Prompt learning aims to tackle the challenge of generalizing in various image warping tasks by aiding the network in comprehending the specific task at hand. The prompts serve as a flexible and lightweight component to encode motion context across multiple scales within the image warping network.

Assuming the task number is $N$, we introduce a set of learnable parameters as our prompts, namely $\{\bm{P}_{i}\}_{i=1}^{N}$. Denoting the task label predicted by the task classifier as $\bm{\Phi}$, we modulate the feature maps $\mathcal{F}$ in the decoder network by the prompts as follows:

$$\mathcal{F}^{m}=\text{Conv}_{1\times 1}\Big(\mathcal{F}\oplus\sum_{i=1}^{N}\Phi_{i}\bm{P}_{i}\Big), \tag{6}$$

where $\oplus$ represents the concatenation operation, and $\text{Conv}_{1\times 1}$ is a convolution layer with a $1\times 1$ kernel that reduces the channel dimension of the concatenated features.
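The modulation in Eq. (6) can be sketched in NumPy as follows. The sketch assumes that each prompt has the same spatial size as the decoder feature map and that the 1×1 convolution is a per-pixel linear map over channels; the function name `modulate`, the random weights, and all tensor sizes are hypothetical choices for illustration only.

```python
import numpy as np

def modulate(feat, prompts, phi, conv_w, conv_b):
    """Blend prompts by the soft task label and fuse them with the features.

    feat: (C, H, W); prompts: (N, Cp, H, W); phi: (N,)
    conv_w: (C_out, C + Cp); conv_b: (C_out,)
    """
    prompt = np.tensordot(phi, prompts, axes=1)          # sum_i Phi_i * P_i, (Cp, H, W)
    stacked = np.concatenate([feat, prompt], axis=0)     # channel-wise concat
    fused = np.einsum("oc,chw->ohw", conv_w, stacked)    # 1x1 convolution
    return fused + conv_b[:, None, None]

# Usage: six tasks, with a one-hot label selecting the third prompt.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
prompts = rng.standard_normal((6, 3, 8, 8))
phi = np.eye(6)[2]
conv_w = rng.standard_normal((4, 7))
conv_b = rng.standard_normal(4)
out = modulate(feat, prompts, phi, conv_w, conv_b)
```

With a soft label, the weighted sum interpolates between prompts, so an input that the classifier finds ambiguous receives a blend of task-specific context instead of a hard switch.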

By integrating the learnable prompts with the features of the warping model, we can significantly enrich the representations with task-specific knowledge. Unlike pre-defined and fixed prompts, our adaptive approach enables the network to dynamically adjust its behavior, resulting in more efficient and precise image warping. This adaptive process not only enhances the flexibility of the model but also improves its ability to generalize across different tasks and datasets. Further analysis of multi-task learning and the effectiveness of the proposed prompt learning is provided in Section IV-D.

III-E Training Loss

After predicting the TPS control points and the residual flow, the warped image can be obtained by Eq. (3). Following previous works [6, 21], we first exploit three losses to train our multiple-in-one image warping framework, namely the image reconstruction loss $\mathcal{L}_{Rec}$, the perceptual loss $\mathcal{L}_{Per}$, and the inter-grid loss $\mathcal{L}_{Grid}$. The reconstruction loss and perceptual loss supervise the warped image at the pixel level and feature level, respectively. The inter-grid loss constrains the edges of two consecutive deformed grids $\{\vec{e}_{t1},\vec{e}_{t2}\}$ to be collinear:

$$\mathcal{L}_{Grid}=\frac{1}{M}\sum_{\{\vec{e}_{t1},\vec{e}_{t2}\}\in m}\left(1-\frac{\langle\vec{e}_{t1},\vec{e}_{t2}\rangle}{\lVert\vec{e}_{t1}\rVert\cdot\lVert\vec{e}_{t2}\rVert}\right). \tag{7}$$

Here, $M$ represents the number of tuples of two successive edges in a mesh $m$. When the cosine term above is maximized, the corresponding two edges become collinear and the loss reaches its minimum, which encourages the image content to remain consistent.
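The inter-grid loss of Eq. (7) can be sketched in NumPy as follows. The sketch assumes that the mesh is stored as an (n, n, 2) array of vertices and that "successive edges" means neighboring edges along the same row or column; the small `eps` added to the denominator and the function name `inter_grid_loss` are illustrative additions, not part of the stated formula.

```python
import numpy as np

def inter_grid_loss(mesh, eps=1e-8):
    """Mean of (1 - cosine) over pairs of successive edges; mesh is (n, n, 2), n >= 3."""
    terms = []
    for axis in (0, 1):
        edges = np.diff(mesh, axis=axis)                 # edge vectors along this axis
        if axis == 0:
            e1, e2 = edges[:-1], edges[1:]
        else:
            e1, e2 = edges[:, :-1], edges[:, 1:]
        dot = (e1 * e2).sum(axis=-1)
        norms = np.linalg.norm(e1, axis=-1) * np.linalg.norm(e2, axis=-1)
        terms.append((1.0 - dot / (norms + eps)).ravel())
    return np.concatenate(terms).mean()

# Usage: a regular mesh has collinear successive edges; a perturbed one does not.
g = np.linspace(-1.0, 1.0, 4)
regular = np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)
perturbed = regular + 0.1 * np.random.default_rng(0).standard_normal(regular.shape)
loss_regular = inter_grid_loss(regular)
loss_perturbed = inter_grid_loss(perturbed)
```

The loss depends only on edge directions, not lengths, so it penalizes bending of grid lines while leaving the mesh free to stretch along them.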

Considering that the ground truth of the warping flow is available in the training dataset of portrait photos, we also add a reconstruction loss $\mathcal{L}_{Flow}$ on the predicted flow for the portrait correction task. Moreover, we provide intermediate supervision on the warped results from the TPS prediction heads with a set of exponentially growing weights. To train the point-based task classifier, the standard cross-entropy loss $\mathcal{L}_{Cls}$ is applied. Overall, the final loss can be expressed as:

$$\mathcal{L} = \underbrace{\mathcal{L}_{Rec} + \mathcal{L}_{Per} + \mathcal{L}_{Grid} + \lambda_{Flow} \mathcal{L}_{Flow}}_{Image\ Warping} + \underbrace{\lambda_{Cls} \mathcal{L}_{Cls}}_{Task\ Classifier}, \tag{8}$$

where $\lambda_{Flow}$ and $\lambda_{Cls}$ are hyper-parameters that balance the different losses; both are empirically set to $0.1$.
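As a small worked illustration of Eq. (8), the sketch below combines already-computed scalar loss values with the stated weights of 0.1. It is an assumption-laden sketch, not the released training code: the function names are hypothetical, and the base `gamma` of the exponentially growing weights for the intermediate TPS heads is not given in the text, so the value used here is purely illustrative.

```python
def total_loss(l_rec, l_per, l_grid, l_flow, l_cls,
               lambda_flow=0.1, lambda_cls=0.1):
    """Combine scalar loss values following Eq. (8).

    Pass l_flow=0.0 for tasks without ground-truth flow (the text only
    applies the flow loss to the portrait correction task).
    """
    image_warping = l_rec + l_per + l_grid + lambda_flow * l_flow
    task_classifier = lambda_cls * l_cls
    return image_warping + task_classifier


def head_weights(num_heads=4, gamma=0.5):
    """Exponentially growing weights for intermediate TPS heads.

    `gamma` is a hypothetical base (not specified in the text); the
    last head receives weight 1.0 and earlier heads receive less.
    """
    return [gamma ** (num_heads - 1 - i) for i in range(num_heads)]
```

With these defaults, later heads, which predict denser control points, contribute more to the supervision than earlier ones.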

In summary, the proposed multiple-in-one image warping framework brings the following benefits.

  • Unlike previous task-specific image warping models, our method can recover various geometrically distorted images within a single network. It does not require prior knowledge of the camera models or manipulation spaces; it is also easy to use and relies only on the observed input image to perform the customized image warping.

  • Our method provides greater flexibility and cost-effectiveness in real-world scenarios, unlike previous methods that need a proportionally larger model size as the number of warping tasks increases.

  • Thanks to multi-task learning, our method develops a generalized motion representation across various image warping tasks, demonstrating remarkable performance in cross-domain evaluations and unseen tasks.

IV Experiments

To demonstrate the effectiveness of the proposed multiple-in-one image warping method, we evaluate its performance on six representative distorted types, including stitched images, rectified wide-angle images, unrolling shutter images, rotated images, fisheye images, and portrait photos, covering the mainstream practical image warping tasks.

IV-A Experimental Settings

Implementation Details. We train the proposed model using the Adam optimizer with momentum terms of $(0.9, 0.999)$ on 8 NVIDIA A100 GPUs. The learning rate starts with a linear warm-up in the first three epochs and then decays from $1 \times 10^{-4}$ to $1 \times 10^{-6}$ following a cosine schedule in the remaining epochs. The batch size is set to 64. The complete framework is trained with a fixed input size of $256 \times 256$. In the first 10 epochs, we solely train and supervise the TPS prediction heads with the point-based task classifier. Afterwards, all modules are trained jointly. During inference, the proposed method supports image warping at any resolution by scaling the predicted TPS control points and residual flow.
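The learning-rate schedule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the total number of epochs is not given in the text and is left as a parameter, the rate is assumed to be updated once per epoch, and the warm-up is assumed to ramp linearly up to the peak value. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch, total_epochs, warmup_epochs=3,
                  base_lr=1e-4, min_lr=1e-6):
    """Linear warm-up followed by cosine decay (epoch is 0-indexed)."""
    if epoch < warmup_epochs:
        # Ramp linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the decay phase completed, in [0, 1].
    span = max(1, total_epochs - warmup_epochs - 1)
    progress = (epoch - warmup_epochs) / span
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Example with a hypothetical 100-epoch run.
schedule = [learning_rate(e, total_epochs=100) for e in range(100)]
```

In this sketch the rate peaks at the end of the warm-up and reaches the floor value in the final epoch.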

Network Configuration. We design the image warping network based on the encoder-decoder architecture, enabling both region-level control point regression and pixel-level residual flow prediction. Specifically, Transformer blocks with shifted windows [48, 49] are used in both the encoder and decoder, except for the input projection layer and output projection layer. The basic channel dimension is set to 32 and increases linearly along the layers of the encoder network, and it correspondingly decreases to 2 in the decoder network. Moreover, the depth of each Transformer block is set to 2 and the head numbers of multi-head self-attention are $[1, 2, 4, 8, 16, 16, 8, 4, 2]$ across the layers. In the TPS prediction heads, we adopt convolutional layers with different kernels to predict increasing numbers of control points, and the numbers are set to $10 \times 10$, $12 \times 12$, $14 \times 14$, and $16 \times 16$. The configuration details of these regression heads are listed in Table I. Such a design enables a significant parameter reduction compared with fully connected layers. For the lightweight point-based classifier, three 1D convolutional layers with channel dimensions of $256, 256, 512$ are used to extract the features of the input, and then three fully connected layers with unit numbers of $512, 256, 6$ are used to classify the task type.

TABLE I: The configurations of convolutional layers to regress different sizes of control points (the size of input feature maps is $16 \times 16$).

| Configuration | $8 \times 8$ | $10 \times 10$ | $12 \times 12$ | $14 \times 14$ | $16 \times 16$ |
| Kernel Size | $3 \times 3$ | {$5 \times 5$, $3 \times 3$} | $5 \times 5$ | $3 \times 3$ | $3 \times 3$ |
| Stride | 2 | {1, 1} | 1 | 1 | 1 |
| Padding | 1 | {0, 0} | 0 | 0 | 1 |
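The configurations in Table I can be checked with the standard convolution output-size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$. The sketch below applies it to a $16 \times 16$ input feature map; the dictionary layout and function names are illustrative only.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Target control-point grid size -> list of (kernel, stride, padding),
# transcribed from Table I.
HEADS = {
    8: [(3, 2, 1)],
    10: [(5, 1, 0), (3, 1, 0)],
    12: [(5, 1, 0)],
    14: [(3, 1, 0)],
    16: [(3, 1, 1)],
}


def head_output_size(layers, input_size=16):
    """Apply the stacked convolutions of one head to the input size."""
    size = input_size
    for kernel, stride, padding in layers:
        size = conv_output_size(size, kernel, stride, padding)
    return size


sizes = {target: head_output_size(layers) for target, layers in HEADS.items()}
```

Each head maps the $16 \times 16$ feature map directly to its target grid size, which is how the heads avoid fully connected layers.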
Figure 4: Qualitative comparison of our multiple-in-one framework MOWA to the SotA image warping models. The red dotted lines mark the horizon. The arrows highlight the inferior warped parts such as the irregular boundaries and distorted semantics.
TABLE II: Quantitative evaluation of the proposed multiple-in-one framework to the SotA image warping models.

| Input Type | Methods | PSNR ↑ | SSIM ↑ | ShapeAcc ↑ | Parameters |
| Rotated Image | He et al. [5] | 17.63 | 0.4880 | - | - |
| Rotated Image | Deep_Rect [6] | 19.89 | 0.5500 | - | 52.14M |
| Rotated Image | Ours | 21.01 | 0.5961 | - | 49.93M |
| Rectified Wide-Angle Image | He et al. [5] | 15.36 | 0.4211 | - | - |
| Rectified Wide-Angle Image | RecRecNet [21] | 18.68 | 0.5450 | - | 62.70M |
| Rectified Wide-Angle Image | Ours | 18.69 | 0.5450 | - | 49.93M |
| Stitched Image | He et al. [5] | 14.70 | 0.3775 | - | - |
| Stitched Image | Deep_Rect [6] | 21.28 | 0.7140 | - | 52.14M |
| Stitched Image | Ours | 20.72 | 0.6425 | - | 49.93M |
| Unrolling Shutter Image | RecRecNet [21] | 21.48 | 0.7602 | - | 62.70M |
| Unrolling Shutter Image | Ours | 21.69 | 0.7795 | - | 49.93M |
| Fisheye Image | PCN [37] | 21.37 | 0.6925 | - | 26.19M |
| Fisheye Image | Feng et al. [3] | 21.72 | 0.7167 | - | 11.65M |
| Fisheye Image | Ours | 22.25 | 0.7488 | - | 49.93M |
| Portrait Photos | Shih et al. [41] | - | - | 97.253 | - |
| Portrait Photos | Tan et al. [25] | - | - | 97.490 | - |
| Portrait Photos | Zhu et al. [26] | - | - | 97.491 | 8.79M |
| Portrait Photos | Ours | - | - | 97.477 | 49.93M |

Datasets. We use the public benchmarks from recent SotA works, including the image rectangling datasets [6, 50, 21] and the distortion correction datasets [37, 25, 26]. Since there is no available training dataset for unrolling shutter image rectangling, we use the rolling shutter correction dataset [38] and follow the standard data construction process from previous methods [6] to synthesize the paired data. This dataset will also be made public.

Metrics. Following previous works, we select PSNR and SSIM as metrics to quantitatively measure the quality of the warped results. Please note that it is challenging to use the Average Endpoint Error (EPE) metric to evaluate the motion estimation performance in practical image warping tasks, because the accurate labels of motion are hard to obtain and unavailable in all the above test datasets. As a consequence, most previous methods have opted to learn the motion in an unsupervised manner and supervise the image warping model at the warped pixel level.
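For reference, PSNR between a warped result and its ground truth follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below is a plain-Python illustration on single-channel images stored as nested lists; it assumes 8-bit intensities (`max_value=255`), and it is not the evaluation script used in the paper.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio between two equally sized 2D images."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    mse = sum((p - t) ** 2
              for p, t in zip(flat_pred, flat_target)) / len(flat_pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the metric compares warped pixels rather than motion fields, it can be computed even when ground-truth motion is unavailable, which is the situation described above.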

For the portrait correction task, the ShapeAcc metric is applied as suggested in Tan et al. [25]. It is specifically designed to assess the quality of face correction and calculates the similarity between a corrected portrait and the stereographic projection of its original input.

IV-B Comparison Results

We compare the proposed MOWA with recent SotA methods on each task, including Deep_Rect [6], He et al. [5], RecRecNet [21], PCN [37], Feng et al. [3], Shih et al. [41], Tan et al. [25], and Zhu et al. [26].

Qualitative Comparison. As shown in Fig. 4, we visualize the comparison results of different methods on the testing datasets. These qualitative results demonstrate that our multiple-in-one method handles various tasks, scenes, and resolutions well, compared with the SotA methods specially designed for each task. For example, for the rotated images, our method can rearrange the input into a rectangular one while keeping the original geometric layout reasonable. In contrast, distorted buildings can be observed in the results of previous works [5, 6], in which physical-world regularities such as the horizon are perturbed. For other rectangling tasks such as the stitched image, unrolling shutter image, and rectified wide-angle image, our method shows a better visual appearance, especially at the image boundaries, and preserves structural integrity better than the compared methods. In some challenging cases, such as the first and second rows of the stitched images, the image boundaries are dramatically stretched, but our method can still warp the images to the expected structures. One important reason is that MOWA learns a generalized warping strategy from different tasks, since it can extract common knowledge from them. In addition, our method mitigates the difficulty of motion estimation by disentangling it at both the region level and pixel level. Consequently, diverse motion structures can be progressively approximated and the image details can be preserved. For the fisheye images and portrait photos, MOWA is capable of recovering a realistic geometric distribution from the inputs, despite the radial distortion or perspective distortion. Please refer to more visual comparison images, interactive warping visualizations, and dynamic warping results on the project page: https://kangliao929.github.io/projects/mowa/.

Quantitative Comparison. We report the quantitative evaluation results in Table II. The proposed multiple-in-one image warping jointly learns six tasks and achieves promising performance compared with the single-task methods. For example, MOWA outperforms the SotA methods in rectified wide-angle images, unrolling shutter images, rotated images, and fisheye images, thanks to the elaborately designed hierarchical motion estimation architecture and task-aware prompt learning strategy. Moreover, MOWA achieves comparable image warping performance for stitched images and portrait photos without intolerable performance degradation when involving more tasks and data. The results suggest the generalizability and flexibility of MOWA, which are not achievable by previous methods [21, 5, 3, 6, 25] that tailor the specific knowledge into their models to address the single image warping task.

Computation Complexity Comparison. In Table II, we also compare the computational complexity of the proposed method with previous methods that make their models available. The comparison suggests that our model size is reasonable and affordable for a multiple-in-one image warping framework. Even compared to the SotA models designed for a specific task [6, 21], MOWA uses fewer parameters while achieving better or comparable warping performance. The underlying reason is that the knowledge shared across different tasks relieves the parameter burden of a multi-task model. Besides, the proposed motion estimation module discards the heavy fully connected layers and replaces them with convolutional layers. The predicted region-level TPS points are then further compensated with the pixel-level displacement from a compact convolutional decoder.

Figure 5: Ablation study on the proposed motion estimation module. The predicted TPS control points are shown with sizes of $10 \times 10$, $12 \times 12$, $14 \times 14$, and $16 \times 16$, from left to right. The coarse results and final results are obtained by warping the input using the first control points and the final flow (coupled with the last TPS points and residual flow), respectively.

IV-C Ablation Study

Considering the aim of a multiple-in-one framework is to achieve holistic performance across various tasks, we mainly compare the different variants of the framework in terms of the average warping metrics. Additionally, the same image quality metrics (PSNR and SSIM) are shared in the first five tasks, but the portrait correction task has its own metrics like ShapeAcc. Thus, the average PSNR and SSIM from the first five tasks are mainly reported in this part.

TABLE III: Ablation study on the proposed motion estimation module. The baseline represents the predicted control points with a size of $12 \times 12$. “10-10-10-10” means 4 heads are applied and each head predicts the control points with a size of $10 \times 10$. Other settings are also presented in this form. “Ours” denotes the combination of “10-12-14-16” and residual flow.

| Metrics | Baseline | 12-12-12-12 | 14-14-14-14 | 16-16-16-16 | 10-12-14-16 | Ours |
| PSNR | 20.29 | 20.38 | 20.42 | 20.02 | 20.48 | 20.84 |
| SSIM | 0.6279 | 0.6311 | 0.6406 | 0.6148 | 0.6418 | 0.6572 |

Motion Estimation. It is challenging to estimate multiple motions in one model since the complexities and patterns of motion differ significantly across tasks. To this end, we propose a flexible and hierarchical architecture to disentangle the motion estimation at the region level and pixel level. As shown in Fig. 5, better localization performance of the formed mesh can be achieved by increasing the number of TPS points. Besides, the pixel-level residual flow provides a higher degree of freedom for the motion than the region-level motion representation alone, improving the warping results, particularly at the image boundaries and in the details. Table III quantitatively demonstrates the effectiveness of the proposed hierarchical motion estimation module. We also found that an upper bound is reached when continuously increasing the number of control points, e.g., four motion estimation heads that each predict $16 \times 16$ TPS control points perform even worse than those predicting $12 \times 12$. This suggests that the performance of multiple-in-one image warping would be limited without a proper way of decoupling the motion.

TABLE IV: Ablation study on different task classifiers for the multiple-in-one image warping framework.

| Metrics | w/o Classifier | Classifier-Image | Classifier-Point | Ours |
| PSNR | 20.48 | 20.58 | 20.63 | 20.83 |
| SSIM | 0.6418 | 0.6451 | 0.6463 | 0.6558 |
| Parameters | - | 3.39M | 0.0592M | 0.0594M |
Figure 6: Evaluations of the multi-task learning and effectiveness of the proposed prompt learning module. The normalized PSNR and SSIM of validation data are visualized for six tasks. We recap different task types at the bottom.

Task Classifier. Since an image-based classifier involves redundant image features and burdens the main image warping framework, we propose a lightweight point-based classifier to learn the task type from each input image. As listed in Table IV, four variants are compared: the image warping model without a task classifier, with the image-based classifier, with the point-based classifier, and with the point-based classifier complemented by pooled global features (Ours). Note that we validate the different task classifiers only with the TPS prediction modules, showing their direct influence on the region-level motion estimation. The quantitative results demonstrate an evident performance gain over the vanilla network from adding the task classifier, indicating that the task type information is meaningful in multi-task training.

Figure 7: Generalization evaluation of the proposed method: cross-domain evaluation (top) and zero-shot evaluation (bottom). The visualized flow and control points are both predicted by MOWA. In image retargeting results, the red dotted lines measure the stretch extent of the face or body by warping operations.

IV-D Analysis on Multi-Task Learning and Prompt Learning

In the proposed method, a prompt learning module is designed to modulate the feature map with the soft task label, which helps the model dynamically navigate its extensive parameter space to achieve task-aware image warping. By integrating this module into MOWA, the averaged PSNR over all warping results improves by $0.21$ dB over the baseline.

To further analyze the influence of multi-task learning and the effectiveness of the proposed prompt learning, we visualize the quantitative metrics of the warping results of validation images for each task. As illustrated in Figure 6, the values of normalized PSNR and SSIM are plotted along different training epochs. Considering that the ranges of PSNR and SSIM differ significantly across warping tasks, we normalize all values into the range of $[0, 1]$ to eliminate the data bias. For the portrait photos, we leverage the available warping flow in the training dataset to obtain the corrected images and compute the PSNR and SSIM. For other tasks, we directly use the ground truth of warped images in the corresponding datasets.
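The text does not specify how the values are mapped into $[0, 1]$. One common choice, shown here purely as an assumption, is per-task min-max normalization of each metric curve over the training epochs; the function name and the sample values are hypothetical.

```python
def normalize_curve(values):
    """Min-max normalize one task's metric curve into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat curve carries no trend; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical PSNR values of one task at three checkpoints.
normalized = normalize_curve([18.0, 20.0, 22.0])
```

Normalizing each task separately puts curves with different absolute PSNR or SSIM ranges on a common scale, so that convergence trends rather than absolute values are compared.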

Figure 8: While our method is trained on the synthetic datasets with clean masks, it shows good robustness to the real-world data with noisy pseudo masks. The motion estimation of real-world images, e.g., the predicted control points, and warping flow, exhibit structurally reasonable distributions similar to those of the synthetic images.

From Figure 6 (a), we have the following three observations: (1) Different tasks show various levels of difficulty when training a multiple-in-one image warping model. For example, learning to warp the unrolling shutter image (task 3) is generally easier than other tasks, showing the fastest convergence in the first 80 epochs. The reason is that the structures of unrolling shutter images are largely regular, where more than two boundaries are straight and do not need to be warped. In contrast, learning to warp the stitched images (task 1), rectified wide-angle images (task 2), and portrait photos (task 6) is more challenging (slow convergence can be observed) since their boundaries or expected motions vary greatly within the datasets. In the portrait photo dataset especially, the numbers, shapes, and locations of the faces are quite diverse. (2) The multiple-in-one model tends to sacrifice the performance of some individual tasks to improve the holistic warping performance. In particular, the accuracy of warping the unrolling shutter image drops dramatically after the 80th epoch, while the performance of other tasks continues to improve. Such a trade-off among different tasks facilitates an overall improvement but leads to performance conflicts between some tasks. (3) The relationship between different tasks can be positive or negative. For example, the curves of the stitched image and rotated image show consistent trends during MOWA’s training as they share a similar warping principle, i.e., rectangling the irregular image boundaries while keeping the content unchanged. Thus, meaningful interactions between these two tasks can arise when learning a multiple-in-one model. However, the curve of warping the fisheye image shows the opposite trend to those of the stitched image and rotated image. The reason is that correcting the radial distortion present in fisheye images significantly alters the scene’s layout while leaving the image boundaries almost untouched, which contrasts with the foundational principle of the rectangling-based tasks. As a result, imbalanced convergence can be noticed across tasks with negative relationships.

To dynamically boost the task-aware image warping in a single model, we propose a prompt learning module and a point-based task classifier. As shown in Figure 6 (b), the performance conflicts among different tasks are relieved by prompt learning. All tasks show a similar improvement trend as training proceeds, without dramatic performance degradation in individual tasks. More importantly, the multiple-in-one framework reaches its best warping performance on all tasks simultaneously at the end of training. This suggests that the framework learns to discriminate and warp different input types using the learned task-specific prior knowledge. Our designed prompts enable MOWA to efficiently traverse and harness its vast parameter space to meet various warping requirements.

We also visualize the t-SNE of the learned prompts of MOWA in Figure 6 (c). We can observe these prompts are well-clustered according to the task types. This clear clustering demonstrates the ability of the prompts to learn and represent discriminative motion context, which significantly aids in holistic image warping. The visualization underscores the effectiveness of our approach in capturing and leveraging task-specific features to enhance model performance.

Figure 9: Failure cases of the proposed approach. It fails to accurately warp the input image with challenging image boundaries using a certain number of control points.

IV-E Generalization Evaluation

We show the generalization ability of the proposed method in terms of the cross-domain and zero-shot evaluations in Fig. 7. For the cross-domain evaluation, the new inputs belong to the above six practical image warping tasks but they are captured in real-world settings with different cameras, resolutions (can be up to 4K), and scenes. For the zero-shot evaluation, we consider a new image warping task, i.e., image retargeting, which aims to flexibly change the image scale without distorting the content as much as possible.

The visualization results demonstrate that MOWA can well extend to real-world scenarios, though its training datasets are mostly synthesized by hand-crafted camera models or manipulation spaces. One possible reason is the multiple-in-one model can naturally address the overfitting issue on specific datasets by learning various tasks. In addition to the cross-domain evaluations, we find that while our model does not involve the image retargeting task during training, it is still able to warp the image based on the “content-aware” principle. As we can observe, the predicted control points are accurately aligned to the image boundaries. Such knowledge transferring to new tasks potentially benefits from the shared motion perception across different tasks. Therefore, our results show fewer geometrical distortions for the foreground than the crop and resize operation, with the face and body experiencing less stretching.

It is noticed that our method also exhibits satisfactory robustness to noisy data. For instance, in Figure 8, although MOWA is trained on the synthetic dataset with clean masks, the warping results of the real-world dataset with noisy pseudo masks (the clean masks are not available in some practical applications) are still structurally reasonable.

IV-F Limitation Discussion

We show some failure cases in Figure 9. In these cases, we can find that the image boundaries are more irregular and the expected displacements of warping are more complicated than most samples. Consequently, it is challenging to approximate the accurate motion structure with a certain number of control points. This limitation could be addressed by adding more control points and cascading more TPS regression heads. Besides, scaling up the resolution of the input image could potentially improve the warping performance on image boundaries and details.

V Conclusion

We have proposed MOWA in this work, the first multiple-in-one image warping framework in the field of computational photography. It considers six representative and practical tasks in one learning model and uses a unified motion representation to achieve various warping purposes. In particular, to mitigate the difficulty of approximating the diverse motions of different tasks, we propose to disentangle the motion estimation at both the region level and pixel level. Then, we equip MOWA with explicit task awareness by introducing a lightweight point-based classifier. Compared to the common image-based classifier, it achieves comparable performance while offering a significant parameter reduction. Subsequently, we feed the task label predicted by the task classifier into a prompt learning module and further modulate the feature maps in the decoder, which helps a single model efficiently navigate and leverage its extensive parameter space to meet various warping requirements. Comprehensive experiments demonstrate that MOWA outperforms SotA methods specifically designed for individual tasks across most tasks, with an affordable model size. In the future, we plan to empower MOWA with cross-view and cross-modal abilities, aiming to build a foundation model for universal image warping.

References

  • [1] H. S. Sawhney and R. Kumar, “True multi-image alignment and its application to mosaicing and lens distortion correction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 3, pp. 235–243, 1999.
  • [2] R. Hartley and S. B. Kang, “Parameter-free radial distortion correction with center of distortion estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1309–1321, 2007.
  • [3] H. Feng, W. Wang, J. Deng, W. Zhou, L. Li, and H. Li, “Simfir: A simple framework for fisheye image rectification with self-supervised representation learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12418–12427.
  • [4] W. Wang, H. Feng, W. Zhou, Z. Liao, and H. Li, “Model-aware pre-training for radial distortion rectification,” IEEE Transactions on Image Processing, 2023.
  • [5] K. He, H. Chang, and J. Sun, “Rectangling panoramic images via warping,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
  • [6] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Deep rectangling for image stitching: a learning baseline,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5740–5748.
  • [7] D. Li, K. He, J. Sun, and K. Zhou, “A geodesic-preserving method for image warping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 213–221.
  • [8] Y. Zhang, Y.-K. Lai, and F.-L. Zhang, “Content-preserving image stitching with piecewise rectangular boundary constraints,” IEEE transactions on visualization and computer graphics, vol. 27, no. 7, pp. 3198–3212, 2020.
  • [9] J. Kannala and S. S. Brandt, “A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1335–1340, 2006.
  • [10] J. P. Barreto and H. Araujo, “Geometric properties of central catadioptric line images and their application in calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1327–1333, 2005.
  • [11] D. Herrera, J. Kannala, and J. Heikkilä, “Joint depth and color camera calibration with distortion correction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 2058–2064, 2012.
  • [12] S. Ramalingam and P. Sturm, “A unifying model for camera calibration,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1309–1319, 2016.
  • [13] K. Liao, L. Nie, S. Huang, C. Lin, J. Zhang, Y. Zhao, M. Gabbouj, and D. Tao, “Deep learning for camera calibration and beyond: A survey,” arXiv preprint arXiv:2303.10559, 2023.
  • [14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision.   Cambridge University Press, 2003.
  • [15] X. Liu, Y. Zhao, and S.-C. Zhu, “Single-view 3d scene reconstruction and parsing by attribute grammar,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 710–725, 2017.
  • [16] C. Häne, C. Zach, A. Cohen, and M. Pollefeys, “Dense semantic 3d reconstruction,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1730–1743, 2016.
  • [17] S. Kumar, Y. Dai, and H. Li, “Superpixel soup: Monocular dense 3d reconstruction of a complex dynamic scene,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 5, pp. 1705–1717, 2019.
  • [18] J. Ping, Y. Liu, and D. Weng, “Comparison in depth perception between virtual reality and augmented reality systems,” in 2019 IEEE conference on virtual reality and 3d user interfaces (VR).   IEEE, 2019, pp. 1124–1125.
  • [19] S. Thanyadit, P. Punpongsanon, and T.-C. Pong, “Investigating visualization techniques for observing a group of virtual reality users using augmented reality,” in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR).   IEEE, 2019, pp. 1189–1190.
  • [20] A. Genay, A. Lécuyer, and M. Hachet, “Being an avatar “for real”: a survey on virtual embodiment in augmented reality,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 12, pp. 5071–5090, 2021.
  • [21] K. Liao, L. Nie, C. Lin, Z. Zheng, and Y. Zhao, “Recrecnet: Rectangling rectified wide-angle images by thin-plate spline model and dof-based curriculum learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 10800–10809.
  • [22] H. Feng, W. Zhou, J. Deng, Y. Wang, and H. Li, “Geometric representation learning for document image rectification,” in European Conference on Computer Vision, 2022, pp. 475–492.
  • [23] H. Feng, W. Zhou, J. Deng, Q. Tian, and H. Li, “Docscanner: robust document image rectification with progressive learning,” arXiv preprint arXiv:2110.14968, 2021.
  • [24] K. He, H. Chang, and J. Sun, “Content-aware rotation,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 553–560.
  • [25] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, “Practical wide-angle portraits correction with deep structured models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3498–3506.
  • [26] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, “Semi-supervised wide-angle portraits correction by multi-scale transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19689–19698.
  • [27] F. L. Bookstein, “Principal warps: Thin-plate splines and the decomposition of deformations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, 1989.
  • [28] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” International journal of computer vision, vol. 74, pp. 59–73, 2007.
  • [29] C.-C. Lin, S. U. Pankanti, K. Natesan Ramamurthy, and A. Y. Aravkin, “Adaptive as-natural-as-possible image stitching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1155–1163.
  • [30] F. Zhang and F. Liu, “Parallax-tolerant image stitching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3262–3269.
  • [31] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Dr-gan: Automatic radial distortion rectification using conditional gan in real-time,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2019.
  • [32] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, “Deepcalib: A deep learning approach for automatic intrinsic calibration of wide field-of-view cameras,” in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018, pp. 1–10.
  • [33] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 286–301.
  • [34] M. Liu, X. He, and M. Salzmann, “Geometry-aware deep network for single-image novel view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4616–4624.
  • [35] I. Daribo and B. Pesquet-Popescu, “Depth-aided image inpainting for novel view synthesis,” in 2010 IEEE International Workshop on Multimedia Signal Processing. IEEE, 2010, pp. 167–170.
  • [36] H. Zhou, Y. Zhu, X. Lv, Q. Liu, and S. Zhang, “Rectangular-output image stitching,” in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2800–2804.
  • [37] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6348–6357.
  • [38] W. Yan, R. T. Tan, B. Zeng, and S. Liu, “Deep homography mixture for single image rolling shutter correction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9868–9877.
  • [39] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, “Deep shutter unrolling network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
  • [40] V. Rengarajan, Y. Balaji, and A. Rajagopalan, “Unrolling the shutter: Cnn to correct motion distortions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
  • [41] Y. Shih, W.-S. Lai, and C.-K. Liang, “Distortion-free wide-angle portraits on camera phones,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
  • [42] Z. Liao, W. Zhou, and H. Li, “Dafir: Distortion-aware representation learning for fisheye image rectification,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [43] N. S. Detlefsen, O. Freifeld, and S. Hauberg, “Deep diffeomorphic transformer networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4403–4412.
  • [44] I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural network architecture for geometric matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6148–6157.
  • [45] W. Li, Y. Lu, K. Zheng, H. Liao, C. Lin, J. Luo, C.-T. Cheng, J. Xiao, L. Lu, C.-F. Kuo et al., “Structured landmark detection via topology-adapting deep graph learning,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 2020, pp. 266–283.
  • [46] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
  • [47] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [48] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [49] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 683–17 693.
  • [50] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Deep rotation correction without angle prior,” IEEE Transactions on Image Processing, vol. 32, pp. 2879–2888, 2023.