MOWA: Multiple-in-One Image Warping Model
Abstract
While recent image warping approaches achieved remarkable success on existing benchmarks, they still require training separate models for each specific task and cannot generalize well to different camera models or customized manipulations. To address diverse types of warping in practice, we propose a Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To further enable dynamic task-aware image warping, we introduce a lightweight point-based classifier that predicts the task type, serving as prompts to modulate the feature maps for more accurate estimation. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model. Extensive experiments demonstrate that our MOWA, which is trained on six tasks for multiple-in-one image warping, outperforms state-of-the-art task-specific models across most tasks. Moreover, MOWA also exhibits promising potential to generalize into unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code and more visual results can be found on the project page: https://kangliao929.github.io/projects/mowa/.
Index Terms:
Image Warping, Multiple-in-One Model, Prompt Learning.
I Introduction
Image warping is essential in the field of computational imaging and computer vision, serving as the foundation for numerous applications, including image rectification [1, 2, 3, 4], image rectangling [5, 6, 7, 8], camera calibration [9, 10, 11, 12, 13], and 3D reconstruction [14, 15, 16, 17]. By enabling the manipulation of image data through operations such as scaling, rotation, and shearing, it allows for the seamless integration of diverse visual elements and the correction of optical imperfections. Moreover, image warping is indispensable in developing augmented reality (AR) and virtual reality (VR) applications [18, 19, 20], where it helps create immersive and realistic environments by accurately mapping textures and images onto 3D models.
Considering that inputs are derived from different camera models or manipulation spaces, recent works integrate specific prior knowledge into their models to address the corresponding image warping tasks [6, 5, 21, 4, 22, 23, 24, 25, 26]. While these single-task approaches achieve significant progress, we find that they suffer from two main limitations: (i) a lack of generalization and flexibility, which restricts their real-world applications because users must manually identify each input type and apply the appropriate single-task model, a process that is time-consuming and difficult for non-professional users; and (ii) substantial storage requirements for multiple task-specific models, which are impractical for resource-limited platforms. Thus, it is crucial to develop a holistic framework capable of efficiently warping images from various camera models or manipulation spaces. Furthermore, many image warping tasks involve shared processes, such as motion estimation and content-aware perception, which indicates that a unified framework incorporating these common techniques is feasible.
In this work, we propose a Multiple-in-One image WArping model (named MOWA) to address various tasks in practice, as shown in Fig. 1. Specifically, we consider six representative types in the field of computational photography, namely stitched images, rectified wide-angle images, unrolling shutter images, rotated images, fisheye images, and portrait photos, covering the mainstream practical image warping tasks.
Given that learning different structures of motion in one model is non-trivial and that motion representations differ significantly across tasks, we propose to disentangle the motion estimation at both the region level and the pixel level. In this hierarchical architecture, we first estimate the control points of the thin-plate spline (TPS) model [27] in progressively increasing numbers, during which the feature maps are progressively warped and rectified. Such a representation excels at approximating complex motions at the region level and adapts flexibly to various motion structures. Subsequently, the warped feature maps are fed into the decoder to predict the residual pixel-level displacement, which further improves the warping results for each task, especially at the image boundaries and in fine details.
To enable MOWA to explicitly discriminate diverse input types, we devise a lightweight point-based classifier. Adding an extra classification network based on the image features is a straightforward solution but brings high computation and storage costs. Noticing that the motion structures in different warping tasks follow their own specific distributions, we leverage an intermediate product of the image warping framework, i.e., the region-level control points, to directly learn the task type. This achieves comparable performance with far fewer parameters than an image-based classifier, since only a few 2D points are needed. The task label predicted by this point-based classifier is then used by a prompt learning module to modulate the feature maps in the decoder, dynamically boosting task-aware image warping. Prompts are a set of learnable parameters that encapsulate essential discriminative information about different types of input, empowering a single model to efficiently traverse and harness its vast parameter space to accommodate various warping requirements.
In the experiments, we train MOWA on six typical tasks for multiple-in-one image warping. Experimental results demonstrate that it outperforms state-of-the-art (SotA) task-specific models in most tasks, even with a comparable number of network parameters. In addition, MOWA is able to generalize to unseen scenes, as evidenced by cross-domain evaluation (unfamiliar domains) and zero-shot evaluation (unseen tasks), indicating its robustness and adaptability across various scenarios. Our contributions can be summarized as follows:
• We propose MOWA, the first practical multiple-in-one image warping framework. Despite its affordable model size, the proposed model clearly outperforms most SotA methods.
• We propose to mitigate the difficulty of multi-task learning by decoupling the motion estimation at both the region level and the pixel level. Moreover, a prompt learning module, guided by a lightweight point-based classifier, is designed to facilitate task-aware image warping.
• We show that through multi-task learning, our framework develops a robust generalized warping strategy that gains improved performance across various tasks and even generalizes to unseen tasks.
II Related Work
Image warping is the process of manipulating an image to change its shape or alignment. This transformation is achieved by applying a spatial mapping function to the coordinates of the original image, resulting in a new image with altered geometry. In computational photography, image warping is a key technique for enhancing and manipulating images beyond traditional photography limits. It enables the creation of panoramic images [28, 29, 30], the correction of lens distortions [31, 32, 9], and the synthesis of novel views [33, 34, 35]. In the past few decades, warping techniques have significantly contributed to the development of advanced imaging applications beyond those mentioned above, offering greater flexibility and creativity. However, such manipulations can twist the image boundaries, leading to visually unpleasant layouts and negative effects on downstream vision tasks. In practical scenarios, most users favor rectangular boundaries due to their compatibility with standard display formats, which facilitates sharing, printing, and publication [5, 24]. Therefore, researchers have developed diverse image rectangling methods to warp the image boundaries to be straight [5, 6, 21, 7, 8, 36]. Most of them follow the principle of content-aware image warping to avoid large distortion of the original content distribution when rectangling the image. Besides, different motion representations, such as the mesh [5, 6] and control points [21], are exploited to formulate the warping process.
Beyond customized manipulations, some special camera models introduce geometric distortion into the captured images, e.g., radial distortion, rolling shutter distortion, and perspective distortion. Due to these distortions, the semantic content of the images deviates significantly from real-world geometry. To address this issue, a range of distortion correction approaches [31, 32, 37, 3, 38, 39, 40, 25, 41, 26, 4] aim to warp the distorted input into a geometrically reasonable one. In particular, regression-based methods [32, 40] learn the camera and distortion parameters from the input image and correct the distortion by simulating the imaging process of a predefined camera model. In contrast, reconstruction-based methods [31, 37, 3, 38, 25, 42] directly learn the pixel-wise displacement between the distorted image and its ground truth, facilitating model-free correction and enabling end-to-end training.
The above works achieve remarkable progress on various tasks by studying well-designed network architectures and tailored motion representations. However, they need to train an individual model for each specific warping type and require prior knowledge of the camera model or customized manipulations. In this work, we propose a multiple-in-one framework that covers these typical and practical image warping tasks. We address the challenge of learning different motion structures within a single model by employing a coarse-to-fine approach, progressively adding more TPS points to accurately fit the expected geometric distribution. To provide additional degrees of freedom beyond TPS, our method further learns the residual flow based on the warped feature map, allowing for fine-tuning of image boundaries and details in the final results.
III Multiple-in-One Warping Model
III-A Problem Definition
In this study, we consider six representative and practical image types in the field of computational photography, including stitched images, rectified wide-angle images, unrolling shutter images, rotated images, fisheye images, and portrait photos, covering the mainstream practical image warping tasks. These types are further classified into two groups. The first four types (stitched images, rectified wide-angle images, unrolling shutter images, and rotated images) suffer from irregular boundaries because the original images have been manipulated by customized operations, such as image stitching, distortion correction, and rotation. Therefore, image rectangling is proposed to reshape these irregular boundaries while keeping the distribution of content unchanged. The last two types (fisheye images and portrait photos) exhibit inherent geometric distortions introduced by special camera models, such as radial distortion in fisheye images and perspective distortion in portrait photos. Correcting these distortions is crucial to scene understanding and aesthetic appreciation.
Figure 2 shows the overall framework of the proposed MOWA. It takes the image and mask as input and estimates the TPS control points with increasingly refined numbers. In this region-level motion estimation, the feature maps are progressively warped and rectified. Subsequently, the warped features are fed into the decoder to predict a residual pixel-level motion. To enable task-aware and expandable capabilities, a lightweight point-based classifier and prompt learning module are designed. We elaborate on the details of each module in the proposed multiple-in-one image warping framework as follows.
III-B Motion Estimation Module
Learning multiple warping types in one model is challenging since the network needs to balance different complexities of the multiple motion types in motion estimation. Furthermore, the model’s scalability would be restricted if the motion representation is hand-crafted for specific tasks. Hence, we propose a flexible and hierarchical architecture for general image warping in MOWA. As shown in Fig. 2, the motion estimation is disentangled at both the region level, where the number of TPS control points progressively increases, and the pixel level, where a residual map is predicted to further compensate the estimated TPS flow.
Region-Level Motion Estimation. The TPS transformation [27] stands out for its remarkable ability to model complex motions [43, 44, 45]. It performs image warping based on two sets of region-level control points, namely $\mathbf{P} = [p_1, p_2, \dots, p_N]^\top$ for the source image and $\mathbf{Q} = [q_1, q_2, \dots, q_N]^\top$ for the target image. To minimize the distortion between the source and target images, an energy term is introduced to penalize the Euclidean distance between the transformed source points $\mathcal{T}(p_i)$ and the target points $q_i$, i.e., $\sum_{i=1}^{N} \|\mathcal{T}(p_i) - q_i\|_2^2$. This penalty results in a spatial deformation function parameterized by the control points, effectively capturing the intricate deformations across the image and maintaining the overall structural integrity. Specifically, the derived spatial deformation function can be expressed as follows:
$$\mathcal{T}(p) = A \begin{bmatrix} p \\ 1 \end{bmatrix} + \sum_{i=1}^{N} w_i \, U\big(\|p - p_i\|_2\big) \tag{1}$$
where $p$ represents a point located in the source image, $A \in \mathbb{R}^{2 \times 3}$ and $w_i \in \mathbb{R}^{2 \times 1}$ are the transformation parameters, and $U(r) = r^2 \log r^2$ is a radial basis function that quantifies the influence of the control point $p_i$ on $p$; more details can be found in [27]. Notably, this deformation function plays a key role in determining the deformation induced by each control point, thereby shaping the overall transformation.
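To make the region-level deformation concrete, the following minimal NumPy sketch fits and evaluates a standard TPS mapping: an affine term plus a weighted sum of radial basis functions centred at the source control points, using the usual kernel U(r) = r^2 log r^2. It is an illustrative sketch of the textbook formulation rather than the MOWA implementation; the function names `tps_kernel`, `fit_tps`, and `apply_tps` are hypothetical, and the parameters are obtained by exact interpolation of the control points.

```python
import numpy as np

def tps_kernel(r2):
    """Radial basis U(r) = r^2 log(r^2), expressed in terms of r^2, with U(0) = 0."""
    return np.where(r2 > 0, r2 * np.log(np.maximum(r2, 1e-12)), 0.0)

def fit_tps(src, dst):
    """Solve for the affine part A (2x3) and kernel weights w (Nx2) so that T(src) = dst."""
    n = src.shape[0]
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = tps_kernel(d2)                                         # (N, N) kernel matrix
    P = np.hstack([src, np.ones((n, 1))])                      # (N, 3) homogeneous points
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T                # standard TPS linear system
    rhs = np.vstack([dst, np.zeros((3, 2))])
    sol = np.linalg.solve(L, rhs)
    return sol[n:].T, sol[:n]                                  # A (2, 3), w (N, 2)

def apply_tps(pts, src, A, w):
    """Evaluate T(p) = A [p; 1] + sum_i w_i U(|p - p_i|) for every row p of `pts`."""
    d2 = ((pts[:, None, :] - src[None, :, :]) ** 2).sum(-1)   # (M, N)
    affine = np.hstack([pts, np.ones((len(pts), 1))]) @ A.T    # (M, 2)
    return affine + tps_kernel(d2) @ w

# Usage: a 3x3 grid of control points shifted by a constant offset.
xs = np.linspace(-1.0, 1.0, 3)
src = np.array([[x, y] for y in xs for x in xs])
dst = src + np.array([0.1, -0.2])
A, w = fit_tps(src, dst)
moved = apply_tps(np.array([[0.3, 0.4]]), src, A, w)
```

For a pure translation the kernel weights vanish and only the affine part remains, so any query point is shifted by the same offset; non-rigid targets yield non-zero weights that bend the mapping locally.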
Motion estimation acts as the fundamental stage in image warping, presenting particular challenges in the context of multiple-in-one task learning. To enhance the capability of our model in motion estimation, we design a progressive motion estimation module. More specifically, this module cascades a sequence of TPS transformation heads that gradually increase the number of control points. The control points predicted by the preceding head are upsampled and integrated into the prediction of the next head. Subsequently, these control points are arranged to generate a mesh. Then we adopt the TPS transformation to warp this mesh, aiming to align it with the regular mesh defined on the ground truth image. In the implementation, considering that a cascade of fully connected layers introduces significant computation and storage costs, we use one or two convolution layers to predict the control points after each TPS transformation head. The pipeline of cascaded TPS transformation heads can be expressed by:
$$\mathbf{P}_{k+1} = \mathcal{H}_{k+1}\big(\mathcal{W}(F_k, \mathbf{P}_k)\big) + \mathcal{U}(\mathbf{P}_k) \tag{2}$$
where $\mathcal{H}_k$ is the $k$-th TPS transformation head, and $F_k$ and $\mathbf{P}_k$ are the feature map and control points of the $k$-th head, respectively. $\mathcal{W}$ represents the warping operation for feature maps given control points, and $\mathcal{U}$ is a customized upsampling layer for control points.
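The coarse-to-fine hand-off between consecutive heads can be sketched as follows. The text only states that the control points are upsampled by a customized layer and integrated into the next prediction, so two assumptions are made here: the upsampling is plain bilinear interpolation of the control-point grid, and the integration is additive. The helper names and the 10-to-12 grid sizes are hypothetical.

```python
import numpy as np

def upsample_control_points(points, new_size):
    """Bilinearly resample a (K, K, 2) grid of control points to (new_size, new_size, 2)."""
    k = points.shape[0]
    pos = np.linspace(0.0, k - 1.0, new_size)            # sample positions in old index space
    i0 = np.clip(np.floor(pos).astype(int), 0, k - 2)    # lower neighbour index
    t = pos - i0                                          # fractional offsets in [0, 1]
    rows = points[i0] * (1 - t)[:, None, None] + points[i0 + 1] * t[:, None, None]
    return rows[:, i0] * (1 - t)[None, :, None] + rows[:, i0 + 1] * t[None, :, None]

def next_control_points(prev_points, predicted_residual):
    """Assumed integration rule: upsampled previous points plus the new head's output."""
    new_size = predicted_residual.shape[0]
    return upsample_control_points(prev_points, new_size) + predicted_residual

def regular_grid(k):
    """Regular (K, K, 2) grid of (x, y) coordinates in [-1, 1]."""
    axis = np.linspace(-1.0, 1.0, k)
    return np.stack(np.meshgrid(axis, axis, indexing="xy"), axis=-1)

# Usage: a regular 10x10 grid refined to 12x12 with a zero residual stays regular.
coarse = regular_grid(10)
fine = next_control_points(coarse, np.zeros((12, 12, 2)))
```

With this rule each head only has to predict a correction on top of the coarser estimate, which is one plausible reading of the progressive design.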
Pixel-Level Motion Estimation. While the TPS transformation is flexible and adaptable to various tasks, it is limited in its ability to describe detailed motions due to its restricted degrees of freedom. To alleviate this limitation, we further complement the region-level motion representation with a pixel-level residual flow. Specifically, we first rectify the feature map using the corresponding control points in the last transformation head, and then feed the rectified feature map into a decoder network to predict the desired residual flow. As in common U-Net architectures, the shallow features in the encoder are passed to the decoder through skip connections. To avoid the blur caused by multiple successive warpings, each of which involves interpolation, we densify the TPS control points to the pixel level and couple the result with the residual flow to directly warp the input image $I$. The final warping result $\hat{I}$ can be obtained by:
$$\hat{I} = \mathcal{W}_f\big(I,\ \mathcal{D}(\mathbf{P}_K) + \Delta f\big) \tag{3}$$
where $\mathcal{W}_f$ denotes the warping operation given the flow map and input image, $\mathbf{P}_K$ denotes the control points from the last transformation head, $\Delta f$ is the residual flow predicted by the decoder, and $\mathcal{D}$ densifies the sparse control points to a dense flow map, which can be regarded as a special case of the TPS upsampling layer. Unlike previous works tailored for specific warping tasks, our method unifies motion representation across various tasks at both the region and pixel levels. The experimental results are presented in Section IV-C.
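The point of summing the two flows before warping is that the input image is interpolated only once. A minimal NumPy sketch of this final step is given below, assuming a backward-warping convention in which the flow stores per-pixel (dx, dy) offsets in pixels and samples are clamped at the image border; `warp_image` and `final_warp` are hypothetical names, and the dense TPS flow is taken as given.

```python
import numpy as np

def warp_image(image, flow):
    """Backward-warp `image` (H, W, C) with `flow` (H, W, 2) of (dx, dy) offsets in pixels."""
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, w - 1)              # sampling positions, clamped
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)       # top-left neighbour
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    ax = (x - x0)[..., None]                              # bilinear weights
    ay = (y - y0)[..., None]
    top = image[y0, x0] * (1 - ax) + image[y0, x0 + 1] * ax
    bottom = image[y0 + 1, x0] * (1 - ax) + image[y0 + 1, x0 + 1] * ax
    return top * (1 - ay) + bottom * ay

def final_warp(image, tps_flow, residual_flow):
    """Combine region-level and pixel-level motion, then resample the input exactly once."""
    return warp_image(image, tps_flow + residual_flow)

# Usage: two half-pixel horizontal flows add up to a one-pixel shift.
img = np.arange(20, dtype=float).reshape(4, 5, 1)
half = np.zeros((4, 5, 2))
half[..., 0] = 0.5
shifted = final_warp(img, half, half)
```

Warping twice with the half-pixel flows would blend neighbouring pixels at each step, whereas the combined flow lands on exact pixel centres and leaves the content sharp.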
III-C Point-based Task Classifier
When learning various image warping tasks simultaneously, a task classifier is crucial for efficiently routing inputs to their respective task-specific components, optimizing resource use, and enhancing model performance. It is straightforward to design a task classifier that operates on the input image. However, such a design inevitably incurs high computational complexity due to the redundant image features. Instead, we propose a lightweight task classifier based on the TPS points predicted by the motion estimation module. Our motivation stems from the fact that the motion structures in different tasks follow their own specific distributions, which can be observed in the point space as shown in Fig. 3. To this end, we design a PointNet-like network [46, 47] to predict the task type. Specifically, as shown in Fig. 2, it takes as input the local motion coordinates $\mathbf{P}$ concatenated point-wise, along the last dimension, with the global image features $F$ (after max pooling), and outputs the soft task label $\hat{c}$. We can formulate this point-based task classifier as follows:
$$F_g = \mathrm{FC}_1\big(\mathrm{MaxPool}(F)\big) \tag{4}$$
$$\hat{c} = \mathrm{FC}_2\big(\mathbf{P} \oplus \mathrm{R}(F_g)\big) \tag{5}$$
where $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are the fully connected layers that decrease the dimensions of features and learn the abstract concepts, respectively. R denotes replicating $F_g$ to the same shape as the predicted motion coordinates, and $\oplus$ is the point-wise concatenation. Experiments demonstrate that our point-based task classifier achieves comparable results while using only 0.0594M parameters, compared with 3.39M for the image-based classifier (Table IV).
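A forward pass of such a point-based classifier can be sketched in a few lines of NumPy. This is an illustrative sketch with random weights, not the trained network: the layer sizes, the single shared per-point layer (standing in for the 1D convolutions), and the names `classify_task` and `params` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_task(points, feat, params):
    """points: (N, 2) control points; feat: (C, H, W) image features; returns (T,) probabilities."""
    g = feat.max(axis=(1, 2))                        # global max pooling: (C,)
    g = np.maximum(params["fc1"] @ g, 0.0)           # reduce the global feature dimension: (D,)
    g_rep = np.tile(g, (points.shape[0], 1))         # replicate for every point: (N, D)
    x = np.concatenate([points, g_rep], axis=-1)     # point-wise concatenation: (N, 2 + D)
    x = np.maximum(x @ params["point_mlp"], 0.0)     # shared per-point layer: (N, E)
    x = x.max(axis=0)                                # order-invariant pooling over points: (E,)
    return softmax(params["fc2"] @ x)                # soft task label: (T,)

# Hypothetical sizes: C=32 feature channels, D=8, E=16, T=6 tasks, N=100 points.
params = {
    "fc1": rng.normal(size=(8, 32)),
    "point_mlp": rng.normal(size=(10, 16)),
    "fc2": rng.normal(size=(6, 16)),
}
points = rng.uniform(-1.0, 1.0, size=(100, 2))
feat = rng.normal(size=(32, 8, 8))
probs = classify_task(points, feat, params)
num_params = sum(p.size for p in params.values())
```

Because the classifier consumes a handful of 2D coordinates and a pooled feature vector instead of full feature maps, its weight matrices stay tiny, which is the intuition behind the parameter savings reported for the point-based design.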
In addition to the task classification function, our point-based task classifier can further improve image warping performance. This improvement is due to the high-level guidance provided by the task classifier to the motion estimation module through gradient back-propagation. Figure 3 (right) depicts a typical example, in which the fisheye rectification result shows a less distorted shape, and the predicted control points of the stitched image are more tightly aligned to the image boundary. Compared to the vanilla baseline, the proposed point-based task classifier achieves an average improvement of in PSNR metrics across various image warping tasks. More quantitative results are presented in Section IV-C.
III-D Prompt Learning Module
Once the inputs are classified by the proposed point-based network, we leverage the predicted task label to modulate the feature maps in the network. In particular, a prompt learning block is inserted into each layer in the decoder as a plug-and-play module. Prompt learning aims to tackle the challenge of generalizing in various image warping tasks by aiding the network in comprehending the specific task at hand. The prompts serve as a flexible and lightweight component to encode motion context across multiple scales within the image warping network.
Assuming the number of tasks is $T$, we introduce a set of learnable parameters as our prompts, namely $\{\mathbf{p}_t\}_{t=1}^{T}$. Denoting the task label predicted by the task classifier as $\hat{c}$, we modulate the feature maps $F_d$ in the decoder network by the prompts as follows:
$$\hat{F}_d = \mathrm{Conv}\Big(\Big[F_d,\ \sum_{t=1}^{T} \hat{c}_t\, \mathbf{p}_t\Big]\Big) \tag{6}$$
where $[\cdot, \cdot]$ represents the concat operation, and $\mathrm{Conv}$ is a convolution layer that reduces the channel dimension of the concatenated features.
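The modulation step can be illustrated with a small NumPy sketch: the soft task label weights the learnable prompts, the weighted prompt is concatenated with the decoder feature along the channel axis, and a convolution maps the result back to the original channel count. The 1x1 kernel, the tensor shapes, and the name `modulate_with_prompts` are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def modulate_with_prompts(feat, prompts, task_probs, conv_weight):
    """feat: (C, H, W); prompts: (T, Cp, H, W); task_probs: (T,); conv_weight: (Cout, C + Cp)."""
    prompt = np.tensordot(task_probs, prompts, axes=1)       # label-weighted prompt: (Cp, H, W)
    stacked = np.concatenate([feat, prompt], axis=0)         # channel concat: (C + Cp, H, W)
    return np.einsum("oc,chw->ohw", conv_weight, stacked)    # assumed 1x1 conv over channels

# Usage with hypothetical sizes: 4 feature channels, 3 prompt channels, 6 tasks.
rng = np.random.default_rng(0)
C, Cp, T, H, W = 4, 3, 6, 5, 5
feat = rng.normal(size=(C, H, W))
prompts = rng.normal(size=(T, Cp, H, W))
task_probs = np.eye(T)[2]                                    # confident prediction of task 2
weight = rng.normal(size=(C, C + Cp))
out = modulate_with_prompts(feat, prompts, task_probs, weight)
```

With a one-hot label the weighted sum simply selects the prompt of the predicted task, while a soft label blends prompts, which lets uncertain inputs draw on several task-specific cues at once.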
By integrating the learnable prompts with the features of the warping model, we can significantly enrich the representations with task-specific knowledge. Unlike pre-defined and fixed prompts, our adaptive approach enables the network to dynamically adjust its behavior, resulting in more efficient and precise image warping. This adaptive process not only enhances the flexibility of the model but also improves its ability to generalize across different tasks and datasets. Further analysis of multi-task learning and the effectiveness of the proposed prompt learning is presented in Section IV-D.
III-E Training Loss
After predicting the TPS control points and the residual flow, the warped image can be obtained by Eq. (3). Following previous works [6, 21], we first exploit three losses to train our multiple-in-one image warping framework, namely the image reconstruction loss $\mathcal{L}_{rec}$, the perceptual loss $\mathcal{L}_{per}$, and the inter-grid loss $\mathcal{L}_{inter}$. The reconstruction loss and perceptual loss supervise the warped image at the pixel level and feature level, respectively. The inter-grid loss constrains the edges of two consecutive deformed grids to be collinear:
$$\mathcal{L}_{inter} = \frac{1}{N} \sum_{\{e_1, e_2\} \in m} \left(1 - \frac{\langle e_1, e_2 \rangle}{\|e_1\| \cdot \|e_2\|}\right) \tag{7}$$
Here, $N$ represents the number of tuples of two successive edges $\{e_1, e_2\}$ in a mesh $m$. When the cosine term above is maximized, the corresponding two edges become collinear. Consequently, the loss reaches its minimum, ensuring the image content remains consistent.
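The collinearity constraint can be checked numerically with a short NumPy sketch. It averages one minus the cosine between every pair of successive horizontal edges and successive vertical edges of a mesh; the mesh layout of shape (rows, cols, 2), the small `eps` stabiliser, and the name `inter_grid_loss` are assumptions for illustration.

```python
import numpy as np

def inter_grid_loss(mesh, eps=1e-8):
    """Mean of (1 - cosine) over all pairs of successive edges in a (rows, cols, 2) mesh."""
    def one_minus_cos(e1, e2):
        dot = (e1 * e2).sum(-1)
        norm = np.linalg.norm(e1, axis=-1) * np.linalg.norm(e2, axis=-1)
        return (1.0 - dot / (norm + eps)).ravel()
    horiz = mesh[:, 1:] - mesh[:, :-1]       # horizontal edges: (rows, cols - 1, 2)
    vert = mesh[1:, :] - mesh[:-1, :]        # vertical edges: (rows - 1, cols, 2)
    terms = np.concatenate([
        one_minus_cos(horiz[:, :-1], horiz[:, 1:]),   # successive edges along each row
        one_minus_cos(vert[:-1], vert[1:]),           # successive edges along each column
    ])
    return terms.mean()

# Usage: a regular mesh is (numerically) loss-free, a bent mesh is penalised.
axis = np.linspace(0.0, 1.0, 5)
regular = np.stack(np.meshgrid(axis, axis), axis=-1)
bent = regular.copy()
bent[2, 2] += np.array([0.1, 0.15])
loss_regular = inter_grid_loss(regular)
loss_bent = inter_grid_loss(bent)
```

Any affine deformation of the regular mesh keeps successive edges parallel and therefore leaves the loss near zero, so the term only penalises local bending of grid lines.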
Considering the ground truth of warping flow is available in the training dataset of portrait photos, we also add the reconstruction loss on the predicted flow of the portrait correction task. Moreover, we provide middle-level supervision on the warped results from the TPS prediction heads with a set of exponentially growing weights. To train the point-based task classifier, the standard cross-entropy loss is applied. Overall, the final loss can be expressed by:
(8) |
where and are the hyper-parameters to balance different losses, both of them are empirically set to .
In summary, the proposed multiple-in-one image warping framework brings the following benefits.
• Unlike previous task-specific image warping models, our method can rectify various geometrically distorted images within a single network. It does not require prior knowledge of the camera models or manipulation spaces; it is also easy to use, relying only on the observed input image to perform the customized image warping.
• Our method provides greater flexibility and cost-effectiveness in real-world scenarios, unlike previous methods that need a proportionally larger model size as the number of warping tasks increases.
• Thanks to multi-task learning, our method develops a generalized motion representation across various image warping tasks, demonstrating remarkable performance in cross-domain evaluations and unseen tasks.
IV Experiments
To demonstrate the effectiveness of the proposed multiple-in-one image warping method, we evaluate its performance on six representative distorted types, including stitched images, rectified wide-angle images, unrolling shutter images, rotated images, fisheye images, and portrait photos, covering the mainstream practical image warping tasks.
IV-A Experimental Settings
Implementation Details. We train the proposed model using the Adam optimizer with the momentum terms of on 8 NVIDIA A100 GPUs. The learning rate starts with a linear warm-up in the first three epochs and then decays from to following a cosine schedule in the remaining epochs. The batch size is set as 64. The complete framework is trained with a fixed input size of . At the first 10 epochs, we solely train and supervise the TPS prediction heads with the point-based task classifier. Afterwards, all modules are trained collectively. During inference, the proposed method supports image warping for any resolution by scaling the predicted TPS control points and residual flow.
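The learning-rate schedule described above (a linear warm-up over the first three epochs followed by cosine decay) can be sketched as below. The peak and final learning rates and the total number of epochs are hypothetical placeholders chosen for the example, not the values used in the paper.

```python
import math

def learning_rate(epoch, total_epochs, warmup_epochs=3, base_lr=1e-4, min_lr=1e-6):
    """Linear warm-up to `base_lr`, then cosine decay to `min_lr` (placeholder values)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage: per-epoch learning rates for a hypothetical 100-epoch run.
schedule = [learning_rate(e, 100) for e in range(100)]
```

The warm-up avoids large early updates while the TPS heads are still poorly initialised, and the cosine tail lowers the step size smoothly towards the end of training.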
Network Configuration. We design the image warping network based on the encoder-decoder architecture, enabling both region-level control point regression and pixel-level residual flow prediction. Specifically, the Transformer blocks with shifted windows [48, 49] are used in both the encoder and decoder except for the input projection layer and output projection layer. The basic dimension of channels is set to 32 and linearly increases along the layers in the encoder network, which is oppositely decreased to 2 in the decoder network. Moreover, the depths of each Transformer block are set to 2 and the head numbers of multi-head self-attention are along the whole layers. In TPS prediction heads, we adopt the convolution layers with different kernels to predict increasing numbers of control points, and the numbers are set to , , , and . The configuration details of these regression heads are listed in Table I. Such a design enables significant parameter reduction compared with the fully connected layers. For the lightweight point-based classifier, three 1D convolutional layers with channel dimensions of are used to extract the features of input and then three fully connected layers with unit numbers of are used to classify their task types.
Configuration | |||||
---|---|---|---|---|---|
Kernel Size | {, } | ||||
Stride | 2 | {1, 1} | 1 | 1 | 1 |
Padding | 1 | {0, 0} | 0 | 0 | 1 |
Input Type | Methods | PSNR | SSIM | ShapeAcc | Parameters |
---|---|---|---|---|---|
Rotated Image | He et al. [5] | 17.63 | 0.4880 | - | - |
Rotated Image | Deep_Rect [6] | 19.89 | 0.5500 | - | 52.14M |
Rotated Image | Ours | 21.01 | 0.5961 | - | 49.93M |
Rectified Wide-Angle Image | He et al. [5] | 15.36 | 0.4211 | - | - |
Rectified Wide-Angle Image | RecRecNet [21] | 18.68 | 0.5450 | - | 62.70M |
Rectified Wide-Angle Image | Ours | 18.69 | 0.5450 | - | 49.93M |
Stitched Image | He et al. [5] | 14.70 | 0.3775 | - | - |
Stitched Image | Deep_Rect [6] | 21.28 | 0.7140 | - | 52.14M |
Stitched Image | Ours | 20.72 | 0.6425 | - | 49.93M |
Unrolling Shutter Image | RecRecNet [21] | 21.48 | 0.7602 | - | 62.70M |
Unrolling Shutter Image | Ours | 21.69 | 0.7795 | - | 49.93M |
Fisheye Image | PCN [37] | 21.37 | 0.6925 | - | 26.19M |
Fisheye Image | Feng et al. [3] | 21.72 | 0.7167 | - | 11.65M |
Fisheye Image | Ours | 22.25 | 0.7488 | - | 49.93M |
Portrait Photos | Shih et al. [41] | - | - | 97.253 | - |
Portrait Photos | Tan et al. [25] | - | - | 97.490 | - |
Portrait Photos | Zhu et al. [26] | - | - | 97.491 | 8.79M |
Portrait Photos | Ours | - | - | 97.477 | 49.93M |
Datasets. We use the public benchmarks from recent SotA works, including the image rectangling datasets [6, 50, 21] and the distortion correction datasets [37, 25, 26]. Since there is no available training dataset for unrolling shutter image rectangling, we use the rolling shutter correction dataset [38] and follow the standard data construction process from previous methods [6] to synthesize the paired data. This dataset will also be made public.
Metrics. Following previous works, we select PSNR and SSIM as metrics to quantitatively measure the quality of the warped results. Please note that it is challenging to use the Average Endpoint Error (EPE) metric to evaluate the motion estimation performance in practical image warping tasks, because the accurate labels of motion are hard to obtain and unavailable in all the above test datasets. As a consequence, most previous methods have opted to learn the motion in an unsupervised manner and supervise the image warping model at the warped pixel level.
For the portrait correction task, the ShapeAcc metric is applied as suggested in Tan et al. [25]. It is specially designed for the quality of face correction, which calculates the similarity between corrected portraits and the stereographic projection of its original input.
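For reference, the PSNR reported for the first five tasks follows the standard definition based on the mean squared error between the warped result and its ground truth. The sketch below assumes 8-bit images with a peak value of 255; the function name `psnr` and the `max_val` parameter are illustrative choices.

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Usage: a constant error of 10% of the peak value.
target = np.zeros((4, 4))
pred = np.full((4, 4), 25.5)
value = psnr(pred, target)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB, as in the tables here, corresponds to a reduction in mean squared error of several percent.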
IV-B Comparison Results
We compare the proposed MOWA with recent SotA methods on each task, including Deep_Rect [6], He et al. [5], RecRecNet [21], PCN [37], Feng et al. [3], Shih et al. [41], Tan et al. [25], and Zhu et al. [26].
Qualitative Comparison. As shown in Fig. 4, we visualize the comparison results of different methods on the testing datasets. These qualitative results demonstrate that our multiple-in-one method handles various tasks, scenes, and resolutions well, compared with the SotA methods specially designed for each task. For example, for the rotated images, our method can rearrange the input into a rectangular one while keeping the original geometric layout reasonable. In contrast, distorted buildings can be observed in the results of previous works [5, 6], in which real-world cues such as the horizon are perturbed. For the other rectangling tasks, i.e., the stitched image, unrolling shutter image, and rectified wide-angle image, our method shows a better visual appearance, especially at the image boundaries, with promising structural integrity relative to the compared methods. In some challenging cases, such as the first and second rows of the stitched images, the image boundaries are dramatically stretched, but our method can still warp the images to the expected structures. One important reason is that MOWA learns a generalized warping strategy by extracting common knowledge from different tasks. In addition, our method mitigates the difficulty of motion estimation by disentangling it at both the region level and pixel level. Consequently, diverse structures of motion can be progressively approximated and the image details can be preserved. For the fisheye image and portrait photos, MOWA is capable of recovering the realistic geometric distribution from the inputs, despite the radial distortion or perspective distortion. Please refer to the project page for more visual comparison images, interactive warping visualizations, and dynamic warping results: https://kangliao929.github.io/projects/mowa/.
Quantitative Comparison. We report the quantitative evaluation results in Table II. The proposed multiple-in-one image warping jointly learns six tasks and achieves promising performance compared with the single-task methods. For example, MOWA outperforms the SotA methods in rectified wide-angle images, unrolling shutter images, rotated images, and fisheye images, thanks to the elaborately designed hierarchical motion estimation architecture and task-aware prompt learning strategy. Moreover, MOWA achieves comparable image warping performance for stitched images and portrait photos without intolerable performance degradation when involving more tasks and data. The results suggest the generalizability and flexibility of MOWA, which are not achievable by previous methods [21, 5, 3, 6, 25] that tailor the specific knowledge into their models to address the single image warping task.
Computation Complexity Comparison. In Table II, we also compare the computation complexity of the proposed method with previous methods that make their models available. The comparison suggests that our model size is reasonable and affordable as a multiple-in-one image warping framework. Even compared to the SotA models designed for the specific task [6, 21], our MOWA has fewer parameters to achieve better or comparable warping performance. The underlying reason is that the shared knowledge across different tasks can relieve the burden of the parameter requirements of a multi-task model. Besides, the proposed motion estimation module discards the heavy fully connected layers and replaces them with convolutional layers. Then, the predicted region-level TPS points are further compensated with the pixel-level displacement from a compact convolutional decoder.
IV-C Ablation Study
Considering the aim of a multiple-in-one framework is to achieve holistic performance across various tasks, we mainly compare the different variants of the framework in terms of the average warping metrics. Additionally, the same image quality metrics (PSNR and SSIM) are shared in the first five tasks, but the portrait correction task has its own metrics like ShapeAcc. Thus, the average PSNR and SSIM from the first five tasks are mainly reported in this part.
Metrics | Baseline | 12-12-12-12 | 14-14-14-14 | 16-16-16-16 | 10-12-14-16 | Ours |
---|---|---|---|---|---|---|
PSNR | 20.29 | 20.38 | 20.42 | 20.02 | 20.48 | 20.84 |
SSIM | 0.6279 | 0.6311 | 0.6406 | 0.6148 | 0.6418 | 0.6572 |
Motion Estimation. It is challenging to estimate multiple motions in one model since the complexities and patterns of motion differ significantly across tasks. For this purpose, we propose a flexible and hierarchical architecture that disentangles the motion estimation at the region level and pixel level. As shown in Fig. 5, better localization of the formed mesh can be achieved by increasing the number of TPS points. Besides, the pixel-level residual flow provides more degrees of freedom for the motion than the region-level representation alone, improving the warping results, particularly at the image boundaries and in details. Table III quantitatively demonstrates the effectiveness of the proposed hierarchical motion estimation module. We also found that the performance saturates when the number of control points keeps increasing; e.g., in Table III the 16-16-16-16 configuration of the four motion estimation heads performs worse than both the 12-12-12-12 and 14-14-14-14 configurations. This suggests that the performance of multiple-in-one image warping would be limited without a proper scheme for decoupling the motion.
Metrics | w/o Classifier | Classifier-Image | Classifier-Point | Ours |
---|---|---|---|---|
PSNR | 20.48 | 20.58 | 20.63 | 20.83 |
SSIM | 0.6418 | 0.6451 | 0.6463 | 0.6558 |
Parameters | - | 3.39M | 0.0592M | 0.0594M |
Task Classifier. Because an image-based classifier involves redundant image features and burdens the main image warping framework, we propose a lightweight point-based classifier to learn the task type of each input image. As listed in Table IV, four variants are compared: the image warping model without a task classifier, with the image-based classifier, with the point-based classifier, and with the point-based classifier complemented by pooled global features (Ours). Note that we validate the different task classifiers with only the TPS prediction modules, which shows their direct influence on region-level motion estimation. The quantitative results show an evident performance gain over the vanilla network when the task classifier is added, indicating that task-type information is meaningful in multi-task training.
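To illustrate why a classifier operating on points can be so small, the following sketch applies one shared layer to each 2D point, max-pools over the point set, and maps the pooled feature to task probabilities, in the spirit of PointNet [46]. The layer sizes, the random untrained weights, and the assumption that the input is a set of 2D control points are all hypothetical; this is not the classifier architecture used in MOWA.

```python
import math
import random


def shared_layer(point, weights, biases):
    """Linear layer + ReLU applied identically to every point."""
    return [max(0.0, sum(w * v for w, v in zip(row, point)) + b)
            for row, b in zip(weights, biases)]


def classify_points(points, w1, b1, w2, b2):
    """Return soft task probabilities for an unordered set of 2D points."""
    per_point = [shared_layer(p, w1, b1) for p in points]
    pooled = [max(channel) for channel in zip(*per_point)]  # order-invariant max pooling
    logits = [sum(w * f for w, f in zip(row, pooled)) + b for row, b in zip(w2, b2)]
    peak = max(logits)                                      # stabilize the softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
HIDDEN, NUM_TASKS = 16, 6  # hidden width is hypothetical; six tasks as in the paper
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_TASKS)]
b2 = [0.0] * NUM_TASKS

points = [(x / 3.0, y / 3.0) for y in range(4) for x in range(4)]  # toy 4x4 grid
probs = classify_points(points, w1, b1, w2, b2)
num_params = sum(len(r) for r in w1) + len(b1) + sum(len(r) for r in w2) + len(b2)
print(probs, num_params)
```

Because the input has only two coordinates per point, the first layer stays tiny regardless of image resolution, which is consistent with the large parameter gap between the image-based and point-based rows of Table IV.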
IV-D Analysis on Multi-Task Learning and Prompt Learning
In the proposed method, a prompt learning module modulates the feature map with the soft task label, which helps the model dynamically navigate its extensive parameter space to achieve task-aware image warping. Integrating this module into MOWA improves the averaged PSNR of all warping results over the baseline.
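One simple way to picture modulation by a soft task label is sketched below: per-task prompt vectors are blended with the predicted task probabilities, and the blended prompt rescales each feature channel. The channel-wise scaling rule, the function names, and all sizes are assumptions chosen for illustration; the exact modulation used in MOWA is not specified in this section.

```python
def mix_prompts(task_probs, prompts):
    """Blend per-task prompt vectors using the soft task label as weights."""
    channels = len(prompts[0])
    return [sum(p * prompt[c] for p, prompt in zip(task_probs, prompts))
            for c in range(channels)]


def modulate(feature_map, prompt):
    """Scale each channel of a C x H x W feature map by (1 + prompt[c])."""
    return [[[v * (1.0 + g) for v in row] for row in channel]
            for channel, g in zip(feature_map, prompt)]


# Hypothetical example: 3 tasks, 2 channels, a 1x2 spatial grid.
prompts = [[0.1, -0.2], [0.3, 0.0], [0.0, 0.5]]
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
soft_label = [0.0, 1.0, 0.0]          # classifier is certain about task 2

mixed = mix_prompts(soft_label, prompts)
print(modulate(features, mixed))
```

With a soft rather than one-hot label, the blended prompt interpolates between task-specific prompts, so an ambiguous input receives a mixture of the corresponding modulations.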
To further analyze the influence of multi-task learning and the effectiveness of the proposed prompt learning, we plot the quantitative metrics of the warping results on the validation images of each task. As illustrated in Figure 6, the normalized PSNR and SSIM values are plotted over the training epochs. Because the ranges of PSNR and SSIM differ significantly across warping tasks, we normalize all values into a common range to eliminate this bias. For the portrait photos, we use the warping flow available in the training dataset to obtain the corrected images and then compute PSNR and SSIM. For the other tasks, we directly use the ground-truth warped images in the corresponding datasets.
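A common way to place curves with different ranges on one plot is min-max normalization applied to each task's curve separately, as in the sketch below. The target range of [0, 1] and the per-curve normalization are assumptions, and the example values are placeholders rather than measurements from the paper.

```python
def min_max_normalize(values, lo=0.0, hi=1.0):
    """Linearly rescale a sequence so its minimum maps to lo and its maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]  # constant curve: no spread to rescale
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


# Hypothetical PSNR curves (dB) over training checkpoints for two tasks.
curves = {
    "task_a": [20.0, 25.0, 30.0],
    "task_b": [15.0, 15.5, 16.0],
}
normalized = {name: min_max_normalize(vals) for name, vals in curves.items()}
print(normalized)
```

After normalization, both curves span the same interval, so their trends over epochs can be compared directly even though their absolute PSNR levels differ.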
From Figure 6 (a), we make the following three observations. (1) Different tasks show different levels of difficulty when training a multiple-in-one image warping model. For example, learning to warp the unrolling shutter image (task 3) is generally easier than the other tasks and shows the fastest convergence in the first 80 epochs. The reason is that the structures of unrolling shutter images are basically regular: more than two boundaries are straight and do not need to be warped. In contrast, learning to warp the stitched images (task 1), rectified wide-angle images (task 2), and portrait photos (task 6) is more challenging (slow convergence can be observed), since their boundaries or expected motions vary greatly within the datasets. In the portrait photo dataset in particular, the numbers, shapes, and locations of the faces are quite diverse. (2) The multiple-in-one model tends to sacrifice the performance of some individual tasks to improve holistic warping performance. In particular, the accuracy of warping the unrolling shutter image drops dramatically after the 80th epoch, while the performance of the other tasks continues to improve. Such a trade-off among tasks facilitates an overall improvement but leads to performance conflicts between some tasks. (3) The relationship between tasks can be positive or negative. For example, the curves of the stitched image and rotated image show consistent trends during MOWA’s training, as they share a similar warping principle, i.e., rectangling the irregular image boundaries while keeping the content unchanged. Thus, meaningful interactions between these two tasks can occur when learning a multiple-in-one model. However, the curve of warping the fisheye image shows the opposite trend to those of the stitched image and rotated image. The reason is that correcting the radial distortion in fisheye images significantly alters the scene’s layout while leaving the image boundaries almost untouched, which contrasts with the principle underlying the rectangling-based tasks. As a result, imbalanced convergence can be observed across tasks with negative relationships.
To dynamically boost task-aware image warping in a single model, we propose a prompt learning module and a point-based task classifier. As shown in Figure 6 (b), prompt learning relieves the performance conflicts between tasks. All tasks show a similar improvement trend as training proceeds, without dramatic performance degradation in any individual task. More importantly, the multiple-in-one framework reaches its best warping performance on all tasks at the end of training. This suggests that the framework learns to discriminate between input types and warp them using the learned task-specific prior knowledge. Our designed prompts enable MOWA to efficiently traverse and harness its vast parameter space to meet various warping requirements.
We also visualize the t-SNE embedding of the learned prompts of MOWA in Figure 6 (c). The prompts are well clustered according to task type. This clear clustering indicates that the prompts learn and represent discriminative motion context, which aids holistic image warping, and it supports the effectiveness of our approach in capturing and leveraging task-specific features.
IV-E Generalization Evaluation
We show the generalization ability of the proposed method through cross-domain and zero-shot evaluations in Fig. 7. For the cross-domain evaluation, the new inputs belong to the above six practical image warping tasks but are captured in real-world settings with different cameras, resolutions (up to 4K), and scenes. For the zero-shot evaluation, we consider a new image warping task, i.e., image retargeting, which aims to flexibly change the image scale while distorting the content as little as possible.
The visualization results demonstrate that MOWA extends well to real-world scenarios, even though its training datasets are mostly synthesized with hand-crafted camera models or manipulation spaces. One possible reason is that learning various tasks allows the multiple-in-one model to naturally mitigate overfitting to specific datasets. Beyond the cross-domain evaluations, we find that although our model does not see the image retargeting task during training, it is still able to warp the image according to the “content-aware” principle. As can be observed, the predicted control points are accurately aligned to the image boundaries. Such knowledge transfer to new tasks potentially benefits from the motion perception shared across tasks. As a result, our results show fewer geometric distortions in the foreground than the crop and resize operation, with the face and body experiencing less stretching.
We also note that our method exhibits satisfactory robustness to noisy data. For instance, in Figure 8, although MOWA is trained on a synthetic dataset with clean masks, its warping results on a real-world dataset with noisy pseudo masks (clean masks are not available in some practical applications) are still structurally reasonable.
IV-F Limitation Discussion
We show some failure cases in Figure 9. In these cases, the image boundaries are more irregular and the expected warping displacements are more complicated than in most samples. Consequently, it is challenging to approximate the accurate motion structure with a fixed number of control points. This limitation could be addressed by adding more control points and cascading more TPS regression heads. In addition, scaling up the resolution of the input image could potentially improve the warping performance on image boundaries and details.
V Conclusion
In this work, we have proposed MOWA, the first multiple-in-one image warping framework in the field of computational photography. It handles six representative and practical tasks in one learning model and uses a unified motion representation to achieve various warping purposes. In particular, to mitigate the difficulty of approximating the diverse motions of different tasks, we propose to disentangle the motion estimation at both the region level and the pixel level. We then equip MOWA with explicit task awareness by introducing a lightweight point-based classifier. Compared to a common image-based classifier, it achieves comparable performance with significantly fewer parameters. Subsequently, we feed the task label predicted by the task classifier into a prompt learning module and further modulate the feature maps in the decoder, which helps a single model efficiently navigate and leverage its extensive parameter space to meet various warping requirements. Comprehensive experiments demonstrate that MOWA outperforms different SotA methods specifically designed for each single task, with an affordable model size. In the future, we plan to empower MOWA with cross-view and cross-modal abilities, aiming to build a foundation model for universal image warping.
References
- [1] H. S. Sawhney and R. Kumar, “True multi-image alignment and its application to mosaicing and lens distortion correction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 3, pp. 235–243, 1999.
- [2] R. Hartley and S. B. Kang, “Parameter-free radial distortion correction with center of distortion estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1309–1321, 2007.
- [3] H. Feng, W. Wang, J. Deng, W. Zhou, L. Li, and H. Li, “Simfir: A simple framework for fisheye image rectification with self-supervised representation learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12 418–12 427.
- [4] W. Wang, H. Feng, W. Zhou, Z. Liao, and H. Li, “Model-aware pre-training for radial distortion rectification,” IEEE Transactions on Image Processing, 2023.
- [5] K. He, H. Chang, and J. Sun, “Rectangling panoramic images via warping,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
- [6] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Deep rectangling for image stitching: a learning baseline,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5740–5748.
- [7] D. Li, K. He, J. Sun, and K. Zhou, “A geodesic-preserving method for image warping,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 213–221.
- [8] Y. Zhang, Y.-K. Lai, and F.-L. Zhang, “Content-preserving image stitching with piecewise rectangular boundary constraints,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 7, pp. 3198–3212, 2020.
- [9] J. Kannala and S. S. Brandt, “A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1335–1340, 2006.
- [10] J. P. Barreto and H. Araujo, “Geometric properties of central catadioptric line images and their application in calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1327–1333, 2005.
- [11] D. Herrera, J. Kannala, and J. Heikkilä, “Joint depth and color camera calibration with distortion correction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 2058–2064, 2012.
- [12] S. Ramalingam and P. Sturm, “A unifying model for camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1309–1319, 2016.
- [13] K. Liao, L. Nie, S. Huang, C. Lin, J. Zhang, Y. Zhao, M. Gabbouj, and D. Tao, “Deep learning for camera calibration and beyond: A survey,” arXiv preprint arXiv:2303.10559, 2023.
- [14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
- [15] X. Liu, Y. Zhao, and S.-C. Zhu, “Single-view 3d scene reconstruction and parsing by attribute grammar,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 710–725, 2017.
- [16] C. Häne, C. Zach, A. Cohen, and M. Pollefeys, “Dense semantic 3d reconstruction,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1730–1743, 2016.
- [17] S. Kumar, Y. Dai, and H. Li, “Superpixel soup: Monocular dense 3d reconstruction of a complex dynamic scene,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1705–1717, 2019.
- [18] J. Ping, Y. Liu, and D. Weng, “Comparison in depth perception between virtual reality and augmented reality systems,” in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2019, pp. 1124–1125.
- [19] S. Thanyadit, P. Punpongsanon, and T.-C. Pong, “Investigating visualization techniques for observing a group of virtual reality users using augmented reality,” in 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2019, pp. 1189–1190.
- [20] A. Genay, A. Lécuyer, and M. Hachet, “Being an avatar “for real”: a survey on virtual embodiment in augmented reality,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 12, pp. 5071–5090, 2021.
- [21] K. Liao, L. Nie, C. Lin, Z. Zheng, and Y. Zhao, “Recrecnet: Rectangling rectified wide-angle images by thin-plate spline model and dof-based curriculum learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 10 800–10 809.
- [22] H. Feng, W. Zhou, J. Deng, Y. Wang, and H. Li, “Geometric representation learning for document image rectification,” in European Conference on Computer Vision, 2022, pp. 475–492.
- [23] H. Feng, W. Zhou, J. Deng, Q. Tian, and H. Li, “Docscanner: robust document image rectification with progressive learning,” arXiv preprint arXiv:2110.14968, 2021.
- [24] K. He, H. Chang, and J. Sun, “Content-aware rotation,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 553–560.
- [25] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, “Practical wide-angle portraits correction with deep structured models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3498–3506.
- [26] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, “Semi-supervised wide-angle portraits correction by multi-scale transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 689–19 698.
- [27] F. L. Bookstein, “Principal warps: Thin-plate splines and the decomposition of deformations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, 1989.
- [28] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” International Journal of Computer Vision, vol. 74, pp. 59–73, 2007.
- [29] C.-C. Lin, S. U. Pankanti, K. Natesan Ramamurthy, and A. Y. Aravkin, “Adaptive as-natural-as-possible image stitching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1155–1163.
- [30] F. Zhang and F. Liu, “Parallax-tolerant image stitching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3262–3269.
- [31] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Dr-gan: Automatic radial distortion rectification using conditional gan in real-time,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2019.
- [32] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, “Deepcalib: A deep learning approach for automatic intrinsic calibration of wide field-of-view cameras,” in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018, pp. 1–10.
- [33] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 286–301.
- [34] M. Liu, X. He, and M. Salzmann, “Geometry-aware deep network for single-image novel view synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4616–4624.
- [35] I. Daribo and B. Pesquet-Popescu, “Depth-aided image inpainting for novel view synthesis,” in 2010 IEEE International Workshop on Multimedia Signal Processing. IEEE, 2010, pp. 167–170.
- [36] H. Zhou, Y. Zhu, X. Lv, Q. Liu, and S. Zhang, “Rectangular-output image stitching,” in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2800–2804.
- [37] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6348–6357.
- [38] W. Yan, R. T. Tan, B. Zeng, and S. Liu, “Deep homography mixture for single image rolling shutter correction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9868–9877.
- [39] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, “Deep shutter unrolling network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
- [40] V. Rengarajan, Y. Balaji, and A. Rajagopalan, “Unrolling the shutter: Cnn to correct motion distortions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
- [41] Y. Shih, W.-S. Lai, and C.-K. Liang, “Distortion-free wide-angle portraits on camera phones,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, pp. 1–12, 2019.
- [42] Z. Liao, W. Zhou, and H. Li, “Dafir: Distortion-aware representation learning for fisheye image rectification,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
- [43] N. S. Detlefsen, O. Freifeld, and S. Hauberg, “Deep diffeomorphic transformer networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4403–4412.
- [44] I. Rocco, R. Arandjelovic, and J. Sivic, “Convolutional neural network architecture for geometric matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6148–6157.
- [45] W. Li, Y. Lu, K. Zheng, H. Liao, C. Lin, J. Luo, C.-T. Cheng, J. Xiao, L. Lu, C.-F. Kuo et al., “Structured landmark detection via topology-adapting deep graph learning,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 2020, pp. 266–283.
- [46] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
- [47] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [48] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
- [49] Z. Wang, X. Cun, J. Bao, W. Zhou, J. Liu, and H. Li, “Uformer: A general u-shaped transformer for image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 683–17 693.
- [50] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Deep rotation correction without angle prior,” IEEE Transactions on Image Processing, vol. 32, pp. 2879–2888, 2023.