Abstract
Human gaze is a crucial cue used in various applications such as human-robot interaction, autonomous driving, and virtual reality. Recently, convolutional neural network (CNN) approaches have made notable progress in predicting gaze angles. However, estimating accurate gaze direction in the wild remains challenging because the most crucial gaze information lies in the eye area, which constitutes only a small part of the face image. In this paper, we introduce a novel two-branch CNN architecture with a multi-loss approach to estimate gaze angles (pitch and yaw) from face images. Our approach utilizes separate fully connected layers for each gaze angle prediction, allowing explicit learning of discriminative features and emphasizing the distinct information associated with each gaze angle. Moreover, we adopt a multi-loss approach incorporating both classification and regression losses, which allows joint optimization of the combined loss for each gaze angle and results in improved overall gaze performance. To evaluate our model, we conduct experiments on three popular datasets collected under unconstrained settings: MPIIFaceGaze, Gaze360, and RT-GENE. Our proposed model surpasses current state-of-the-art methods, achieving state-of-the-art performance on all three datasets and showcasing its superior capability in gaze estimation.
1 Introduction
Eye gaze is one of the essential cues used in a large variety of applications, including human-robot interaction [1,2,3], driver engagement [4,5,6,7,8] and augmented reality [9, 10]. Consequently, researchers have developed multiple techniques and approaches to accurately estimate the direction of human gaze. These methods can be broadly classified into two categories: model-based and appearance-based methods. Model-based methods typically rely on specialized hardware, which restricts their usage in unconstrained environments. Conversely, appearance-based methods directly infer the human gaze from images captured by readily available and inexpensive off-the-shelf cameras, enabling deployment in diverse locations and unrestricted settings.
Convolutional Neural Networks (CNNs) have emerged as a powerful tool in computer vision [11,12,13,14,15], delivering impressive results across a range of tasks, including gaze estimation. In the existing literature, two CNN-based approaches are commonly used to estimate gaze from appearance. The first extracts crucial gaze features from both the face and eye patches [16,17,18,19,20,21,22,23,24,25,26]. However, these methods often suffer from increased computational cost, as they require three separate feature extractors: one for the face and two for the eyes. In contrast, the second approach extracts gaze-related features solely from the entire face [27,28,29,30,31,32]. However, these methods tend to have lower accuracy because the eye area, which contains the most significant gaze features, occupies only a small portion of the face image.
Gaze estimation typically employs regression loss functions such as the mean square error (MSE) [18, 19, 33] and the mean absolute error (MAE) [28, 34, 35] to predict continuous gaze angles. However, relying solely on a regression loss makes it difficult to enforce constraints or assign higher importance to the regions that matter most, especially the eye regions, which contain the most crucial gaze information, resulting in suboptimal performance.
Recent appearance-based methods [16,17,18,19,20,21] have attempted to estimate gaze angles, namely pitch and yaw, by directly regressing these angles using a single fully connected layer. However, this may limit the capacity of the model to capture complex relationships and variations in the gaze data, resulting in performance degradation. Furthermore, the model may struggle to capture the unique characteristics and variations associated with each gaze angle individually as the network has to learn to map both angles simultaneously.
To overcome these limitations, we propose a novel network to estimate gaze angles from face images. Our approach leverages the strength of dual fully connected layers and combines classification and regression losses to improve the accuracy and robustness of gaze estimations. In our network, we introduce separate fully connected layers for predicting each gaze angle, namely pitch and yaw. This enables explicit learning of independent features and emphasizes the distinct information associated with each angle. The separation of these layers enhances feature separability and allows for flexible parameter adjustments to accommodate variations in angle ranges and representation requirements.
Additionally, we employ combined classification and regression losses for each separate angle prediction. The joint optimization of classification and regression objectives through the combined losses for each gaze angle facilitates the extraction of more informative gaze features, resulting in improved overall performance and generalization of the gaze estimation model. We perform gaze classification by utilizing a softmax layer along with a cross-entropy loss to obtain a coarse gaze direction. We then obtain fine-grained predictions by calculating the expectation of the gaze bin probabilities, followed by a gaze regression loss.
Our key contributions are as follows:
-
We propose a novel two-branch CNN architecture to predict gaze angles separately using two fully connected layers. This enables explicit learning of discriminative features associated with each gaze angle.
-
We utilize a multi-loss function comprising classification and regression losses to enable a joint optimization of the gaze angles. Using classification loss yields coarse gaze direction while still providing fine-grained prediction through the regression loss.
-
To address the issue of computational cost, we employ a CNN-based model that operates solely on face images and therefore needs only one feature extractor. In addition, we use an efficient CNN backbone with a reasonably low computational cost.
-
We demonstrate the superiority of our method in gaze performance and robustness in multiple experiments using several datasets.
-
We conduct an ablation study on the architecture of our proposed network, CNN backbones, and different loss functions to confirm the contribution of our proposed method.
2 Related work
Appearance-based gaze estimation approaches rely on learning a nonlinear mapping function from image intensities to the human gaze. Early methods usually learn a person-specific mapping function, e.g., a linear interpolation function [36], adaptive linear regression [37], or Gaussian process regression [38]. These methods show reasonable accuracy in constrained environments (e.g., subject-specific settings with fixed illumination), but they degrade significantly in unconstrained settings.
CNN-based gaze estimation methods [39,40,41] have gained more interest as they can model a highly nonlinear mapping function from image to gaze. Zhang et al. [34] first proposed a simple VGG CNN-based architecture to predict gaze using a single-eye image. Moreover, they designed a spatial weights CNN in [27] to give more weight to regions of the face that have crucial gaze information. Krafka et al. [16] proposed a multichannel network that takes eye images, full-face images, and face grid information to obtain crucial gaze information. Chen et al. [19] adopted dilated convolutions to make use of the high-level features extracted from images without decreasing spatial resolution.
Fischer et al. [18] proposed a gaze estimation method that incorporates the head pose vector and features extracted using VGG-Net from eye crops to predict accurate gaze angles. Cheng et al. [20] proposed FAR-Net, which leverages the two-eye asymmetry property for estimating 3D gaze angles. They assign asymmetric weights to the loss functions of each eye and aggregate them for improved performance. Wang et al. [42] adopted ID-ResNet that contains a residual neural network structure with an embedding layer of personal identity to overcome the different geometric parameters between subjects. Some approaches [43,44,45,46] improve gaze estimation by alleviating the issue of inter-person gap or personal calibration.
Xiong et al. [47] combine statistical models with deep learning by introducing a mixed-effects model that integrates statistical information within a CNN architecture. Wang et al. [48] integrated adversarial learning with a Bayesian approach in one framework, which demonstrates increased gaze generalization performance. Cheng et al. [49] proposed a coarse-to-fine adaptive network (CA-Net) that first uses the face image to predict primary gaze angles and then adapts them with the residual estimated from eye crops. Wu et al. [50] leverage modulated features and self-learning mechanisms to improve the accuracy of gaze estimation.
Kellnhofer et al. [29] used a temporal model (LSTM) with a sequence of 7 frames to predict gaze angles from face images. Moreover, they adopt the pinball loss to jointly regress the gaze direction and error bounds, improving gaze accuracy. Zhou et al. [51] explore the relationship between eye context and gaze estimation, employing metric learning techniques to enhance the model’s performance. Yun et al. [52] leverage high-frequency attentive super-resolution techniques to achieve accurate gaze estimation in challenging low-resolution scenarios.
Zhu et al. [53] leverage a dual-branch architecture to capture both global and local features from low-resolution images, enabling accurate gaze estimation. Chen et al. [54] combine advanced network architectures and efficient calibration strategies to model the relationship between gaze and appearance features effectively. Oh et al. [55] introduce a novel approach that combines self-attention mechanisms with convolution and deconvolution operations to efficiently estimate eye gaze from a full-face image.
3 Method
3.1 Network architecture
A complete pipeline of our proposed approach is depicted in Fig. 1. We employ a CNN backbone followed by two fully connected layers, one specifically designed for estimating the pitch gaze angle and the other for the yaw gaze angle. This architecture enables us to explicitly learn discriminative features and emphasize the distinctive information associated with each gaze angle. In contrast to most previous works, which utilize a single regression loss, we employ combined classification and regression losses for each separate angle prediction. This enables a joint optimization of the network’s loss objectives, which facilitates the extraction of more crucial gaze information.
In order to adapt our network to the proposed loss function, we change the output dimension of each fully connected layer to predict a number of output angle classes (bins). The number of bins is based on the range of gaze angles of each dataset used for evaluation and the width of each bin. In our method, we divide the range of gaze angles into equal bins/intervals of 3 degrees each to ensure that each angular bin covers an equal range of gaze angles. We apply the nonlinear softmax layer to the output of each fully connected layer to convert the network bin logits to a probability distribution. Finally, we add our multi-loss function to penalize the network weights. The combination of softmax and cross-entropy loss function contributes to the network’s robustness and stability by encouraging the network to learn to classify and assign probabilities to different angle bins accurately. Furthermore, including the regression loss ensures fine-grained supervision by guiding the network to generate continuous gaze angles.
The most commonly used CNN backbones for the gaze estimation task are AlexNet [16, 27] and VGG16 [18, 19]. These backbones are quite large networks with a high computational cost in terms of both latency (inference time) and floating-point operations (FLOPs). As our focus includes addressing computational cost, we instead utilize the small version of EfficientNetV2 [56]. EfficientNet models achieve efficiency gains by employing compound scaling and neural architecture search, which optimize both model size and computational requirements while maintaining competitive accuracy.
In addition, we streamline our approach by using only face images as input to the network. This eliminates the need for additional eye images, reducing both the computational overhead and overall computational cost associated with processing multiple input sources. By employing EfficientNetV2 as the backbone and solely relying on face images, we aim to strike a balance between computational efficiency and accurate gaze estimation, making our approach more feasible for real-world applications.
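As a concrete illustration, a minimal PyTorch sketch of this design is given below; the wrapper class, layer names, and the torchvision backbone loader are our own assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_v2_s

class TwoBranchGazeNet(nn.Module):
    """EfficientNetV2-S backbone followed by two separate FC heads,
    one producing bin logits for pitch and one for yaw (illustrative sketch)."""

    def __init__(self, num_bins: int = 90):  # e.g. 90 bins of 3 degrees for Gaze360
        super().__init__()
        backbone = efficientnet_v2_s(weights="IMAGENET1K_V1")  # ImageNet-1K pre-training
        self.features = backbone.features                      # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        feat_dim = backbone.classifier[1].in_features           # 1280 for EfficientNetV2-S
        self.fc_pitch = nn.Linear(feat_dim, num_bins)            # pitch bin logits
        self.fc_yaw = nn.Linear(feat_dim, num_bins)              # yaw bin logits

    def forward(self, x: torch.Tensor):
        feats = self.pool(self.features(x)).flatten(1)
        return self.fc_pitch(feats), self.fc_yaw(feats)
```

Each head is then trained with its own combined classification and regression loss, as described in the next subsection.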
3.2 Proposed loss function
To penalize our network, we adopt two separate losses, one for each predicted gaze angle (pitch and yaw). Each loss is based on a multi-loss approach that contains classification and regression components. The classification component of our loss function plays a crucial role in determining the bin (interval) to which the gaze angle belongs. By dividing the angle estimation into bins, we can effectively handle the inherent variability of gaze directions. This approach allows us to obtain a preliminary angle estimation by performing a classification task, effectively narrowing the possibilities to a specific interval.
Building upon the classification stage, we further refine the angle estimation through the regression component of our loss function. The regression component enables us to obtain a more precise and accurate estimation within the determined interval. By leveraging the strengths of regression, we iteratively refine the predicted angle, aligning it with the ground truth and ultimately obtaining the final angle estimation. This approach sets our methodology apart from existing gaze estimation techniques, as to the best of our knowledge, no prior work has explored this combined loss function for gaze estimation.
3.2.1 Classification loss
A convenient classification paradigm that suits the gaze estimation problem is multi-class classification. We perform the gaze classification part by utilizing a categorical cross-entropy loss. We use a softmax layer to convert the network bin logits into gaze bin probabilities. The softmax layer can be represented as follows:
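In its standard form, with \(b\) output bins, this is

$$f(x_{i}) = \frac{e^{x_{i}}}{\sum _{j=1}^{b} e^{x_{j}}} \qquad (1)$$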
where \(f(x_{i})\) is the softmax output value, \(x_i\) is the logit of bin \(i\), and \(b\) is the number of output bins. Then, we adopt a cross-entropy loss to compute the error between the output bin probabilities and the gaze targets as follows:
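In its usual form over the \(b\) bins:

$$CE = -\sum _{i=1}^{b} t_{i} \log \big (f(x_{i})\big ) \qquad (2)$$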
where CE is the cross-entropy loss, and \(t_i\) and \(f(x_{i})\) are the target ground truth and the softmax probability value for each bin \(i\) in \(b\), respectively. In our case, we adopt one-hot encoded gaze targets, so only the positive bin \(i_p\) keeps its value in the loss calculation, while the zero-valued elements of the summation are discarded. We can write the final gaze classification loss \(\mathcal {L}_{cls}\) as follows:
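That is:

$$\mathcal {L}_{cls} = -\log \big (f(x_{i_p})\big ) \qquad (3)$$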
where \(f(x_{i_p})\) is the output probability for the positive bin \(i_p\).
3.2.2 Regression loss
We apply the regression loss to the continuous angle values to obtain fine-grained predictions with supervision. To obtain continuous angle values, we employ a two-step process. Firstly, we apply a nonlinear softmax layer to the output bin logits from the fully connected layers. This yields probabilities for each bin, indicating the likelihood of the gaze angle falling within that bin. Secondly, we calculate the expected value by multiplying each bin index by its corresponding bin probability and then sum up the products across all the bins as follows:
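Formally:

$$E = \sum _{i=1}^{b} i \cdot p_{i} \qquad (4)$$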
where \(p_i\) is the probability of the \(i\)-th bin and \(b\) is the number of bins. However, the range of the expected value is [1, b], which does not directly correspond to the desired angle ranges of [-140, 140], [-40, 40], and [-20, 20] for Gaze360, RT-Gene, and MPIIFaceGaze, respectively. To tackle this issue, we add an adjustment to (4) as follows:
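Following this description, the mapping can be written as:

$$\theta _{p} = w \sum _{i=1}^{b} p_{i}\left (i - \frac{1+b}{2}\right ) \qquad (5)$$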
where \(\theta _{p}\) is the continuous gaze angle and \(w\) is the bin width (3\(^\circ \) in our case). The purpose of the added term \(p_i(i-(1+b)/2)\) is to map values from the bin space directly to the angle space. By subtracting approximately half of the total number of bins \(((1+b)/2)\) from the bin index \(i\), we align the bin space with the corresponding angle space. Finally, we apply a regression loss to penalize the continuous gaze predictions. We test two regression losses that are commonly used in gaze estimation: the mean squared error (MSE), which is defined as:
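In its usual form:

$$MSE = \frac{1}{n}\sum _{i=1}^{n}\left (\theta _{t_i} - \theta _{p_i}\right )^{2} \qquad (6)$$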
and the mean absolute error (MAE), which is defined as:
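Correspondingly:

$$MAE = \frac{1}{n}\sum _{i=1}^{n}\left |\theta _{t_i} - \theta _{p_i}\right | \qquad (7)$$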
where \(n\) is the number of images, and \(\theta _{t_i}\) and \(\theta _{p_i}\) are the target and predicted angle values for sample \(i\), respectively. Finally, we can write the final gaze regression loss \(\mathcal {L}_{reg}\) using MAE as follows:
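That is, the regression term is the MAE computed over the decoded continuous angles:

$$\mathcal {L}_{reg} = MAE\left (\theta _{t}, \theta _{p}\right ) \qquad (8)$$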
3.2.3 Regression classification loss
Our proposed combined regression and classification loss (RCS-Loss) for each gaze angle (pitch and yaw) is a linear combination of the regression loss and the classification loss, which is defined as:
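In symbols:

$$\mathcal {L}_{total} = \mathcal {L}_{cls} + \alpha \, \mathcal {L}_{reg} \qquad (9)$$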
where \(\mathcal {L}_{total}\) is the proposed loss function, \(\mathcal {L}_{cls}\) is the classification loss, \(\mathcal {L}_{reg}\) is the regression loss, and \(\alpha \) is the regression coefficient. In the experiments in Section 4, we empirically set \(\alpha =1\), which yields the best gaze performance.
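Putting the pieces together, a minimal PyTorch sketch of the RCS-Loss for a single gaze angle could look as follows; the function name, default constants, and bin-decoding details are illustrative assumptions consistent with the loss described above, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def rcs_loss(bin_logits: torch.Tensor,   # (N, num_bins) logits from one FC head
             bin_label: torch.Tensor,    # (N,) ground-truth bin index
             angle_label: torch.Tensor,  # (N,) continuous ground-truth angle in degrees
             num_bins: int = 90,
             bin_width: float = 3.0,
             alpha: float = 1.0) -> torch.Tensor:
    """Combined classification + regression loss for one gaze angle (pitch or yaw)."""
    # Classification term: softmax + cross-entropy over the angle bins.
    cls_loss = F.cross_entropy(bin_logits, bin_label)

    # Regression term: decode a continuous angle as the shifted, scaled
    # expectation over the bin probabilities, then apply MAE.
    probs = F.softmax(bin_logits, dim=1)
    idx = torch.arange(1, num_bins + 1, dtype=probs.dtype, device=probs.device)
    pred_angle = bin_width * (probs * (idx - (1 + num_bins) / 2)).sum(dim=1)
    reg_loss = F.l1_loss(pred_angle, angle_label)

    return cls_loss + alpha * reg_loss
```

The total training objective is this loss computed separately for the pitch head and the yaw head.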
4 Experimental results
4.1 Datasets
With the development of appearance-based gaze estimation methods, large-scale datasets have been proposed to improve gaze estimation performance. These datasets were collected with different procedures, varying from laboratory-constrained settings to unconstrained indoor and outdoor environments. To get a valuable evaluation of our network, we train and evaluate our model using three publicly available gaze datasets, including MPIIFaceGaze [34], Gaze360 [29] and RT-GENE [18]. The key properties and visualization of the three datasets are shown in Table 1 and Fig. 2.
MPIIFaceGaze
MPIIFaceGaze [34] provides 213,659 images from 15 subjects captured during their daily routine over several months. Consequently, it contains images with diverse backgrounds and lighting conditions, making it suitable for unconstrained gaze estimation. It was collected using software that asks the participants to look at randomly moving dots on their laptops. As the dataset covers different laptop models with different screen resolutions, the on-screen gaze positions are converted to 3D vectors in the camera coordinate system. The original evaluation subset contains 45,000 images in total, with 3,000 samples for each subject. The common evaluation protocol for this dataset is leave-one-subject-out 15-fold cross-validation [33, 34, 49].
Gaze360
Gaze360 [29] provides the widest range of head pose and gaze angles. It contains 172K images from 238 subjects of different ages, genders, and ethnicities. The images were captured using a Ladybug multi-camera system in a variety of indoor and outdoor environmental settings, such as different lighting conditions and backgrounds, which makes the dataset well suited for unconstrained gaze estimation. The common evaluation protocol divides the Gaze360 dataset into training, testing, and validation sets.
RT-Gene
RT-Gene [18] consists of 122,531 samples of 15 subjects wearing eye-tracking glasses. The dataset was created with subjects located 0.5 to 2.9 meters from the camera, which differs from the MPIIFaceGaze setup, where participants sit near their laptops. The dataset was captured indoors with high variation in gaze and head pose angles. Semantic inpainting was used to replace the eye-tracking glasses with skin texture. The dataset is divided into two subsets: 13 subjects for training and two for validation. The common evaluation protocol is leave-one-subject-out 3-fold cross-validation using the training subset [33, 49].
4.2 Data preprocessing
We follow the same procedures as in [57] to normalize the dataset images. In summary, this process applies rotation and translation to a virtual camera to remove the head’s roll angle and keep a fixed distance between the virtual camera and a reference point (the center of the face). To adapt the data to our proposed loss function, we create new bin labels for all dataset images used in our evaluation. Specifically, we convert the continuous gaze targets for each angle (pitch and yaw) into one-hot encoded bin labels based on the range of the dataset gaze annotations. We divide Gaze360 [29] into 90 bins with a bin width of 3 degrees to cover the full range of [-140\(^\circ \), 140\(^\circ \)]. RT-Gene [18] and MPIIFaceGaze [34] are divided into 30 bins and 14 bins, respectively, with the same 3-degree bin width, covering the ranges of [-40\(^\circ \), 40\(^\circ \)] and [-20\(^\circ \), 20\(^\circ \)], respectively. Each dataset thus includes two types of target annotations, continuous and binned labels, which makes them well suited for our combined regression and classification losses.
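As an illustration, the conversion from a continuous angle to a bin label can be sketched as below; the function name and the clipping of out-of-range samples into the edge bins are our own assumptions.

```python
import numpy as np

def angle_to_bin_label(angle_deg: float, num_bins: int, bin_width: float = 3.0):
    """Map a continuous gaze angle (degrees) to a bin index and a one-hot label.

    num_bins follows the paper: 90 for Gaze360, 30 for RT-Gene, 14 for MPIIFaceGaze,
    each bin spanning 3 degrees and centred on the dataset's angle range.
    """
    half_span = num_bins * bin_width / 2.0                    # e.g. 135 deg for 90 bins
    idx = int(np.floor((angle_deg + half_span) / bin_width))  # nominal bin index
    idx = int(np.clip(idx, 0, num_bins - 1))                  # keep boundary samples in edge bins
    one_hot = np.zeros(num_bins, dtype=np.float32)
    one_hot[idx] = 1.0
    return idx, one_hot
```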
4.3 Training configuration
We adopt the ImageNet-1K [60] dataset for pre-training our network and use EfficientNetV2-S as the CNN backbone. Our proposed network was trained in the PyTorch framework using the Adam optimizer with a learning rate of \(1e^{-5}\). We train the network for 50 epochs with a batch size of 16. We evaluate our proposed network on the MPIIFaceGaze, Gaze360 and RT-GENE datasets using the evaluation procedures described in Section 4.1.
4.4 Performance measures
We utilize the gaze angular error (°) as the evaluation metric, following most gaze estimation methods. Assuming the ground-truth gaze direction is \(\textbf{g} \in \mathbb {R}^{3}\) and the predicted gaze vector is \(\hat{\textbf{g}} \in \mathbb {R}^{3}\), the gaze angular error (°) is computed by:
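In its standard form, this metric is the angle between the two gaze vectors, expressed in degrees:

$$\text {angular error} = \frac{180}{\pi }\arccos \left (\frac{\textbf{g} \cdot \hat{\textbf{g}}}{\Vert \textbf{g}\Vert \, \Vert \hat{\textbf{g}}\Vert }\right ) \qquad (10)$$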
4.5 Comparison with the state-of-the-art methods
Table 2 presents a comparison of the mean angular error between our proposed network and state-of-the-art methods on the MPIIFaceGaze, Gaze360, and RT-GENE datasets. The table groups methods into two categories based on their input streams. The first category encompasses methods that utilize both face and eye images as input to their models, such as Dilated-Net [19]. The second category comprises approaches that employ only face images as the input stream (e.g., Gaze360 [29]). Furthermore, the table covers various architectures, ranging from CNNs with standard backbones such as Alex-Net, VGG16-Net, and ResNet-50, to specially designed CNNs, and finally to transformers with self-attention. Notably, SAtten-Net, which incorporates a self-attention module, outperforms the CNN-based approaches. As the table shows, our proposed network beats the self-attention network and achieves state-of-the-art gaze performance, with mean angular errors of 3.86\(^\circ \), 10.12\(^\circ \), and 6.50\(^\circ \) on the MPIIFaceGaze, Gaze360, and RT-GENE datasets, respectively.
To provide a comprehensive evaluation of the proposed method in terms of computational efficiency, we measure the computational cost of the proposed network on a system with a single RTX-3080 GPU. Table 3 compares the processing time, parameter count, and FLOPs of state-of-the-art gaze estimation methods. Each network was evaluated on the same target platform 10 times to obtain reliable results. The table clearly indicates that methods utilizing face and eye images (RT-Gene, Fare-Net, and AGE-Net) as input incur high computational costs, whereas face-based gaze methods (ETH-Gaze, Gaze360 and ours) consume fewer computational resources. Interestingly, our proposed method exhibits the lowest FLOPs and processing time among the compared methods. Moreover, despite having a higher number of parameters than the Gaze360 method (a 7-frame LSTM network), our proposed approach achieves approximately a 77% reduction in FLOPs and a 4% decrease in processing time. Additionally, it demonstrates an approximately 7% improvement in gaze performance compared to Gaze360. This indicates that our method achieves an effective trade-off between model complexity and computational efficiency.
4.6 Visualization
We visualize qualitative results of our model on various images from the three datasets, as shown in Fig. 4. As the figure illustrates, our method is able to generate accurate gaze directions under various poses and lighting conditions. To further analyze the effectiveness of our proposed approach, we visualize the Class Activation Maps [61] of both a baseline gaze estimation method (consisting of one fully connected layer and an MAE regression loss) and our proposed method (comprising two fully connected layers and a multi-loss approach) in Fig. 3. The visualizations clearly demonstrate that our method predicts gaze based on highly distinctive local regions while effectively utilizing global features for both gaze angles. This observation verifies that our approach learns distinct subspaces for each gaze angle by employing separate fully connected layers. Additionally, the multi-loss approach assigns higher importance to the eye regions, which are crucial for capturing the most significant gaze information.
4.7 Ablation study
We conduct an ablation study to confirm the validity of our proposed model in improving gaze estimation performance. We conduct experiments on MPIIFaceGaze, Gaze360, and RT-GENE datasets to evaluate the reliability and generalization of our proposed approach.
4.7.1 Loss function
To evaluate the impact of our proposed RCS-Loss on the overall gaze performance, we conduct an experiment using different networks that share the same CNN backbone (EfficientNetV2) but employ different loss functions. Initially, we create two networks utilizing the conventional MAE and MSE loss functions, respectively. Subsequently, we introduce a third network that incorporates the cross-entropy (CE) loss without regression. We then compare the gaze performance of these networks against our proposed RCS-Loss. To ensure a fair comparison, when the conventional network employs MAE as its loss function, we incorporate MAE as the regression component in our RCS-Loss; the same principle applies when using the MSE loss.
Additionally, we utilize different hyperparameters to determine the optimal settings for achieving the highest gaze performance across the three networks. For instance, we found that the best gaze performance for conventional networks (using MAE and MSE losses) was obtained with a learning rate of \(1e^{-4}\). However, for our proposed network, a learning rate of \(1e^{-5}\) yielded the highest performance.
Table 4 presents a comparison between adopting the conventional loss functions and applying our proposed RCS-Loss. The results demonstrate an improvement of approximately 1\(^\circ \) in gaze performance when adopting our proposed loss function, indicating its effectiveness compared to conventional losses. Additionally, incorporating MAE as the regression loss within our RCS-Loss yielded the highest gaze performance among the tested configurations.
4.7.2 Network architecture
To investigate the impact of our proposed network architecture, we examine the influence of predicting gaze angles using two separate fully connected layers on the overall gaze estimation performance. We introduce a network variant, referred to as Conventional-1F, which shares the same backbone and loss function as our proposed model but utilizes only one fully connected layer for predicting both gaze angles. In Table 5, we compare our proposed network, which incorporates two fully connected layers, with the Conventional-1F network. The results demonstrate that our proposed network improves the gaze performance on all three datasets compared to the Conventional-1F network. This finding highlights the advantage of learning the unique features associated with each gaze angle, ultimately leading to improved accuracy in gaze estimation.
4.7.3 Backbone
We analyze the impact of replacing the EfficientNet backbone in our method with two popular backbones used by previous methods, namely ResNet-18 [29] and ResNet-50 [28]. The results in Tables 4 and 5 already demonstrate the superiority of our proposed approach using the same backbone. Nevertheless, we want to evaluate our method with the backbones adopted in other approaches. In Table 6, we compare our previous results with our proposed network trained with ResNet-18 and ResNet-50 as backbones. Remarkably, our model trained on the approximately 50% smaller ResNet-18 backbone still achieves better gaze accuracy on the three datasets than all other methods in Table 2. This confirms that our model’s overall performance is mainly attributable to our proposed network architecture and loss function rather than the choice of backbone. Moreover, the model trained with ResNet-50 improves the gaze performance compared with the results reported by ETH-Net [28], which uses the same backbone. Lastly, our method with EfficientNet outperforms the ResNet-50 variant in both performance and computational cost, despite having the same number of parameters (see Table 3).
5 Analysis
In this section, we provide an analysis of our proposed network with respect to individual subject evaluation and the number of gaze bins for the classification part of the proposed loss function.
5.1 Subject-wise analysis on MPIIGaze
While the previous experiments focus on the average performance across all subjects, here we explore the robustness of our model toward individual subjects in the MPIIFaceGaze dataset, which contains 15 subjects. We analyze the gaze performance of our proposed method on each subject and compare it with FARE-Net [20], which also reports subject-wise gaze performance. Out of 15 subjects, our proposed method achieves better gaze accuracy for 11 subjects, as shown in Fig. 6. This finding highlights the effectiveness and robustness of our approach across individual subjects in the MPIIFaceGaze dataset.
5.2 Effect of classification bins
We performed additional experiments to investigate the impact of varying the bin width on the overall gaze performance across the three datasets: MPIIFaceGaze, Gaze360, and RT-Gene. We empirically adjusted the bin width from 1 to 10 degrees and evaluated the corresponding gaze performance, as depicted in Fig. 5. For the MPIIFaceGaze and RT-Gene datasets, we observe that larger bin widths led to increased angular errors, whereas smaller bin widths yielded slightly lower angular errors. For the Gaze360 dataset, both very small and very large bin widths resulted in slightly higher angular errors. Intriguingly, a 3-degree bin width yielded the lowest angular error across all three datasets.
6 Limitations and future work
Extracting accurate gaze information from face images in unconstrained settings is challenging, since the eyes, which contain the most crucial gaze information, constitute a small part of the image. Nevertheless, our extensive performance assessment has illustrated the effectiveness of our proposed model in producing accurate gaze predictions on several datasets. Appearance-based gaze estimation depends mainly on the holistic appearance of the face and eyes, which limits the maximum pitch angle (vertical gaze direction) in all datasets [62] to \(\pm 80^\circ \). In addition, the maximum yaw angle (horizontal gaze direction) is \(\pm 80^\circ \), except for Gaze360 [29] (\(\pm 140^\circ \)) and ETH-XGaze [28] (\(\pm 120^\circ \)); however, these datasets contain only very few samples beyond \(\pm 80^\circ \).
With the development of new unconstrained datasets, the gaze yaw angle could extend beyond this limit. In that case, our approach could suffer from a common problem with the yaw angle: it over-penalizes predictions whose angular difference exceeds \(180^\circ \), where the gaze directions are actually similar but the regression loss produces extreme values. A future solution to this limitation is to penalize the minimal rotation angle needed rather than the angle difference directly.
7 Conclusion
This paper presents a novel multi-loss CNN-based network to predict gaze direction directly from face images. We propose to predict each gaze angle individually with a separate fully connected layer to capture the distinct features of each angle subspace. To obtain an accurate gaze direction, we employ a multi-loss function for each gaze angle that combines regression and classification losses. This joint optimization of classification and regression objectives enables the extraction of more informative gaze features, which helps improve gaze performance. We use a softmax layer along with a cross-entropy loss to obtain coarse gaze predictions, and we accomplish gaze regression by calculating the expectation of the gaze class probabilities followed by a gaze regression loss. To show the robustness of our model, we validate our network on three popular unconstrained gaze datasets: MPIIFaceGaze, Gaze360, and RT-GENE. Our model achieves state-of-the-art gaze performance with the lowest angular error on all datasets.
References
Hempel T, Al-Hamadi A (2020) Slam-based multistate tracking system for mobile human-robot interaction. In: International conference on image analysis and recognition, pp 368–376. Springer
Strazdas D, Hintz J, Khalifa A, Abdelrahman AA, Hempel T, Al-Hamadi A (2022) Robot system assistant (rosa): Towards intuitive multi-modal and multi-device human-robot interaction. Sensors 22(3):923
Abdelrahman AA, Strazdas D, Khalifa A, Hintz J, Hempel T, Al-Hamadi A (2022) Multi-modal engagement prediction in multi-person human-robot interaction. IEEE Access
Hu Z, Lv C, Hang P, Huang C, Xing Y (2021) Data-driven estimation of driver attention using calibration-free eye gaze and scene features. IEEE Trans Ind Electron 69(2):1800–1808
Vora S, Rangesh A, Trivedi MM (2018) Driver gaze zone estimation using convolutional neural networks: A general framework and ablative analysis. IEEE Trans Intell Veh 3(3):254–265
Ghosh S, Dhall A, Sharma G, Gupta S, Sebe N (2021) Speak2label: Using domain knowledge for creating a large scale driver gaze zone estimation dataset. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 2896–2905
Palazzi A, Abati D, Solera F, Cucchiara R et al (2018) Predicting the driver’s focus of attention: the dr (eye) ve project. IEEE Trans Pattern Anal Mach Intell 41(7):1720–1733
Abbasi JA, Mullins D, Ringelstein N, Reilhac P, Jones E, Glavin M (2021) An analysis of driver gaze behaviour at roundabouts. IEEE Trans Intell Transp Syst 23(7):8715–8724
Wang Z, Zhao Y, Lu F (2022) Gaze-vergence-controlled see-through vision in augmented reality. IEEE Trans Vis Comput Graph 28(11):3843–3853
Clay V, König P, Koenig S (2019) Eye tracking in virtual reality. J Eye Mov Res 12(1)
Yang Y, Wei H, Zhu H, Yu D, Xiong H, Yang J (2022) Exploiting cross-modal prediction and relation consistency for semisupervised image captioning. IEEE Trans Cybern
Mo Y, Wu Y, Yang X, Liu F, Liao Y (2022) Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 493:626–646
Oza P, Sindagi VA, Sharmini VV, Patel VM (2023) Unsupervised domain adaptation of object detectors: A survey. IEEE Trans Pattern Anal Mach Intell
Zhang Q, Xu Y, Zhang J, Tao D (2023) Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. Int J Comput Vis 1–22
Yang Y, Zhan D-C, Wu Y-F, Liu Z-B, Xiong H, Jiang Y (2019) Semi-supervised multi-modal clustering and classification with incomplete modalities. IEEE Trans Knowl Data Eng 33(2):682–695
Krafka K, Khosla A, Kellnhofer P, Kannan H, Bhandarkar S, Matusik W, Torralba A (2016) Eye tracking for everyone. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2176–2184
Bao J, Liu B, Yu J (2022) An individual-difference-aware model for cross-person gaze estimation. IEEE Trans Image Process 31:3322–3333
Fischer T, Chang HJ, Demiris Y (2018) Rt-gene: Real-time eye gaze estimation in natural environments. In: Proceedings of the european conference on computer vision (ECCV), pp 334–352
Chen Z, Shi BE (2018) Appearance-based gaze estimation using dilated-convolutions. In: Asian conference on computer vision, pp 309–324. Springer
Cheng Y, Zhang X, Lu F, Sato Y (2020) Gaze estimation by exploring two-eye asymmetry. IEEE Trans Image Process 29:5259–5272
Park S, Spurr A, Hilliges O (2018) Deep pictorial gaze estimation. In: Proceedings of the European conference on computer vision (ECCV), pp 721–738
Krafka K, Khosla A, Kellnhofer P, Kannan H, Bhandarkar S, Matusik W, Torralba A (2016) Eye tracking for everyone. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2176–2184
Zhang Y, Yang X, Ma Z (2020) Driver’s gaze zone estimation method: A four-channel convolutional neural network model. In: 2020 2nd International conference on big-data service and intelligent computation, pp 20–24
Wang Z, Zhao J, Lu C, Yang F, Huang H, Guo Y, et al (2020) Learning to detect head movement in unconstrained remote gaze estimation in the wild. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3443–3452
Yu Z, Huang X, Zhang X, Shen H, Li Q, Deng W, Tang J, Yang Y, Ye J (2020) A multi-modal approach for driver gaze prediction to remove identity bias. In: Proceedings of the 2020 international conference on multimodal interaction, pp 768–776
Lian D, Hu L, Luo W, Xu Y, Duan L, Yu J, Gao S (2018) Multiview multitask gaze estimation with deep convolutional neural networks. IEEE Trans Neural Netw Learn Syst 30(10):3010–3023
Zhang X, Sugano Y, Fritz M, Bulling A (2017) It’s written all over your face: Full-face appearance-based gaze estimation. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference On, pp 2299–2308. IEEE
Zhang X, Park S, Beeler T, Bradley D, Tang S, Hilliges O (2020) Eth-xgaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In: European conference on computer vision, pp 365–381. Springer
Kellnhofer P, Recasens A, Stent S, Matusik W, Torralba A (2019) Gaze360: Physically unconstrained gaze estimation in the wild. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6912–6921
Lu F, Sugano Y, Okabe T, Sato Y (2015) Gaze estimation from eye appearance: A head pose-free method via eye image synthesis. IEEE Trans Image Process 24(11):3680–3693
Yamazoe H, Utsumi A, Yonezawa T, Abe S (2008) Remote gaze estimation with a single camera based on facial-feature tracking without special calibration actions. In: Proceedings of the 2008 symposium on eye tracking research & applications, pp 245–250
Chen J, Ji Q (2008) 3d gaze estimation with a single camera without ir illumination. In: 2008 19th International conference on pattern recognition, pp 1–4. IEEE
Biswas P, et al (2021) Appearance-based gaze estimation using attention and difference mechanism. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3143–3152
Zhang X, Sugano Y, Fritz M, Bulling A (2017) Mpiigaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Trans Pattern Anal Mach Intell 41(1):162–175
Cheng Y, Lu F (2021) Gaze estimation using transformer. arXiv:2105.14424
Tan K-H, Kriegman DJ, Ahuja N (2002) Appearance-based eye gaze estimation. In: Sixth IEEE workshop on applications of computer vision (WACV 2002), pp 191–195. IEEE
Lu F, Sugano Y, Okabe T, Sato Y (2014) Adaptive linear regression for appearance-based gaze estimation. IEEE Trans Pattern Anal Mach Intell 36(10):2033–2046
Williams O, Blake A, Cipolla R (2006) Sparse and semi-supervised visual mapping with the S\(^3\)GP. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol 1, pp 230–237. IEEE
Hu Z, Li S, Zhang C, Yi K, Wang G, Manocha D (2020) Dgaze: Cnn-based gaze prediction in dynamic scenes. IEEE Trans Vis Comput Graph 26(5):1902–1911
Lemley J, Kar A, Drimbarean A, Corcoran P (2019) Convolutional neural network implementation for eye-gaze estimation on low-quality consumer imaging systems. IEEE T Consum Electr 65(2):179–187
Wang W, Shen J, Dong X, Borji A, Yang R (2019) Inferring salient objects from human fixations. IEEE Trans Pattern Anal Mach Intell 42(8):1913–1927
Wang Q, Wang H, Dang R-C, Zhu G-P, Pi H-F, Shic F, Hu B-l (2022) Style transformed synthetic images for real world gaze estimation by using residual neural network with embedded personal identities. Appl Intell 1–16
Park S, Mello SD, Molchanov P, Iqbal U, Hilliges O, Kautz J (2019) Few-shot adaptive gaze estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9368–9377
Yu Y, Liu G, Odobez J-M (2019) Improving few-shot user-specific gaze adaptation via gaze redirection synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11937–11946
Liu G, Yu Y, Mora KAF, Odobez J-M (2019) A differential approach for gaze estimation. IEEE Trans Pattern Anal Mach Intell 43(3):1092–1099
Shrivastava A, Pfister T, Tuzel O, Susskind J, Wang W, Webb R (2017) Learning from simulated and unsupervised images through adversarial training. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2107–2116
Xiong Y, Kim HJ, Singh V (2019) Mixed effects neural networks (menets) with applications to gaze estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7743–7752
Wang K, Zhao R, Su H, Ji Q (2019) Generalizing eye tracking with bayesian adversarial learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11907–11916
Cheng Y, Huang S, Wang F, Qian C, Lu F (2020) A coarse-to-fine adaptive network for appearance-based gaze estimation. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 10623–10630
Wu Y, Li G, Liu Z, Huang M, Wang Y (2022) Gaze estimation via modulation-based adaptive network with auxiliary self-learning. IEEE Trans Circ Syst Vid Technol 32(8):5510–5520
Zhou J, Li G, Shi F, Guo X, Wan P, Wang M (2023) Em-gaze: eye context correlation and metric learning for gaze estimation. Vis Comput Ind Biomed Art 6(1):8
Yun J-S, Na Y, Kim HH, Kim H-I, Yoo SB (2022) Haze-net: High-frequency attentive super-resolved gaze estimation in low-resolution face images. In: Proceedings of the Asian conference on computer vision, pp 3361–3378
Zhu Z, Zhang D, Chi C, Li M, Lee D-J (2022) A complementary dual-branch network for appearance-based gaze estimation from low-resolution facial image. IEEE Trans Cognit Dev Syst
Chen Z, Shi BE (2022) Towards high performance low complexity calibration in appearance based gaze estimation. IEEE Trans Pattern Anal Mach Intell 45(1):1174–1188
Oh J, Chang HJ, Choi S-I (2022) Self-attention with convolution and deconvolution for efficient eye gaze estimation from a full face image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4992–5000
Tan M, Le Q (2021) Efficientnetv2: Smaller models and faster training. In: International conference on machine learning, pp 10096–10106. PMLR
Zhang X, Sugano Y, Bulling A (2018) Revisiting data normalization for appearance-based gaze estimation. In: Proceedings of the 2018 ACM symposium on eye tracking research & applications, pp 1–9
Murthy L, Brahmbhatt S, Arjun S, Biswas P (2021) I2DNet - design and real-time evaluation of appearance-based gaze estimation system. J Eye Mov Res 14(4)
Ghosh S, Hayat M, Dhall A, Knibbe J (2022) Mtgls: Multi-task gaze estimation with limited supervision. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3223–3234
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al (2015) Imagenet large scale visual recognition challenge. Int J Comput Vis 115(3):211–252
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2921–2929
Ghosh S, Dhall A, Hayat M, Knibbe J, Ji Q (2021) Automatic gaze analysis: A survey of deep learning based approaches. arXiv:2108.05479
Funding
Open Access funding enabled and organized by Projekt DEAL. This work is funded and supported by the Federal Ministry of Education and Research of Germany (BMBF) (AutoKoWAT-3DMAt under grant Nr. 13N16336) and German Research Foundation (DFG) under grants Al 638/13-1, Al 638/14-1 and Al 638/15-1.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Abdelrahman, A.A., Hempel, T., Khalifa, A. et al. Fine-grained gaze estimation based on the combination of regression and classification losses. Appl Intell 54, 10982–10994 (2024). https://doi.org/10.1007/s10489-024-05778-3