2.2. Cloud Masking
Cloud masking of scenes that include sea ice with high accuracy and coverage remains a substantial open problem for imagery from almost all EO satellites, including the MISR instrument. Production-level cloud masks from this mission applicable to the cryosphere are of two types: the Stereoscopically Derived Cloud Mask (SDCM) and the Angular Signature Cloud Mask (ASCM). The SVM scene classifier currently carries a provisional-quality designation and was used for comparative illustration only.
The SDCM (Figure 2a) identifies clouds using an algorithm that measures parallax via camera-to-camera image matching across a combination of viewing angles [37], implemented to retrieve the altitude of the radiance-reflecting layer; pixels are thus denoted as either having originated from the surface of the ellipsoid (cloud-free) or from a higher altitude (each at high or low confidence). The limitations of this method mean that clouds are only retrieved at altitudes greater than 560 m [38], since smaller displacements correspond to sub-pixel resolution, and only clouds with sufficient optical thickness to present persistent features to a stereo-matching algorithm are detected. Two SDCM products are currently operational, based on differing stereo-matching algorithms for retrieval of stereo vectors [39]. In this project, we refer to the more recent SDCM contained in MIL2TCCL (TC_CLOUD), except where explicitly referenced. Comparisons with CERES determined that optimal performance of the SDCM occurs when no retrieval is treated as clear [40].
The ASCM (Figure 2b) implements a band-differenced angular signature (BDAS) between the blue and red bands (over sea ice) as a function of viewing angle, comparing against threshold values to determine cloud classification. Pixels are thus denoted as either cloud or not cloud (each at high or low confidence). This mask performs well with high-altitude clouds, since the relative contribution of Rayleigh scattering exhibits a large differential between these two bands and decreases with the lower path radiance associated with higher clouds. By comparison with expertly labelled scenes in the Arctic, the performance of the ASCM is assessed at 76% accuracy with 80% coverage [42].
Another of the suite of instruments on board the Terra platform is MODIS. The MODIS instrument contains more spectral bands than the MISR sensor (36 compared to four) over a larger swath width of 2330 km, but with only a single nadir scan angle. As MISR and MODIS (Terra) are both on board the same platform, there is near-simultaneous acquisition between the two sensors, and so MODIS (Terra) cloud masks are valid for MISR scenes and vice versa, subject to swath-width constraints. The MOD35 cloud mask (Figure 2f) contains 48 bits to allow the mask to be configured for a variety of end-use purposes; the displayed mask comprises high-confidence clear pixels where thin cirrus cloud has not been detected. From MOD35-CALIPSO hit rates, MOD35 has been demonstrated to have an accuracy in excess of 90% for the region of interest [43]. Additionally shown is MOD29 (Figure 2d), a sea-ice surface temperature and extent product. In MOD29, pixels determined ‘certain-cloud’ by MOD35 are removed, and the presence of sea ice is determined using a combination of NDSI, band thresholding, and a land mask from MOD03 (a geolocation product).
Visual inspection of multiple scenes demonstrated that MOD29 and the aforementioned implementation of MOD35 are broadly consistent, except in darker conditions, as the NDSI is only determined under good light conditions. Both MODIS cloud masks can exhibit along-track striation artefacts propagated from insufficient-quality L1B radiance retrievals. In this case, ‘no decision’ is returned, and in this study we treat such pixels as not cloud-free.
We also implemented a geometry-based cloud-shadow mask proposed by [41], as similarly modified by [44], defined in Equation (2), requiring four data fields from MODIS L1B calibrated radiances (MOD021KM) (Figure 2e).
The mask applied to each MISR scene is given by the combination of the MOD29 mask (Figure 2d) with the derived cloud-shadow mask (Figure 2e). For MISR scenes used to construct the training dataset, any positive retrieval of cloud in the SDCM or ASCM (at either high or low confidence) is also considered as cloud. This differential tolerance to cloud detection ensures that the training dataset is as free from noise (misclassified cloud pixels) as practically possible without unnecessarily constraining valid output retrievals.
2.3. Regression Modelling
An overview of our methodology is outlined in Figure 3. To construct the regression model, we generated a dataset of angular-reflectance signature components and contextual features paired with ground-truth measurements of surface roughness from IceBridge, where a valid GIS intersection between a MISR swath and an IceBridge path is defined as spatially coincident reconnaissance occurring no more than 36 h apart. As MISR and the IceBridge ATM are both optical sensors, the roughness we are quantifying and correlating corresponds to the air–snow interface.
IceBridge products pertaining to roughness are available at three product levels: Level 1B Elevation and Return Strength (ILATM1B), consisting of high-resolution individual elevation shots from the ATM lidar scanner; Level 2 Icessn Elevation, Slope, and Roughness (ILATM2), where elevation measurements from ILATM1B are aggregated to overlapping platelets of approximately 80 m × 250 m (variable with reconnaissance altitude) with a three-platelet running mean for smoothing; and Level 4 Sea Ice Freeboard, Snow Depth, and Thickness (IDCSI4), a 40 m gridded product. As the footprints of the overlapping platelets in ILATM2 are smaller than that of MISR, and IDCSI4 is experimental after 2013, we derive probability distribution functions (PDFs) of elevations within coincident 1.1 km MISR pixels from ILATM1B, as shown in Figure 4, and take the standard deviation (σ) as our roughness proxy. The criterion used to determine the elevation shots within the footprint of a given MISR pixel is a Euclidean distance of no more than half the pixel resolution (550 m) multiplied by √2 from the centre of the MISR observation, given all inputs have been projected to the aforementioned equal-area projection.
The standard deviation of these elevation PDFs yields spatially coherent roughness measurements for training purposes. We then defined an empirically derived minimum threshold of 30,000 elevation measurements within a 1.1 km MISR footprint for the derived roughness to be valid, as some intersections occur obliquely and, thus, are not representative. NSIDC Polar Pathfinder Daily Sea Ice Motion Vectors (NSIDC-0116) are used in conjunction with the reconnaissance interval to filter instances where the magnitude of the coarse-resolution (25 km) displacement exceeds half a pixel. Where the ice motion vector is not retrieved, no filtering is applied, as these regions correspond to inhomogeneous surface types (ice and land) and are more likely to contain fast ice. After filtering, 19,367 training instances are available to construct the regression model.
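The roughness-proxy derivation can be sketched as follows (a hypothetical implementation; function and variable names are illustrative, and the distance criterion and shot threshold follow the text above):

```python
import numpy as np

MISR_RES = 1100.0                         # MISR pixel resolution, m
MAX_DIST = (MISR_RES / 2) * np.sqrt(2)    # half-resolution x sqrt(2), m
MIN_SHOTS = 30_000                        # empirically derived validity threshold

def pixel_roughness(shot_xy, shot_z, pixel_centre):
    """Std. dev. of ATM elevation shots within one MISR pixel footprint.

    shot_xy: (N, 2) projected shot coordinates (equal-area projection, m)
    shot_z:  (N,) elevations (m)
    Returns np.nan when fewer than MIN_SHOTS shots fall inside the footprint,
    e.g. for oblique, unrepresentative intersections.
    """
    d = np.linalg.norm(shot_xy - np.asarray(pixel_centre), axis=1)
    inside = d <= MAX_DIST
    if inside.sum() < MIN_SHOTS:
        return np.nan
    return float(np.std(shot_z[inside]))
```

Applied per coincident MISR pixel, this yields the σ roughness proxy used as the regression target.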
Prior to modelling, relative azimuth angles were calculated from solar and sensor azimuths, and the solar zenith angle was encoded using a cosine function. Input features were scaled to unit variance with removal of the mean, and the output feature (roughness) was scaled via a Box–Cox power transformation towards a normal distribution. Scaling ensures no single feature dominates the minimisation of an objective function in regression modelling. In this study, the metrics used for assessment of model performance are the mean absolute error (MAE) and the coefficient of determination, henceforth referred to as R².
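A minimal sketch of this pre-processing with scikit-learn (synthetic stand-in data; the real features are the MISR/MODIS fields described below). Note that Box–Cox requires strictly positive values, which holds for a standard-deviation roughness proxy:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, PowerTransformer

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))                            # stand-in input features
y = rng.lognormal(mean=-2.0, sigma=0.5, size=(200, 1))   # skewed, positive target

# Inputs: zero mean, unit variance; target: Box-Cox towards normality
x_scaler = StandardScaler().fit(X)
y_scaler = PowerTransformer(method="box-cox").fit(y)

X_scaled = x_scaler.transform(X)
y_scaled = y_scaler.transform(y)
# Model predictions are later mapped back to the original parameter
# space with y_scaler.inverse_transform(...)
```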
The processing workflow is implemented in Python. The MisrToolkit was used for pre-processing of MISR scenes. Machine-learning algorithms and performance assessments utilise scikit-learn, in conjunction with mlxtend for feature selection and Verde for blocked k-fold cross validation.
Support vector machines (SVMs) are a common type of supervised-learning method employed for classification, regression analysis, and anomaly detection. Support vector regression (SVR), first described by [45], is a type of SVM formulated for regression analysis. Support vector machines have been used successfully with MISR data on previous occasions, notably for a scene-classification model [46] (Figure 2c) and, in the case of regression, for the retrieval of leaf area index [47].
When implementing an SVM for classification purposes, the objective is to determine an n-dimensional hyperplane with a surrounding margin that acts as a decision boundary in order to separate classes with a maximal margin, or street width, between them. When implemented for regression purposes, the objective is to determine a similar such hyperplane and surrounding margin containing a maximal number of points, which forms a surface representing a best-fit mapping between the input and output variables.
Let our training dataset be represented by Equation (3), where the input vectors are x_i and the outputs are y_i, for k features and N training instances, given i = 1, …, N.
The objective function for minimisation to implement a soft-margin SVR is the convex optimisation problem given by Equation (4) (subject to Equation (5)) [48], where ‖w‖ is the magnitude of a vector normal to the hyperplane. ε is a tunable hyperparameter which determines the width of the margin surrounding the hyperplane such that, in optimisation, a prediction error smaller than ε suffers no penalty, termed the ε-insensitive loss function. ξ and ξ* are positive slack variables allowing points to lie outside the margin surrounding the hyperplane, thus adding robustness in the case of outliers. Associated with these is C, another tunable hyperparameter (termed the regularization parameter) controlling the trade-off between error and flatness of the resolved hyperplane, which can be thought of as a weighting for the slack variables [49].
φ is a nonlinear transformation used to map x_i to a higher-dimensional space. Instead of computing this transformation directly, the kernel trick (Equation (6)) is implemented to perform the transformation implicitly and reduce computational complexity [49].
We implement a radial basis function (RBF) kernel (also known as a Gaussian kernel) (Equation (7)). This introduces our final hyperparameter, γ, inversely proportional to the variance of the Gaussian function, which determines the sphere of influence of individual training instances.
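As an illustration, a minimal RBF-kernel SVR in scikit-learn (the library used in this study) fitted to synthetic data; the hyperparameter values here are arbitrary, not those selected later:

```python
import numpy as np
from sklearn.svm import SVR

# Toy data: a smooth 2-D function with additive noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=300)

# RBF kernel: k(x, x') = exp(-gamma * ||x - x'||^2)
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
model.fit(X, y)
pred = model.predict(X[:5])
```

Only the training instances lying on or outside the ε-margin become support vectors, which is what keeps the fitted hyperplane sparse.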
One disadvantage of SVR is that model predictions do not come with a probability interval, although we can quantify overall model error through implementation of the MAE via cross validation. Hyperparameters must be manually yet appropriately defined to achieve a good model. The fit-time complexity of SVR is between quadratic and cubic in the number of samples; therefore, SVR rapidly becomes impractical for large numbers of input samples.
We note that, after investigating several regression methodologies early in the development of the project, we quickly converged on the SVR approach as best suited to our processing. Other candidate models tried (not shown) exhibited reduced performance and poorer test–train convergence, indicative of overfitting. In addition, the soft-margin SVR approach gives resistance to noise, such as training pixels incorrectly flagged as clear when cloudy.
2.4. Cross Validation
Cross validation is an important method to quantify the predictive performance of a given model. Such assessments quantify how a model will generalize to new or unseen data and are critical to prevent overfitting. The general principle of this method is to suppress a proportion of the data used to construct the model and to use the suppressed data as an independent dataset to assess model performance. One such method is known as k-fold cross validation, where a dataset is split into a number of classes, known as folds, and for every fold, performance is assessed on a model constructed on the remaining folds. The performance metric from each fold is then averaged to denote an overall performance or cross-validation score. Underlying spatiotemporal correlations caused by a dataset composed of adjacent incidents can introduce bias in cross-validation scores when using k-fold cross validation, resulting in overestimation of model performance [50]. Leave-one-group-out (LOGO) cross validation, where the permutation of the split is defined using categorical data for grouping, was considered as a potential alternative cross-validation scheme, with combinations of the year or the region (using the NSIDC regional mask, see Figure A1) in which an observation occurs considered as the category used for fold aggregation; however, this approach was rejected due to the formation of unbalanced classes, which caused bias in metric aggregation during cross-validation assessment.
Instead, model performance was assessed using blocked k-fold cross validation, a scheme that both respects spatiotemporal correlations and allows for the formation of balanced classes. A 100 km grid is constructed and overlain on the training data such that each incident is assigned membership of a particular block. These blocks are then randomly aggregated to form a number of classes, such that the classes contain roughly equal numbers of training incidents, which are then used to determine the permutation of the split (Figure 5). Overall performance is then derived from the mean of each performance metric as determined over the folds.
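The block-to-fold assignment can be sketched as below (a hypothetical numpy implementation, not the Verde routine used in the study; round-robin assignment of shuffled blocks balances block counts per fold, while exact balancing by sample count is omitted for brevity):

```python
import numpy as np

def blocked_fold_labels(x, y, n_folds=5, block_size=100_000.0, seed=0):
    """Assign each sample a fold label via 100 km spatial blocks.

    x, y: projected sample coordinates in metres (equal-area projection).
    All samples in the same block always share a fold, so spatially
    adjacent incidents never straddle a train/test split.
    """
    bx = np.floor(np.asarray(x) / block_size).astype(int)
    by = np.floor(np.asarray(y) / block_size).astype(int)
    # Unique block ids and the block index of each sample
    blocks, inverse = np.unique(np.stack([bx, by], axis=1), axis=0,
                                return_inverse=True)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(blocks))
    # Round-robin assignment of shuffled blocks to folds
    block_fold = np.empty(len(blocks), dtype=int)
    block_fold[order] = np.arange(len(blocks)) % n_folds
    return block_fold[inverse]
```

Each fold label then defines one held-out class for the blocked k-fold scheme.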
A forward-floating sequential feature-selection scheme was implemented to determine the optimal features to include in the model. The feature subset is iteratively constructed by assessing model performance through cross validation, initially on each single feature, with the optimum feature then selected and additional features included on the same basis. The floating element describes the process whereby previously included features can be excluded or replaced at each round if this is found to improve model performance. Blue bands are excluded from modelling due to uncorrected Rayleigh scattering. To avoid overfitting from contextual features, some features, such as geolocation information and redundant orbit parameters, are not included in the feature-selection process. The features available for modelling were the camera angles Cf, Af, An, Aa, and Ca for the red and NIR bands, solar zenith angle, relative azimuth angle for the aforementioned cameras, and ice-surface temperature (from MODIS). Presented in Figure 6 is the result of this feature-selection process. As hyperparameterisation affects the model implemented for feature selection (γ is particularly sensitive to the number of input features) and, conversely, feature selection impacts the model implemented for hyperparameterisation, these two processes should not be thought of as two distinct sequential steps, but rather as a concurrent process which converges on a combination of features and hyperparameters giving an optimal trade-off between bias and variance. In Figure 6 and Figure 7, we present this final model.
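An illustrative forward selection on toy data is shown below using scikit-learn's `SequentialFeatureSelector`; the study itself uses mlxtend, whose floating variant can additionally drop or swap previously selected features at each round, so plain forward selection is shown here only for brevity:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVR

# Toy data where only features 0 and 3 carry signal (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=150)

sfs = SequentialFeatureSelector(SVR(kernel="rbf"),
                                n_features_to_select=2,
                                direction="forward", cv=3,
                                scoring="neg_mean_absolute_error")
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())  # indices of chosen features
```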
Model performance with increasing features plateaus after approximately eight features, with a maximum at 10 features. The 10 features selected for modelling are Aa, An, Ca (red channel), An, Af, Ca (NIR channel), solar zenith angle, relative azimuth angle for Ca and Cf, and ice-surface temperature.
The chosen support vector regression (SVR) analysis requires optimal tuning of the three hyperparameters, the cost function (C), the insensitive loss (ε), and the kernel parameter (γ), which together determine the shape of the hyperplane. It is, therefore, critical for the development of a successful model that appropriate hyperparameters are selected. Grid-search cross validation describes the process of computing model performance over a comprehensive range of hyperparameters, termed a grid. This process is performed over an initially coarse grid spanning several orders of magnitude (Figure 7) for each hyperparameter. Once the optimal order of magnitude for each hyperparameter is determined from the coarse grid, the process is repeated on a finer grid to determine precisely the optimal combination of hyperparameters. Hyperparameters of C = 0.8, ε = 0.4, and γ = 0.05 were selected.
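The coarse stage of this search can be sketched with scikit-learn's `GridSearchCV` (toy data; the study additionally substitutes blocked spatial folds for the default k-fold splitter shown here):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# Coarse grid spanning several orders of magnitude per hyperparameter
coarse = {"C": [0.01, 0.1, 1, 10],
          "epsilon": [0.01, 0.1, 1],
          "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVR(kernel="rbf"), coarse, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
best = search.best_params_
# A finer grid is then centred on `best` and the search repeated.
```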
Learning curves are common tools for quantifying the bias–variance trade-off in modelling, in order to prevent overfitting and to identify underfitting. The objective is to compare how well a particular model generalises to itself (R² train, or training score) with how well it generalises to unseen data (R² test, or cross-validation score) over an increasing fraction of the input data used for model construction. Convergence occurs at the point where the model has fully generalised to unseen data, indicating that overfitting has been avoided. The model performance at the point of convergence defines the potential extent of underfitting.
The learning curve of our model after selection of appropriate hyperparameters and features as described is presented in Figure 8, illustrating model performance (from blocked k-fold cross validation) for two metrics (R² and MAE) over increasing training fractions in 10% increments of the available data, with shaded errors from aggregation of scores over folds. Whilst the training scores are better (towards 1 for R², towards 0 for MAE) for lower training fractions, the difference between the training and testing performance is sizeable, indicating an over-fit and poor generalisation to new data. As the training fraction increases, the difference between training and testing scores decreases and plateaus at similar values around 12,000 data points for both metrics.
The cross-validated performance of our model using the specified hyperparameters and features is given in Table 3. The permutation of the split for each fold in cross validation is illustrated in Figure 5. MAEs presented are transformed into the original parameter space and exhibit convergence towards the 3 cm error intrinsic to the IceBridge ATM.
Our processing chain was then applied to MISR imagery for April of each year from 2000 to 2020, in order to construct a multi-decadal pan-Arctic surface-roughness archive. In applying our model, we flagged and removed any derived measurement of surface roughness where any feature of the input vector lies outside the range of the trained parameter space, denoted by deviations of ±3σ.
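The out-of-range flagging applied at deployment can be sketched as follows (a hypothetical implementation: predictions are discarded wherever any input feature lies more than 3 standard deviations from the training mean of that feature):

```python
import numpy as np

def in_trained_range(X_new, X_train, n_std=3.0):
    """Boolean mask: True where every feature of a sample lies within
    +/- n_std standard deviations of the training-set mean."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    return np.all(np.abs(X_new - mu) <= n_std * sd, axis=1)

# Toy check: one sample near the training mean, one far outside it
X_train = np.random.default_rng(4).normal(size=(1000, 4))
X_new = np.vstack([np.zeros(4), np.full(4, 10.0)])
valid = in_trained_range(X_new, X_train)
```

Predictions corresponding to `valid == False` would be flagged and removed from the archive.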