3.1. The E-TCN Framework
Formally, the objective in this work is to build a model that receives as input a fixed number of images from a spatial time series and estimates the same number of images shifted by one time step into the future, effectively performing a one-step-ahead prediction. The fixed number of input and output images is a hyperparameter called timesteps. Thus, the input data have size timesteps (N) × width (M) × height (M) × number of channels (c), while the output data have size timesteps (N) × width (M) × height (M) × number of filters (f).
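In other words, each training sample pairs a window of N consecutive images with the same window shifted one step into the future. A toy illustration of this layout, with hypothetical sizes, is:

```python
import numpy as np

# Toy illustration of the input/target layout; N, M and c are hypothetical values.
N, M, c = 4, 64, 1                       # timesteps, width/height, channels
series = np.random.rand(N + 1, M, M, c)  # N + 1 consecutive images of the series

x = series[:N]   # model input:  images at steps 0 .. N-1, shape (N, M, M, c)
y = series[1:]   # model target: the same window shifted one step ahead
print(x.shape, y.shape)  # (4, 64, 64, 1) (4, 64, 64, 1)
```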
A typical approach when dealing with temporal data, i.e., time series, is to employ Long Short-Term Memory (LSTM) networks. The inability of typical LSTM networks to take spatial correlations into account was the inspiration for the ConvLSTM model, which was explicitly designed to capture both the spatial and temporal dependencies of the dataset. Thus, in the LSTM equations ([32]), matrix multiplications are replaced by convolution operations in the input-to-state and state-to-state transitions. The ConvLSTM equations then result as below, where $*$ represents the convolution operator and $\circ$ represents the Hadamard, i.e., element-wise, product:
\[
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right),\\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right),\\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_{t-1} + b_o\right),\\
H_t &= o_t \circ \tanh(C_t),
\end{aligned}
\]
where $X_t$ denotes the input, $H_t$ the hidden state, $C_t$ the cell state, and $i_t$, $f_t$, $o_t$ the input, forget and output gates at time $t$.
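For reference, a minimal tf.keras sketch of such a ConvLSTM baseline is shown below; the filter count, kernel size and input dimensions are illustrative assumptions rather than the exact configuration evaluated in this work:

```python
import tensorflow as tf

N, M, c = 4, 64, 1  # timesteps, width/height, channels (hypothetical values)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N, M, M, c)),
    # Convolutions replace the matrix multiplications of the plain LSTM, so the
    # recurrent state keeps its spatial layout (see the equations above).
    tf.keras.layers.ConvLSTM2D(16, (3, 3), padding="same", return_sequences=True),
    # Map the hidden state back to a single-channel image at every time step.
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Conv2D(1, (3, 3), padding="same")),
])
model.compile(optimizer="adam", loss="mse")
```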
In this work, in order to encode information from both spatial and temporal dimensions, we propose the E-TCN, a deep learning model that combines three different networks: an encoder network, a Temporal Convolutional Network inspired by [28], and a decoder network. A high-level visualization of the proposed E-TCN is presented in Figure 1. Formally, the encoder network receives as input a single image, thus its input data are 3-dimensional, of size width (M) × height (M) × number of channels (c). To insert multiple images, corresponding to different time instances, into the model, i.e., to add the timesteps dimension to the input data, the TimeDistributed layer of the deep learning framework TensorFlow is wrapped around the encoder network.
Figure 1 shows that each of the model's N input images, corresponding to times t_1, …, t_N, respectively, where N represents the hyperparameter timesteps, is passed through an encoder network. To that end, the encoder network consists of three blocks, each block consisting of a 2D convolutional layer followed by a max pooling layer. The role of the max pooling layers is to return the maximum value from a small window of pixels. The encoder network outputs N 3D vectors. Each of these N 3D vectors is passed through a flatten layer which converts them to a set of N 1D vectors, preserving the time dimension.
To help in the exposition of the core ideas of the E-TCN, assume that the input is a single-channel image of M × M pixels, that the 2D convolutional kernels are of size k × k, that the max pooling windows are of size p × p, and that the numbers of filters in the 2D convolutional layers are A, B and C for the first, second and third block, respectively. The dimensions of the encoder's input are M × M × 1. Thus, the output of the first convolutional layer has dimensions (M − k + 1) × (M − k + 1) × A (for valid padding). The output of the first max pooling layer is thus M_1 × M_1 × A, with M_1 = ⌊(M − k + 1)/p⌋; if needed, the number of pixels is rounded down to the nearest integer. The output of the second convolutional layer is (M_1 − k + 1) × (M_1 − k + 1) × B and the output of the second max pooling layer is M_2 × M_2 × B, with M_2 = ⌊(M_1 − k + 1)/p⌋. The output of the third convolutional layer is (M_2 − k + 1) × (M_2 − k + 1) × C, which is also the final output of the encoder network.
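Under these assumptions, a tf.keras sketch of the encoder branch could look as follows; the concrete values below (M = 64, k = 3, A = B = 16, C = 32, 2 × 2 pooling, ReLU activations) are illustrative and not the exact configuration used in this work:

```python
import tensorflow as tf

N, M = 4, 64  # hypothetical timesteps and image size

# Three blocks of Conv2D (valid padding) followed by max pooling.
encoder = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), padding="valid", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), padding="valid", activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), padding="valid", activation="relu"),
])

inputs = tf.keras.Input(shape=(N, M, M, 1))
# TimeDistributed applies the same encoder to each of the N input images.
encoded = tf.keras.layers.TimeDistributed(encoder)(inputs)
# Flatten each 3D feature map into a 1D vector, preserving the time dimension.
flat = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(encoded)
print(encoded.shape, flat.shape)  # (None, 4, 12, 12, 32) (None, 4, 4608)
```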
The encoder network is followed by a Temporal Convolutional Network (TCN) architecture that replaces plain 1D convolutional layers with residual blocks. The internal structure of the residual block (left block of Figure 1), which was proposed in [28], consists of two 1D convolutional layers of the same kernel size and number of output filters. Each of them is followed by a rectified linear unit activation and a dropout layer. Furthermore, weight normalization ([33]) is applied to the filters of the convolutions. In general, when we use a 1D convolutional layer, its input is a sequence of vectors. Thus, the N 1D vectors which are generated by the flatten layer act as the input of the first residual block of the TCN and are subsequently propagated through the different layers. The dimension of each input vector is equal to the number of input channels of the first 1D convolutional layer inside the TCN. The essential property of a residual block is that its input is added unchanged to its output. This sum goes through an activation, in our case a rectified linear unit activation, giving the final output of the residual block. The E-TCN that is shown in Figure 1 consists of a single residual block; however, in general it may consist of multiple residual blocks.
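A possible tf.keras sketch of such a residual block, under illustrative assumptions (weight normalization is omitted for brevity; the filter count, kernel size and dropout rate are arbitrary choices rather than the values used in this work), is:

```python
import tensorflow as tf

def residual_block(x, filters=64, kernel_size=3, dilation=1, dropout=0.1):
    # Two causal, dilated 1D convolutions, each followed by ReLU and dropout.
    y = tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                               dilation_rate=dilation)(x)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Dropout(dropout)(y)
    y = tf.keras.layers.Conv1D(filters, kernel_size, padding="causal",
                               dilation_rate=dilation)(y)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Dropout(dropout)(y)
    # Match channel counts with a 1x1 convolution so the block input can be
    # added directly to the block output.
    if x.shape[-1] != filters:
        x = tf.keras.layers.Conv1D(filters, 1, padding="same")(x)
    return tf.keras.layers.Activation("relu")(tf.keras.layers.Add()([x, y]))

# The N flattened encoder vectors act as a length-N sequence of channels.
seq = tf.keras.Input(shape=(4, 4608))   # (timesteps, vector dimension), hypothetical
out = residual_block(seq, filters=64, dilation=1)
```

The 1 × 1 convolution on the skip path is only needed when the number of input channels differs from the number of filters, so that the element-wise addition is well defined.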
Unlike typical convolutional networks, the E-TCN uses only causal convolutions, i.e., convolution operations that depend only on past and current information, therefore forcing the prediction at time t to depend only on the model's inputs at times up to and including t. The predicted image at time t therefore depends only on the images at times t and earlier. This is reflected in Figure 1, where the kernel is represented with purple lines between the two convolutional layers.
As in [28], we also use dilated convolutions. The input x of a dilated convolution operation F applied to an element s of a layer contains defined gaps of size d between its elements; the normal convolution is recovered for d = 1. The relationship that describes this operation is given by:
\[
F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i)\, x_{s - d\cdot i},
\]
where k is the filter size and f(i) is the i-th element of the applied filter. As is common practice, the dilation size grows exponentially by a factor of 2 at each residual block added to the network. The outputs of the TCN model are N 1D vectors, one for each of the times t_1, …, t_N. The dimension of each output vector is equal to the number of output filters of the last residual block of the TCN model. We use a reshape layer to turn these 1D vectors into a set of N 3D vectors and then a decoder network to turn the 3D vectors into our predicted images.
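As a quick numerical illustration of the dilated convolution defined above, the following NumPy sketch evaluates F(s) for a hypothetical input sequence, filter and dilation factor; indices before the start of the sequence are treated as zero padding, as in a causal convolution:

```python
import numpy as np

x = np.arange(8, dtype=float)   # hypothetical input x_0, ..., x_7
f = np.array([0.5, 1.0, 2.0])   # hypothetical filter f(0), f(1), f(2)
k, d = len(f), 2                # filter size k = 3, dilation d = 2

def F(s):
    # F(s) = sum_{i=0}^{k-1} f(i) * x_{s - d*i}, with zero padding for s - d*i < 0.
    return sum(f[i] * (x[s - d * i] if s - d * i >= 0 else 0.0) for i in range(k))

print([F(s) for s in range(len(x))])
# [0.0, 0.5, 1.0, 2.5, 4.0, 7.5, 11.0, 14.5]
```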
The decoder network follows the reverse process of the encoder network. It consists of three 2D transposed convolutional layers, between which there are two batch normalization layers. The last layer of the decoder network is a 2D convolutional layer with a single filter, because in this work we only use single-channel images. The TimeDistributed layer is wrapped around the decoder network, as before.
We return to our previous example. We suppose that the 2D convolutional kernels of the transposed convolutional layers and of the final convolutional layer are again of size k × k, and that the numbers of filters of the three 2D transposed convolutional layers are C, B and A, respectively, mirroring the encoder. In order to predict images of size M × M using the above-mentioned structure of the decoder, the output of the reshape layer must be a 3D feature map of size D × D × F that the three transposed convolutions can upsample back to the original resolution. Thus, the output of the TCN is a vector of dimension D · D · F, where F is the number of output filters of the last residual block of the TCN model and is set by us. The first transposed convolutional layer produces a feature map with C filters, the second one a feature map with B filters, and the third one a feature map with A filters at the target spatial resolution M × M. The output of the final 2D convolutional layer, which has a single filter, is of size M × M × 1, matching the size of the target image.
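Continuing the sketch, a possible tf.keras implementation of the decoder branch is shown below; the reshape target, filter counts, kernel sizes and strides are illustrative assumptions chosen so that a hypothetical 64 × 64 single-channel image is recovered, not the configuration used in the experiments:

```python
import tensorflow as tf

decoder = tf.keras.Sequential([
    # Reshape each TCN output vector back into a small 3D feature map (D x D x F).
    tf.keras.layers.Reshape((8, 8, 16)),
    tf.keras.layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2DTranspose(16, (3, 3), strides=2, padding="same"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2DTranspose(8, (3, 3), strides=2, padding="same"),
    # Final single-filter convolution: one output channel per predicted image.
    tf.keras.layers.Conv2D(1, (3, 3), padding="same"),
])

tcn_out = tf.keras.Input(shape=(4, 8 * 8 * 16))           # N vectors from the TCN
pred = tf.keras.layers.TimeDistributed(decoder)(tcn_out)  # (None, 4, 64, 64, 1)
```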
3.2. Analysis Ready Dataset
We quantify the performance of the proposed scheme by measuring the accuracy in the estimation of essential climate variables. Specifically, the proposed method (E-TCN) and the state-of-the-art method (ConvLSTM) are trained using time series of historical observations and learn to predict future values of Land Surface Temperature and Surface Soil Moisture, derived by compositing and averaging the daily values from the corresponding month. The datasets considered in this work were created from single-channel satellite-derived products obtained from the NASA Worldview application (https://worldview.earthdata.nasa.gov/, accessed on 1 June 2021).
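As a rough illustration of how such analysis-ready samples can be assembled, the following NumPy sketch composites hypothetical daily rasters into monthly means and arranges them into one-step-ahead input-target windows; the array sizes and the window length are assumptions for illustration, not the exact preprocessing pipeline used for the datasets described below:

```python
import numpy as np

# One year of hypothetical daily M x M rasters of a climate variable.
daily = np.random.rand(365, 64, 64)
month_idx = np.repeat(np.arange(12),
                      [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
# Monthly composites: average the daily values of each month.
monthly = np.stack([daily[month_idx == m].mean(axis=0) for m in range(12)])

# One-step-ahead pairs with N = 4 timesteps: inputs are months t .. t+3,
# targets are the same window shifted by one month.
N = 4
X = np.stack([monthly[t:t + N] for t in range(len(monthly) - N)])
Y = np.stack([monthly[t + 1:t + N + 1] for t in range(len(monthly) - N)])
print(X.shape, Y.shape)  # (8, 4, 64, 64) (8, 4, 64, 64)
```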
For each experiment, we split the full dataset into two separate sets, the training and the test dataset. These datasets consist of images representing values of daytime land surface temperature from January 2003 to May 2020 and of soil moisture from April 2015 to May 2020, respectively. In both cases, the objective is the prediction of the corresponding values for June 2020. We compiled four datasets to test the ability of the models to predict land surface temperature values, each consisting of 210 examples, and more specifically:
A set of images with a per-pixel resolution equal to 5 km. These images were acquired from the region in Idaho shown in Figure 2.
A set of images with a per-pixel resolution equal to 5 km. These images were acquired from the region in Sweden shown in Figure 3.
A set of images with a per-pixel resolution equal to 1 km. These images were acquired from the region in Sweden shown in Figure 4.
A set of images with a per-pixel resolution equal to 1 km. These images were acquired from the region in the USA shown in Figure 5.
Additionally, we used two datasets to test the ability of the two models to predict values of soil moisture. Each of them consisted of 15 examples of images with a per-pixel resolution equal to 5 km, encoding surface soil moisture. These images were acquired from two regions in the USA, Idaho and Arkansas, shown in Figure 6.