1. Introduction
The forest stock volume (FSV, m³ ha⁻¹) is the sum of the stem volumes of all living trees per unit area and is one of the key forest variables for forest resource management and assessment at local, regional, and national scales [1]. The FSV has a strong relationship with the aboveground biomass (AGB) and carbon stocks [2]. To understand the spatial distribution of carbon in forests and to derive predictions for monitoring carbon stock trends, the FSV must be quantified [3]. Traditionally, the FSV is estimated by sampling several plots, which requires substantial manpower, materials, and financial resources [4]. With the development of remote sensing technology, particularly the Landsat series since 1972, satellite imagery has played an important role in forest inventory. Various studies have estimated forest variables using low spatial resolution (LSR, LSR ≥ 30 m), moderate spatial resolution (MSR, 5 m < MSR < 30 m), and high spatial resolution (HSR, HSR ≤ 5 m) [5] data obtained by different optical sensors (e.g., Landsat [6,7,8,9,10,11], MODIS [12,13,14,15,16], SPOT [17,18,19,20], Quickbird [21,22,23,24,25,26], RapidEye [27,28,29,30]), microwave sensors [31,32,33,34], and light detection and ranging (LiDAR) sensors [35,36,37,38,39]. The successful applications of these technological tools have laid the foundation for the estimation of forest variables, such as the FSV, using remote sensing technology.
Two recent developments in the Earth Observation (EO) sector have increased the potential to improve the efficiency of retrieving global forest attributes. The Sentinel-2A and Sentinel-2B satellites, launched by the European Space Agency (ESA) through its Copernicus program in 2015 (S2A) and 2017 (S2B), provide nominal five-day revisit imagery across the globe [40]. The Sentinel-2 imagery includes 13 spectral bands with spatial resolutions ranging from 10 to 60 m [40]. One of the main purposes of the Sentinel-2 satellite series is vegetation analysis [4]. The Sentinel-2 satellite images provided by the operational environment monitoring system based on the European Copernicus program may be accessed freely. The different spatial resolution bands, the short revisit period, and the rich spectral information have made these images a popular source of remote sensing data for forestry research in recent years. For instance, Persson et al. [41] used Sentinel-2 data to classify common tree species in central Sweden, observing the highest overall accuracy, i.e., approximately 88.2%, using all the imagery bands in the final model. A study conducted by Hościło et al. [42] in southern Poland affirmed that the Sentinel-2 series could accurately delineate tree species (e.g., beech, oak, birch, alder, and larch) with an overall accuracy above 85%. Similarly, Pandit et al. [43] explored the ability of Sentinel-2 images to estimate the forest biomass in Nepal; their biomass estimation model achieved an R² = 0.81 and RMSE = 25.57 t ha⁻¹. Zarco-Tejada et al. [44] demonstrated the potential of Sentinel-2A data to estimate the chlorophyll content in open canopy conifer forests (R² > 0.7 for June; R² > 0.4 for December). In the Brazilian Amazon, Lima et al. [45] performed comparative research on monitoring selective logging using Sentinel-2 and Landsat-8 OLI imagery. These authors found that Sentinel-2 data (43.2% detected) were more effective in detecting logging concessions than Landsat-8 data (35.5% detected). In Poland, Grabska et al. [46] used the Sentinel-2 time series to map forest stand species in the Carpathian Mountains and reported higher accuracy, i.e., a 5–10% improvement in overall accuracy compared with using only single-date imagery. These studies demonstrate the potential of Sentinel-2 for forest vegetation monitoring.
A typical traditional remote sensing workflow involves downloading the data and processing them on a local computer, which is computationally demanding and limits the capacity to process large datasets. The development of cloud computing technology has been fundamentally changing this traditional approach to image processing. The Google Earth Engine (GEE) portal is a powerful cloud-based computing platform for image processing [47,48]. The GEE archives massive, publicly available remote sensing data, provides a programming environment, programming tools, and virtual machines that users can drive with relatively simple code, and can process imagery data online. The GEE greatly improves the processing efficiency when using substantial amounts of remote sensing data. In recent years, the GEE has been used in land cover mapping [49,50,51,52,53,54,55,56,57,58], agricultural applications [59,60,61,62,63], disaster management, and earth sciences studies [64,65,66]. This remote sensing data processing cloud platform makes the rapid processing of Sentinel-2 images covering large areas possible.
In the field of FSV estimation, several studies have assessed the FSV using remote sensing. For instance, Condés et al. [67] found that combining satellite images with field data was useful for model prediction of the plot-level growing stock volume; the adj-R² increased from 0.19 to 0.42. Using the random forests (RF) regression algorithm, Chrysafis et al. [68] estimated the FSV based on Sentinel-2 images, which provided slightly better results (R² = 0.63, RMSE = 63.11 m³ ha⁻¹) than Landsat-8 OLI images (R² = 0.62, RMSE = 64.40 m³ ha⁻¹). Other studies combined optical images and microwave data to estimate forest variables [69,70,71]. A noteworthy study was conducted by Mauya et al. [72], who assessed multiple linear regression (MLR) models built from Sentinel-1, Sentinel-2, and ALOS PALSAR-2 images to predict the FSV and found that Sentinel-2 images performed best, with an RMSEr = 42.03% and a pseudo-R² = 0.63. For predicting forest variables, Pham et al. showed that machine learning algorithms were likely to become more attractive in remote sensing [73]. These authors suggested that future studies should use more methods, larger areas, and Sentinel-2 data to predict the FSV.
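The accuracy measures quoted above (R², RMSE, and RMSEr) can be illustrated with a minimal Python sketch. It assumes the conventional definitions: R² as one minus the ratio of the residual to the total sum of squares, and RMSEr as the RMSE expressed as a percentage of the observed mean. The cited studies may use slightly different variants (e.g., adjusted or pseudo-R²), and the function name `regression_metrics` and the sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, and relative RMSE (%) for paired plot values.

    Assumes the conventional definitions; individual studies may differ.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    mean_obs = sum(observed) / n
    # Residual and total sums of squares
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "rmse_rel_pct": 100.0 * rmse / mean_obs,
    }


# Hypothetical plot-level FSV values (m3/ha), for illustration only
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(observed, predicted)
```

Because RMSEr is normalized by the observed mean, it allows comparisons between study areas whose FSV levels differ, which the absolute RMSE in m³ ha⁻¹ does not.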
However, ground plot survey data are still indispensable for remote sensing modeling [74]. The costs of ground plot surveys have always been high, which has presented some obstacles to the estimation of the provincial FSV by remote sensing. In addition, the traditional sample location survey technology often produces serious positional deviations, which may impact the modeling accuracy of the plot data-based remote sensing estimations and predictions [75]. Thus, traditional sample location technology is also an important reason for the inaccurate matching of sample plots and pixels, which results in estimation bias. Furthermore, to the best of our knowledge, little to no research has been conducted to compare the results of the RF, support vector regression (SVR), and MLR using Sentinel-2 images to predict FSVs. In addition, no FSV mapping has been conducted in Hunan Province, and this research gap directly affects forest policy making and management. Moreover, Hunan Province is located in southern China, where cloudy conditions frequently occur; these conditions are a great challenge to mapping the FSV in Hunan Province using remote sensing images.
Hence, Sentinel-2 data on the GEE platform, 459 sample plots, and three algorithms were used in this study to achieve the following objectives: (1) to identify and select the most important variables of the Sentinel-2 images for FSV estimation using plot-level tree measurements from a large number of in situ sites in Hunan Province, southern China, where forest species and stand structures are complex; (2) to assess and understand the performance of machine learning algorithms and the typical MLR models for FSV estimation; and (3) to map the FSV in Hunan Province. This study will help move towards the overall goal of developing and improving GEE-based remote sensing approaches to estimate the FSV on local, regional and national scales.
4. Discussion
The main purpose of this study was to evaluate candidate variables derived from Sentinel-2 data using different algorithms to predict the FSV based on reliable field survey data, and to map the FSV for the first time in a southern province of China. The FSV is an important variable in forest management reports at the provincial and national levels. The use of free remote sensing imagery (e.g., Sentinel-2) and cloud processing platforms (e.g., GEE) to process and build prediction models for estimating and mapping the FSV is especially important in southern China. One reason is that the area is covered by clouds on many days each year, which seriously affects provincial forest mapping research that uses remote sensing images. Another reason is that the provinces of southern China are an important part of China’s forestry. This research shows that Sentinel-2 data on the GEE platform, combined with the RF algorithm, can be used effectively to map the FSV spatially in Hunan Province. In addition, this approach is an effective way to conduct forest carbon monitoring, which is important in the context of climate change.
Based on the spectral bands and vegetation indices extracted from Sentinel-2 data, this study showed that B5 (Red-Edge 1) was the most important variable when estimating the FSV using both the machine learning methods and the MLR method, which is consistent with recent studies concerning forest prediction [102] and tree species classification [103]. In the field of gross primary productivity (GPP), Lin et al. found that the red-edge band was useful for estimating the GPP and noted that the red-edge reflectance was sensitive to the leaf chlorophyll content [104], which is itself an important forest variable. Except for B5, the modeling variables selected by the machine learning models and the MLR were not the same. However, the accuracies of the different model verification results were not very different, which suggests that B5 has a substantial advantage in estimating the FSV. Chrysafis et al. [68], who used the RF algorithm and Sentinel-2 images to estimate the FSV in a Mediterranean forest ecosystem, found that the most important variable was B11 (SWIR 1), which differs from our findings. Our research was consistent with a study conducted by Astola et al. [4], which showed that B5 was the most important variable with which to predict the FSV. Regarding the FSV prediction performance, we compared our R² with that of Astola et al. [4], who used Sentinel-2 data with multilayer perceptron and regression tree models to estimate the FSV; our result (R² = 0.58) was slightly better than their best result with a multilayer perceptron model (R² = 0.56). This difference may arise because the RF algorithm usually performs better than the multilayer perceptron [71], or because the sample survey data based on our advanced positioning technology improved the estimation accuracy. Regarding the variable selection in the machine learning methods, our results also showed that the traditional vegetation indices did not perform well when estimating the FSV in this study, and Lu et al. [105] found the same trend, i.e., that the original bands performed better than vegetation indices when estimating the FSV.
Table 6 shows that the training and test results before and after selecting the variables of the RF modeling were basically the same, which indicated that the VSURF package was a good tool with which to select variables for RF modeling. However, Table 6 also shows that the selected variables had a certain impact on SVR modeling. This is because the VSURF package is a variable selection tool based on the RF model and is not fully applicable to the SVR model.
Figure 8 shows that the three algorithms have some saturation problems in the training and test phases. In the training phase, the maximum value of the RF estimation (363.63 m³ ha⁻¹) was larger than those of the other two algorithms (SVR, 336.79 m³ ha⁻¹; MLR, 308.39 m³ ha⁻¹), and the minimum value of the RF estimation (4.97 m³ ha⁻¹) was smaller than those of the other two algorithms (SVR, 6.52 m³ ha⁻¹; MLR, 5.52 m³ ha⁻¹). In the test phase, although the maximum value estimated by MLR (290.26 m³ ha⁻¹) was the largest, the minimum value estimated by RF (7.48 m³ ha⁻¹) was the smallest, and the data range of the RF estimated values was the largest (7.48–272.23 m³ ha⁻¹). This indicated that the RF model was the best model with which to estimate the FSV. In addition, the RF exhibited the best performance among the three algorithms according to the R² and RMSE. When using the RF model to predict the FSV in Hunan Province (Figure 9), we found that the smallest FSV per hectare was 12.7421 m³, which was larger than the smallest FSV measured in the field (1.42 m³ ha⁻¹), and the largest FSV per hectare was 269.649 m³, which was smaller than the highest FSV measured (577.50 m³ ha⁻¹). This result indicated that the RF model overestimated the low FSV values and underestimated the high FSV values, which may be due to the common saturation problem in optical remote sensing vegetation analyses [83]. The overestimation could be caused by the understory vegetation (e.g., shrubs and grass), which typically affects the reflectance values. The high FSV areas often have complex canopy structures, which may also affect the reflectance values. A similar study by Ou et al. [106] also found this to be a common problem in the estimation of the FSV or biomass using multispectral remote sensing data. However, if the forest areas with a low FSV and those with a high FSV occur in a certain ratio, the overestimation and underestimation of the RF will partly balance each other. For this reason, when the RF is used to estimate the FSV over a large area, some of the errors caused by the defects of the model or the image data may offset one another. Regarding the SVR model, Figure 8c shows two distinct trend lines consisting of points. This pattern was caused by the large data volume (n = 321 in the training phase) and parameter optimization. Because the parameter optimization placed as many samples as possible within the margin around the regression hyperplane, many data points were concentrated on its edges.
After mapping the FSV for the whole of Hunan Province, we calculated summary statistics using the ENVI 5.3 “Quick Stats” tool (Exelis Visual Information Solutions, Boulder, Colorado); the mean FSV was 39.09 m³ ha⁻¹. For comparison, we calculated the mean FSV of Hunan Province based on the Analysis Report of Hunan Forestry Statistics Annual Report in 2017 and found that the value was 42.15 m³ ha⁻¹ [107]. The estimation accuracy of the mean FSV reached 92.74%. However, for the sample plot data, the mean values were 121.11 and 120.53 m³ ha⁻¹ for the training and test sets, respectively. This result shows that among the sample plots we selected, there were too few samples with a small FSV. Using these data directly for modeling may cause some errors in the model prediction results. However, before modeling, a normal transformation was performed, and the mean values were 3.97 (Table 4) and 3.04 (39.09^λ) before and after transformation, respectively (Figure 3). The normal transformation narrowed this gap, which may compensate for the impact of uneven data sampling to some extent [108]. Finally, the total FSV that we predicted in Hunan Province was 3.50 × 10⁸ m³ in forest areas. At the end of 2017, the Hunan provincial government published a report on the FSV stating that the total value was 5.48 × 10⁸ m³ [107]. Based on this observation, the RF model reached an accuracy of 63.87% in predicting the total FSV. We also examined the forest area given in the 2017 government report, which was 1.3 × 10⁷ ha [107]; this differed from the area extracted from the PALSAR-2/PALSAR Forest/Non-Forest product in 2017 (8.95 × 10⁶ ha). This difference is also an important factor that affects the estimation result of the FSV. In addition, Shen et al. [83] predicted the forest biomass in Guangdong Province using the RF model and remote sensing data and achieved slightly lower accuracy (58.88%) than that observed in this study. However, one important difference is that Shen et al. estimated the biomass, whereas we estimated the FSV.
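The accuracy figures in this paragraph appear consistent with a simple ratio of the predicted value to the reported value. Assuming that definition (it is inferred from the numbers, not stated in the text), the following Python sketch reproduces the arithmetic and shows how much of the gap in the total FSV can be attributed to the difference in forest area. The variable and function names are illustrative.

```python
# Figures quoted in the Discussion
pred_mean = 39.09      # predicted mean FSV, m3/ha
report_mean = 42.15    # mean FSV derived from the 2017 report, m3/ha
pred_total = 3.50e8    # predicted total FSV, m3
report_total = 5.48e8  # reported total FSV, m3
pred_area = 8.95e6     # forest area from the PALSAR-2 product, ha
report_area = 1.3e7    # forest area in the 2017 report, ha


def ratio_accuracy(predicted, reference):
    """Accuracy as predicted / reference, in percent (assumed definition)."""
    return 100.0 * predicted / reference


mean_accuracy = ratio_accuracy(pred_mean, report_mean)     # about 92.74
total_accuracy = ratio_accuracy(pred_total, report_total)  # about 63.87
area_ratio = ratio_accuracy(pred_area, report_area)        # about 68.8

# The predicted total is close to mean FSV x mapped forest area,
# so the total is driven by both the mean and the area used.
implied_total = pred_mean * pred_area                      # about 3.50e8 m3
```

Under this reading, the mean FSV agrees with the report to within about 7%, while the mapped forest area is only about 69% of the reported area, so most of the shortfall in the total FSV comes from the area difference and not from the per-hectare estimate.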
Since Hunan Province is located in southern China, the greatest challenge was to obtain images with little or no cloud cover for the whole province. To reduce the effects of clouds and noise on the images, we used three masking methods. To reduce or even eliminate the holes caused by the masks on the images, we processed a total of 3150 images and extracted the median values of the overlapping image parts. As all the sample plots were in a relatively homogeneous forest environment (homogeneous at least within 30 m of the sample plot boundary), the sample plot vector file was used to extract the mean values of the spanned pixels. Through this method, high-quality remote sensing data for southern China can be obtained effectively, and the potential of using Sentinel-2 to map the FSV in southern China is increased. Such a potential is truly important in southern China, where forest plantations grow rapidly in order to meet the demand for wood products and for ecological protection [109]. The dates of the selected remote sensing images were based mainly on two considerations. First, the trees in the forest grow slowly within one year. The difference between the acquisition time of the sample plot data and that of the remote sensing data was less than one year, which has little effect on the results. In addition, we used the median value of the two-year growing season to reduce this effect. Second, we considered the spectral information between May and October to be better, but the images from a single year could not cover the entirety of Hunan Province, so we chose to use data from a two-year period.
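The masking, median compositing, and plot-level averaging described above can be illustrated with a minimal, pure-Python sketch. This is not the GEE workflow used in the study. It only mimics the logic on small nested lists, and the function names `median_composite` and `plot_mean`, the clear-sky mask convention (True = usable pixel), and the sample reflectance values are all assumptions made for illustration.

```python
from statistics import median


def median_composite(stack, masks):
    """Per-pixel median over the unmasked observations of an image stack.

    stack: list of images, each a list of rows of reflectance values.
    masks: same shape as stack; True marks a clear (usable) pixel.
    Pixels with no clear observation remain holes (None).
    """
    rows, cols = len(stack[0]), len(stack[0][0])
    composite = []
    for r in range(rows):
        row = []
        for c in range(cols):
            clear = [img[r][c] for img, m in zip(stack, masks) if m[r][c]]
            row.append(median(clear) if clear else None)
        composite.append(row)
    return composite


def plot_mean(composite, pixels):
    """Mean composite value over the pixels spanned by a sample plot."""
    values = [composite[r][c] for r, c in pixels if composite[r][c] is not None]
    return sum(values) / len(values) if values else None


# Three hypothetical acquisitions of a 1 x 3 pixel strip
stack = [[[0.10, 0.30, 0.50]], [[0.90, 0.32, 0.55]], [[0.12, 0.34, 0.60]]]
masks = [
    [[True, True, False]],
    [[False, True, False]],  # first pixel is cloudy in this scene
    [[True, False, False]],
]
composite = median_composite(stack, masks)
plot_value = plot_mean(composite, [(0, 0), (0, 1)])
```

The third pixel is masked in every acquisition and therefore stays a hole, which illustrates why a large number of overlapping scenes is needed to obtain gap-free coverage in a frequently cloudy region.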
Although our prediction results were good, some limitations of the study should be noted. The first concerns the Sentinel-2 Level-1C TOA data. While three mask steps were used to process the Sentinel-2 Level-1C TOA data, this approach could not eliminate the bidirectional reflectance effects caused by changes in the sun–sensor–surface geometry between acquisitions. Further processing steps that transform the TOA data to bottom-of-atmosphere (BOA) surface data would improve the results; however, the GEE platform did not offer an online tool for this purpose, and the local tools that can convert large-scale Sentinel-2 TOA data to BOA data are limited. According to Hird et al. [57], even when TOA data are used, annual composites of Sentinel-2 input variables suffice to minimize the bidirectional effects to an acceptable degree. In this study, since approximately two years of Sentinel-2 data were used to extract the band characteristics and vegetation indices, the TOA data should be acceptable for this purpose. The second limitation is the modeling algorithm. Although the RF algorithm has been proven to exhibit outstanding performance, the underestimation of high values and the overestimation of low values persist. Developing algorithms that solve this problem will be a future research direction. The third limitation is the value distribution of the sample data. Although we conducted data transformation, this approach still cannot avoid the modeling bias caused by the uneven distribution of the sample values. The fourth limitation is that not all the data analysis work can be implemented on the GEE platform. Due to the lack of some regression functions, we had to use the R software for auxiliary analyses, which seriously reduced the efficiency of the data analysis. As the Sentinel-2 BOA data are being continuously processed and uploaded to the GEE platform, we believe that the global data products will soon be available on the GEE platform. At the same time, we also noticed the good performance of deep learning algorithms in forest variable estimations [110]. Based on these limitations, we suggest using Sentinel-2 BOA data, deep learning algorithms, and more representative sample plot data to conduct FSV estimation research in the future.