This section presents the clustering results and then compares them with those of the k-means algorithm.
6.1. Clustering Results and Analysis
Figure 7 plots the three normalized features of the temperature data collected on August 22, 2007. The figure uses eight kinds of symbols, each denoting the sensors installed under the same environmental factors. As one would expect, identical symbols lie roughly near each other in the 3D space. We can conclude that the sensors shown by the same symbol detected the same temperature characteristics on the day of the experiment.
Since the temperature variation differs from day to day, we investigate the temperature data by considering the distribution of the defined clusters on a daily basis for a whole week, August 21–27, 2007. The percentage of sensor data in each cluster on each day is shown in Figure 8. The temperature variation depends strongly on the weather conditions of each day (sunny, cloudy, etc.). We therefore list the sunshine period, as a percentage of every two-hour interval from 8:00 a.m. to 8:00 p.m., in Table 3. The sunshine data are coarse-grained, i.e., they are the percentages of sunshine for the whole experimental area, which covers all eight installation points. Although the sunshine period over each individual sensor may differ from the approximate values shown in Table 3, the data are helpful when discussing the clustering results in this section. The sunshine data in the table are publicly available at the Japan Meteorological Agency website [35].
In Figure 8, cluster D is clearly dominant on the 23rd, 24th, and 25th, when more than half of the temperature data (96%, 77%, and 63%, respectively) fall under this cluster. Cluster D indicates positive bias (Fdj1 ≥ 0.5), low changing rate (Fdj2 < 0.5), and low maximum temperature (Fdj3 < 0.5). The low amount of sunshine on the 23rd and 24th correlates with two features of cluster D, i.e., low changing rate and low maximum temperature. Although the variation of sunshine does not obviously contribute to the positive bias of temperature, the normalized bias on these two days is high enough to cross the border line of 0.5. Merely 4% of the data on the 23rd fall under cluster A, because the sensors installed facing east and west were affected by the sunshine (sunrise and sunset), so that Fdj2 of a small number of sensors is high enough to cross the threshold of 0.5. If the percentage of sunshine were higher, a larger share of the data should fall under cluster A. The amount of sunshine on the 25th directly leads to positive bias and low changing rate of temperature. However, the amount of sunshine is high on this sunny day; the normalized maximum temperature of some data (23%) is therefore above the threshold of 0.5, and these data fall under other clusters. Consequently, the percentage of cluster D on the 25th (63%) is not as high as on the 23rd (96%) and the 24th (77%).
Although the ratio of cluster D on the 26th (38%) is less than half, it is the most distinct cluster of the day. The underlying reason is that the amount of sunshine is higher in the morning than in the afternoon. As a result, some of the data (25%) show negative bias and fall under cluster C, which is the second most distinct cluster of the day. Note that the only difference between clusters C and D is the bias of temperature, i.e., the features of changing rate and maximum temperature are the same.
The most distinct cluster on the 27th is cluster C (55%), which indicates negative bias (Fdj1 < 0.5), low changing rate (Fdj2 < 0.5), and low maximum temperature (Fdj3 < 0.5). The variation of sunshine clearly correlates with the properties of negative bias and low changing rate. However, some data show high maximum temperature due to the high amount of sunshine in the morning. As a result, 45% of the data fall under cluster G, the second-ranked cluster of the day, whose only difference from cluster C is the maximum temperature. We note here that the sensors installed facing east were affected by the sunrise in the morning, so their maximum temperature is higher than the threshold of 0.5. If the percentage of sunshine were high all day (both morning and afternoon), the sensors installed facing west should also be affected by the sunset in the afternoon, and most of the data should fall under cluster G.
Cluster G occupies the highest ratio (40%) on the 22nd, which is the sunniest day of the week. The result is plausible since cluster G indicates negative bias (Fdj1 < 0.5), low changing rate (Fdj2 < 0.5), and high maximum temperature (Fdj3 ≥ 0.5). Given the stable amount of sunshine on this day, the maximum temperature should be high and the changing rate of temperature should be low. The 22nd also has negative bias because the amount of sunshine in the morning is higher than in the afternoon.
Two clusters, D and H, each occupy 30% of the temperature data collected on the 21st. Both clusters indicate positive bias (Fdj1 ≥ 0.5) and low changing rate (Fdj2 < 0.5), while the characteristic of maximum temperature differs. Cluster D indicates low maximum temperature (Fdj3 < 0.5), whereas cluster H indicates high maximum temperature (Fdj3 ≥ 0.5). The amount of sunshine clearly implies positive bias and low changing rate of temperature, which are common characteristics of both clusters. It is intuitive that the maximum temperature of each sensor stays around the threshold, i.e., some values are above it and some are below; thus the temperature data fall under both clusters D and H.
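The threshold rule used throughout this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiment: the function names `cluster_label` and `daily_percentages` are hypothetical, the sample feature triples are made up, and only the four clusters whose definitions are stated in this subsection (C, D, G, and H) are labeled. Any other combination is reported by its raw threshold pattern.

```python
from collections import Counter

THRESHOLD = 0.5  # border line applied to each normalized feature

# Only the clusters defined in this subsection. Keys are
# (bias >= 0.5, changing rate >= 0.5, maximum temperature >= 0.5).
KNOWN_CLUSTERS = {
    (False, False, False): "C",
    (True, False, False): "D",
    (False, False, True): "G",
    (True, False, True): "H",
}


def cluster_label(f1, f2, f3):
    """Return the cluster letter for one triple of normalized features."""
    pattern = (f1 >= THRESHOLD, f2 >= THRESHOLD, f3 >= THRESHOLD)
    # Patterns not defined in this subsection are left unlabeled.
    return KNOWN_CLUSTERS.get(pattern, f"unlabeled{pattern}")


def daily_percentages(features):
    """Percentage of feature triples that fall under each cluster."""
    counts = Counter(cluster_label(*triple) for triple in features)
    total = len(features)
    return {label: 100.0 * count / total for label, count in counts.items()}


# Made-up feature triples for one day (illustrative only).
sample_day = [
    (0.62, 0.31, 0.20),  # positive bias, low rate, low maximum
    (0.71, 0.12, 0.44),  # positive bias, low rate, low maximum
    (0.55, 0.40, 0.81),  # positive bias, low rate, high maximum
    (0.30, 0.22, 0.10),  # negative bias, low rate, low maximum
]
percentages = daily_percentages(sample_day)
```

Each sensor is labeled with three comparisons, which is why the distribution for a day can be computed in a single pass over the data.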
6.2. Comparative Study
To study how well the proposed methodology presents the characteristics of the clusters, we include the clustering results based on the k-means algorithm in Figure 9, where the number of clusters is set to eight. The eight clusters are named S, T, U, V, W, X, Y, and Z because their definitions differ from ours. In particular, each cluster is defined by its centroid, which differs from day to day. For example, the centroids of the clusters on the 26th are shown in Table 4.
It is apparent from Figure 9 that no cluster is distinctive on any day, i.e., the percentage of every cluster is lower than 30%. As a result, these results alone do not support an insightful discussion or yield meaningful information. Therefore, we map the above results to our definition of clusters (i.e., clusters A, B, C, D, E, F, G, and H). The centroid of each cluster is used as the criterion to map the whole cluster. For example, cluster S in Table 4 (Fdj1 = 0.1835, Fdj2 = 0.5297, and Fdj3 = 0.5768) falls under cluster F (Fdj1 < 0.5, Fdj2 ≥ 0.5, and Fdj3 ≥ 0.5). Figure 10 shows the results of mapping the k-means clusters for the whole week (August 21–27, 2007).
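The centroid-based mapping can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure actually used: `map_centroid` and `remap_sizes` are hypothetical names, only the centroid of cluster S is taken from the text, and the centroid of cluster T and both cluster sizes are made up. Only the clusters whose definitions are stated in this section (C, D, F, G, and H) are included, so other patterns map to `None`.

```python
THRESHOLD = 0.5  # same border line as in the proposed method

# Clusters whose definitions are stated in the text. Keys are
# (bias >= 0.5, changing rate >= 0.5, maximum temperature >= 0.5).
DEFINED_CLUSTERS = {
    (False, False, False): "C",
    (True, False, False): "D",
    (False, True, True): "F",
    (False, False, True): "G",
    (True, False, True): "H",
}


def map_centroid(centroid):
    """Map one k-means centroid to a threshold-defined cluster letter."""
    pattern = tuple(value >= THRESHOLD for value in centroid)
    return DEFINED_CLUSTERS.get(pattern)  # None if not defined here


def remap_sizes(centroids, sizes):
    """Move every member of a k-means cluster to its centroid's cluster."""
    totals = {}
    for name, centroid in centroids.items():
        label = map_centroid(centroid)
        totals[label] = totals.get(label, 0) + sizes[name]
    return totals


centroids = {
    "S": (0.1835, 0.5297, 0.5768),  # centroid quoted in the text
    "T": (0.70, 0.20, 0.30),        # hypothetical centroid
}
sizes = {"S": 12, "T": 20}          # hypothetical member counts
remapped = remap_sizes(centroids, sizes)
```

Because the whole k-means cluster inherits the label of its centroid, members that lie on the other side of a threshold from their centroid are labeled differently than they would be by per-sensor thresholding, which is one possible source of small differences between the two sets of results.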
The results of our method (Figure 8) and the k-means algorithm (Figure 10) are exactly the same on the 22nd, 23rd, and 24th, while they differ slightly on the 21st, 25th, 26th, and 27th. However, the trends of the clustering results and the distinctive clusters are identical. Thus we conclude that our proposed method presents the characteristics of the clusters as well as the k-means algorithm does.
In terms of computational complexity, the proposed clustering technique is linear, i.e., O(2DM(2n + 1)), while clustering with the k-means algorithm [24] takes exponential time, i.e., O(DM^(xn+1) log M), where x is the number of clusters. The proposed clustering is therefore lightweight and much faster than the k-means algorithm.
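To give a feel for the gap between the two bounds, the sketch below evaluates both expressions for one configuration. Several points are assumptions rather than statements from the text: the k-means bound is read as M raised to the power xn + 1, D is taken to be the number of days, M the number of data series, and n the number of features, and the value M = 100 is purely hypothetical. The function names are illustrative, and the figures are operation counts up to constant factors, not measured running times.

```python
import math


def proposed_cost(D, M, n):
    """Evaluate 2DM(2n + 1), the linear bound quoted for the proposed method."""
    return 2 * D * M * (2 * n + 1)


def kmeans_cost(D, M, n, x):
    """Evaluate D * M^(xn + 1) * log M (assumed reading of the k-means bound)."""
    return D * M ** (x * n + 1) * math.log(M)


D = 7    # one week of data, as in the experiment
M = 100  # hypothetical number of data series
n = 3    # three normalized features
x = 8    # eight clusters

linear = proposed_cost(D, M, n)
exponential = kmeans_cost(D, M, n, x)
ratio = exponential / linear
```

With n = 3 the linear bound reduces to 14DM, whereas under the assumed reading the exponent of M in the k-means bound is xn + 1 = 25.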