Authors:
Ammar Al Abd Alazeez; Sabah Jassim and Hongbo Du
Affiliation:
The University of Buckingham, United Kingdom
Keyword(s):
Big Data, Data Stream Clustering, Outlier Detection, Prototype-based Approaches.
Related Ontology Subjects/Areas/Topics:
Clustering; Incremental Learning; Pattern Recognition; Theory and Methods
Abstract:
Data stream clustering is becoming an active research area in big data. It refers to grouping constantly arriving new data records in large chunks to enable dynamic analysis and updating of the information patterns conveyed by the existing clusters, the outliers, and the newly arriving data chunk. Prototype-based algorithms for solving this problem are attractive for their simplicity and efficiency. However, existing implementations have limitations with respect to the quality of clusters, the ability to discover outliers, and the little consideration given to possible new patterns emerging in different chunks. In this paper, a new incremental algorithm called Enhanced Incremental K-Means (EINCKM) is developed. The algorithm is designed to detect new clusters in an incoming data chunk, merge the new clusters and existing outliers into the currently existing clusters, and generate modified clusters and outliers ready for the next round. The algorithm applies a heuristic-based method to estimate the number of clusters (K), a radius-based technique to determine and merge overlapping clusters, and a variance-based mechanism to discover outliers. The algorithm was evaluated on synthetic and real-life datasets. The experimental results indicate improved clustering correctness with a time complexity comparable to existing methods dealing with the same kind of problems.
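The abstract names three mechanisms: per-chunk clustering, radius-based merging of overlapping clusters, and variance-based outlier detection. The following is a minimal sketch of how such a prototype-based, chunk-wise pipeline can be structured. It is not the paper's actual EINCKM implementation: Euclidean distance, the fixed-K k-means step, the mean-plus-t-standard-deviations outlier cutoff, and all function names are illustrative assumptions.

```python
import numpy as np

def cluster_chunk(chunk, k, iters=20, seed=0):
    """Plain k-means on one incoming chunk; returns centers and labels.
    (EINCKM estimates K heuristically; here K is simply given.)"""
    rng = np.random.default_rng(seed)
    centers = chunk[rng.choice(len(chunk), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(chunk[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = chunk[labels == j].mean(axis=0)
    return centers, labels

def split_outliers(chunk, centers, labels, t=2.0):
    """Variance-based outlier test (assumed rule): a point whose distance
    to its center exceeds the cluster's mean distance + t * std is set
    aside as an outlier for the next round."""
    dist = np.linalg.norm(chunk - centers[labels], axis=1)
    keep = np.ones(len(chunk), dtype=bool)
    for j in range(len(centers)):
        m = labels == j
        if m.sum() > 1:
            cut = dist[m].mean() + t * dist[m].std()
            keep &= ~m | (dist <= cut)
    return chunk[keep], chunk[~keep], labels[keep]

def merge_overlapping(centers, radii):
    """Radius-based merge (assumed rule): two clusters are merged when
    the distance between their centers is below the sum of their radii."""
    centers, radii = list(centers), list(radii)
    merged = True
    while merged:
        merged = False
        for a in range(len(centers)):
            for b in range(a + 1, len(centers)):
                if np.linalg.norm(centers[a] - centers[b]) < radii[a] + radii[b]:
                    # Collapse b into a: midpoint center, larger radius kept.
                    centers[a] = (centers[a] + centers[b]) / 2
                    radii[a] = max(radii[a], radii[b])
                    del centers[b], radii[b]
                    merged = True
                    break
            if merged:
                break
    return centers, radii
```

Each round would then cluster the new chunk, pool the resulting prototypes with the existing clusters and previously held-out outliers, merge overlaps, and carry the surviving clusters and outliers forward.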