DIMK-means: "Distance-based Initialization Method for K-means Clustering Algorithm"

Full Text (PDF, 815 KB), pp. 41-51


Author(s)

Raed T. Aldahdooh 1,*, Wesam Ashour 1

1. Computer Engineering Dept., Islamic University of Gaza (IUG), Gaza, Palestine

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2013.02.05

Received: 1 May 2012 / Revised: 20 Sep. 2012 / Accepted: 7 Nov. 2012 / Published: 8 Jan. 2013

Index Terms

K-means Algorithm, Clustering Algorithm, Cluster Centroid Initialization, Initializing K-means, K-means Seeding Technique

Abstract

Partition-based clustering is one of several clustering techniques that attempt to decompose a dataset directly into a set of disjoint clusters. The K-means algorithm, which belongs to this partition-based family, is popular, widely used, and applied to a variety of domains. One of its major drawbacks is that its clustering results are extremely sensitive to the initial centroids. Because of this sensitivity, several different initialization approaches have been proposed for K-means over the past decades. This paper proposes a selection method for the initial cluster centroids in K-means clustering as an alternative to random selection. The paper provides a detailed performance assessment of the proposed initialization method over many datasets with different dimensions, numbers of observations, groups, and clustering complexities. The ability to identify the true clusters is the evaluation criterion used in this research. The experimental results show that the proposed initialization method is more effective and converges to more accurate clustering results than random initialization.
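The abstract does not spell out the DIMK-means selection rule itself, so the sketch below only illustrates the general contrast between distance-based seeding and random seeding that the paper evaluates. It is a minimal Python/NumPy example; the max-min ("farthest point") selection rule and the helper names distance_based_init and lloyd_kmeans are illustrative assumptions, not the authors' procedure.

import numpy as np

def distance_based_init(X, k, seed=None):
    # Illustrative distance-based seeding (max-min rule); NOT the paper's exact DIMK-means.
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # start from one randomly chosen point
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen centroid.
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :], axis=2),
            axis=1,
        )
        centroids.append(X[np.argmax(d)])  # pick the point farthest from all current seeds
    return np.asarray(centroids)

def lloyd_kmeans(X, k, n_iter=100, seed=None):
    # Standard Lloyd iterations, started from distance-based seeds instead of random ones.
    C = distance_based_init(X, k, seed)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels, C

Comparing runs of lloyd_kmeans seeded this way against runs seeded by uniform random selection, over datasets of varying dimension and cluster complexity, mirrors the kind of evaluation the abstract describes.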

Cite This Paper

Raed T. Aldahdooh, Wesam Ashour, "DIMK-means: Distance-based Initialization Method for K-means Clustering Algorithm," International Journal of Intelligent Systems and Applications (IJISA), vol. 5, no. 2, pp. 41-51, 2013. DOI: 10.5815/ijisa.2013.02.05
