Efficient Data Clustering Algorithms: Improvements over Kmeans

Full Text (PDF, 752 KB), pp. 37-49

Author(s)

Mohamed Abubaker 1,*, Wesam Ashour 1

1. Dept. of Computer Engineering, Islamic University of Gaza, Gaza, Palestine

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2013.03.04

Received: 4 Mar. 2012 / Revised: 21 Jul. 2012 / Accepted: 6 Oct. 2012 / Published: 8 Feb. 2013

Index Terms

Data Clustering, Random Initialization, Kmeans, K-Nearest Neighbor, Density, Noise, Outlier, Data Point

Abstract

This paper presents a new approach to overcoming some of the best-known disadvantages of the widely used Kmeans clustering algorithm: the random initialization of prototypes and the requirement that the number of clusters be specified in advance. Randomly initialized prototypes often cause Kmeans to converge to a local rather than the global optimum, so obtaining a satisfactory result may require running the algorithm many times. The proposed algorithms are based on a novel definition of the density of data points that builds on the k-nearest neighbor method. Using this definition, we detect the noise and outliers that strongly affect Kmeans, and we obtain good initial prototypes from a single run while automatically determining the number of clusters K. This algorithm is referred to as Efficient Initialization of Kmeans (EI-Kmeans). Even so, Kmeans can only cluster data into convex shapes of similar size and density. We therefore develop a new clustering algorithm, called the Efficient Data Clustering Algorithm (EDCA), that uses the same density definition. The results show that the proposed algorithms improve the clustering produced by Kmeans, and that EDCA is able to detect clusters with different non-convex shapes, sizes, and densities.
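To make the k-nearest-neighbor density idea concrete, below is a minimal Python sketch of an EI-Kmeans-style initialization. The abstract does not give the paper's exact formulas, so the density estimate (inverse mean distance to the k nearest neighbors), the noise_quantile cutoff, and the sep_factor separation rule are illustrative assumptions, not the authors' definitions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(X, k=5):
    """Density of each point as the inverse of the mean distance to its
    k nearest neighbors (a common k-NN density estimate; the paper's
    exact formula may differ)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)           # column 0 is the point itself
    return 1.0 / dist[:, 1:].mean(axis=1)

def initial_prototypes(X, k=5, noise_quantile=0.1, sep_factor=3.0):
    """Illustrative EI-Kmeans-style initialization:
    1. drop low-density points as noise/outliers,
    2. greedily pick dense points far from already-chosen prototypes,
       which also determines the number of clusters K automatically.
    noise_quantile and sep_factor are hypothetical tuning knobs."""
    dens = knn_density(X, k)
    keep = dens > np.quantile(dens, noise_quantile)  # discard outliers
    Xc, dc = X[keep], dens[keep]

    # the typical nearest-neighbor distance sets the separation scale
    nn = NearestNeighbors(n_neighbors=2).fit(Xc)
    scale = nn.kneighbors(Xc)[0][:, 1].mean() * sep_factor

    protos = []
    for i in np.argsort(dc)[::-1]:                   # densest points first
        p = Xc[i]
        if all(np.linalg.norm(p - q) > scale for q in protos):
            protos.append(p)
    return np.array(protos)                          # K = len(protos)
```

The returned prototypes can then be passed to a standard Kmeans implementation, e.g. sklearn.cluster.KMeans(n_clusters=len(protos), init=protos, n_init=1).fit(X), so that both K and the starting positions come from the density step rather than from repeated random restarts.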

Cite This Paper

Mohamed Abubaker, Wesam Ashour, "Efficient Data Clustering Algorithms: Improvements over Kmeans", International Journal of Intelligent Systems and Applications (IJISA), vol. 5, no. 3, pp. 37-49, 2013. DOI: 10.5815/ijisa.2013.03.04
