An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm

Full Text (PDF, 441KB), PP.64-73


Author(s)

Maedeh Afzali 1,* Suresh Kumar 1

1. Manav Rachna International Institute of Research and Studies, Faridabad, 121004, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2018.09.08

Received: 25 Jun. 2018 / Revised: 7 Jul. 2018 / Accepted: 14 Jul. 2018 / Published: 8 Sep. 2018

Index Terms

Text Document Clustering, Similarity Measures, Dissimilarity Measures, Distance Measures, K-means Algorithm

Abstract

In today’s world, a tremendous amount of unstructured data, especially text, is being generated through various sources. This massive amount of data has led researchers to focus on employing data mining techniques to analyse and cluster it for efficient browsing and searching. Clustering methods such as the k-means algorithm operate by measuring the relationship between data objects. Accurate clustering depends on the similarity or dissimilarity measure that is defined to evaluate the homogeneity of the documents. A variety of measures have been proposed to date. However, not all of them are suitable for use in the k-means algorithm. In this paper, an extensive study is done to compare and analyse the performance of eight well-known similarity and dissimilarity measures that are applicable to the k-means clustering approach. For the experiments, four text document data sets are used and the results are reported.
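To illustrate why the choice of measure matters in k-means, the following minimal sketch implements the assignment step with a pluggable distance function. The abstract does not name the eight measures studied, so the three shown here (Euclidean, Manhattan, and cosine) are assumed for illustration only; the term-frequency vectors and the helper name `assign_to_nearest` are hypothetical as well.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (taxicab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Turn cosine similarity into a dissimilarity (smaller means closer)."""
    return 1.0 - cosine_similarity(a, b)


def assign_to_nearest(doc, centroids, distance):
    """k-means assignment step: index of the centroid closest to doc."""
    return min(range(len(centroids)), key=lambda i: distance(doc, centroids[i]))


# Hypothetical term-frequency vectors over a three-word vocabulary.
doc = [4, 4, 0]
centroids = [[1, 1, 0], [4, 2, 2]]

for name, measure in [
    ("euclidean", euclidean_distance),
    ("manhattan", manhattan_distance),
    ("cosine", cosine_distance),
]:
    print(name, assign_to_nearest(doc, centroids, measure))
```

In this toy case the first centroid points in the same direction as the document but has a much smaller magnitude, so the cosine measure assigns the document to it, while the Euclidean and Manhattan distances prefer the second centroid. The same document can therefore land in different clusters depending on the measure.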

Cite This Paper

Maedeh Afzali, Suresh Kumar, "An Extensive Study of Similarity and Dissimilarity Measures Used for Text Document Clustering using K-means Algorithm", International Journal of Information Technology and Computer Science (IJITCS), Vol.10, No.9, pp.64-73, 2018. DOI: 10.5815/ijitcs.2018.09.08
