Using String Kernel for Document Clustering

Full Text (PDF, 272KB), PP.40-46

Views: 0 Downloads: 0

Author(s)

Qingwei Shi 1,2,* Xiaodong Qiao 1 Xu Guangquan 3

1. Institute of Science & Technology Information of China, Beijing, China

2. School of Software Liaoning Technical University, Huludao, China

3. School of Computer Science & Technology Tianjin University, Tianjin, China

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2010.02.06

Received: 1 May 2010 / Revised: 13 Jul. 2010 / Accepted: 2 Oct. 2010 / Published: 8 Dec. 2010

Index Terms

Exploratory data analysis, document clustering, string kernel, spectral clustering, support vector machine

Abstract

In this paper, we present a string kernel based method for documents clustering. Documents are viewed as sequences of strings, and documents similarity is calculated by the kernel function. According to the documents similarity, spectral clustering algorithm is used to group documents. Experimental results shows that string kernel method outperform the standard k-means algorithm on the Reuters-21578 dataset.

Cite This Paper

Qingwei Shi, Xiaodong Qiao, Xu Guangquan, "Using String Kernel for Document Clustering", International Journal of Information Technology and Computer Science(IJITCS), vol.2, no.2, pp.40-46, 2010. DOI: 10.5815/ijitcs.2010.02.06

Reference

[1] A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM computing surveys, vol.31, September 1999, pp. 264-323.

[2] G. Salton, A. Wong, and C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, vol.18, November 1975, pp.613-620.

[3] I. S. Jacobs and C. P. Bean, Fine particles, thin films and exchange anisotropy, in Magnetism, vol. III, G. T. Rado and H. Suhl, Eds. New York: Academic, 1963, pp. 271–350.

[4] N. Cristianini and J. Shawe-Taylor, An introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.

[5] H. Lodhi, N. Cristianini, J. Shawe-Taylor, and C. Watkins, Text classication using string kernel, The Journal of Machine Learning Research, vol.2, MIT Press Cambridge, MA, USA, 2001, pp.419-444.

[6] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, Cambridge, UK, 2004.

[7] J. Shi and J. Malik, Normalized cuts and image segmentation, Transactions on Pattern Analysis and Machine Intelligence, vol.22, 2000, pp.888–905.

[8] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing, vol.17, Springer, 2007, pp.395-416.

[9] David Lewis. Reuters-21578 text categorization test collection, 1997 Copyright © 2010 MECS I.J. Information Technology and Computer Science, 2010, 2, 40-46 Using String Kernel for Document Clustering 46

[10] Charles Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik, Spectral grouping using the nystrom method, Transactions on Pattern Analysis and Machine Intelligence, vol.26(2), 2004, pp.214-225.

[11] S.V.N. Vishwanathan and A.J. Smola, Fast kernels for string and tree matching, Kernels and Bioinformatics, Cambridge, MA, MIT Press, 2004.

[12] K. I. Christoper, Williams and Matthias Seeger, Using the Nystrom method to speed up kernel machines, Advances in Neural Information Processing Systems 13, Cambridge, MA, MIT Press, 2001, pp. 682– 688.

[13] Charles Elkan, Using the triangle inequality to accelerate k-means, In Proceedings of the Twentieth International Conference on Machine Learning(ICML’03), 2003, pp. 147-153.

[14] M. Filippone, F. Camastra, and F. Masulli and S. Rovetta , A survey of kernel and spectral methods for clustering, Pattern Recognition, vol.41(1), Elsevier, 2008, pp. 176-190.

[15] Nicholas O. Andrews and Edward A. Fox, Recent Developments in Document Clustering, Technical Report TR-07-35, Computer Science, Virginia Tech, 2007.

[16] D. Yan, L. Huang and M.I. Jordan, Fast approximate spectral clustering, In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM Press, 2009, pp. 907-916.

[17] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.

[18] A. Karatzoglou and I. Feinerer, Kernel-based machine learning for fast text mining in R, Computational Statistics & Data Analysis, vol. 54(2) Elsevier, pp. 290-297.

[19] A. Moschitti, Syntactic and semantic kernels for short text pair categorization, In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009, pp. 576-584.

[20] O. Amayri and N. Bouguila, Improved Online Support Vector Machines Spam Filtering Using String Kernels, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, springer, 2009, pp. 621-628.

[21] S. Zheng, Y.Yang, H. Wu and W. Liu, Chinese Text Classification Using Key Characters String Kernel, Fifth International Conference on Semantics, Knowledge and Grid, 2009, pp.113-119.

[22] Ralf Herbrich, Learning Kernel Classifiers Theory and Algorithms, Adaptive Computation and Machine Learning, MIT Press, 2002.

[23] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm, In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, MIT Press, 2002.