Evaluation of Feature Selection Approaches for Urdu Text Categorization



Author(s)

Tehseen Zia 1,*, Qaiser Abbas 1, Muhammad Pervez Akhtar 1

1. Department of Computer Science & IT, University of Sargodha, Sargodha, Pakistan

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2015.06.03

Received: 10 Aug. 2014 / Revised: 20 Dec. 2014 / Accepted: 11 Feb. 2015 / Published: 8 May 2015

Index Terms

Text Categorization, Feature Selection, Urdu, Performance Evaluation, Test Collection

Abstract

Efficient feature selection is an important phase in designing an effective text categorization system. Various feature selection methods have been proposed, each selecting a different feature set, so it is often necessary to evaluate which method is more effective for a given task and what feature-set size is an effective model selection choice. The aim of this paper is to answer these questions for the design of an Urdu text categorization system. Five widely used feature selection methods were examined using six well-known classification algorithms: naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM) with linear, polynomial, and radial basis kernels, and a decision tree (J48). The study was conducted over two test collections: the EMILLE collection and a naive collection. We observed that three of the feature selection methods, namely information gain, the Chi-square statistic, and symmetric uncertainty, performed uniformly in most if not all cases. Moreover, we found that no single feature selection method is best for all classifiers. While gain ratio outperformed the others for naive Bayes and J48, information gain showed the top performance for KNN and for SVM with polynomial and radial basis kernels. Overall, linear SVM combined with any of the information gain, Chi-square, or symmetric uncertainty methods turned out to be the first choice among all combinations of classifiers and feature selection methods on the moderately sized naive collection. On the other hand, naive Bayes with any of the feature selection methods showed an advantage on the small-sized EMILLE corpus.
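As a rough illustration of the kind of scoring the compared methods perform (a minimal sketch, not the paper's implementation; the toy corpus and token sets below are invented for the example), information gain ranks a term by how much knowing its presence or absence reduces uncertainty about the document's category:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(term, docs, labels):
    """IG(term) = H(C) - sum over {present, absent} of P(v) * H(C | v).

    docs: list of token sets; labels: parallel list of category labels.
    """
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    cond = sum((len(part) / n) * entropy(part)
               for part in (present, absent) if part)
    return entropy(labels) - cond

# Hypothetical toy corpus: documents as token sets with category labels.
docs = [{"match", "goal"}, {"match", "team"}, {"vote", "party"}, {"vote", "poll"}]
labels = ["sports", "sports", "politics", "politics"]

# "match" perfectly separates the two classes, so IG("match") = 1.0 bit;
# "goal" appears in only one sports document, so its gain is lower.
scores = {t: information_gain(t, docs, labels) for t in ("match", "vote", "goal")}
```

Selecting the top-k terms by such a score (here information gain; chi-square or symmetric uncertainty would substitute a different scoring function over the same term/class contingency counts) yields the reduced feature set fed to the classifier.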

Cite This Paper

Tehseen Zia, Qaiser Abbas, Muhammad Pervez Akhtar, "Evaluation of Feature Selection Approaches for Urdu Text Categorization", International Journal of Intelligent Systems and Applications (IJISA), vol. 7, no. 6, pp. 33-40, 2015. DOI: 10.5815/ijisa.2015.06.03
