Tehseen Zia

Work place: Department of Computer Science & IT, University of Sargodha, Sargodha, 4100, Pakistan

E-mail: tehseen_zia@yahoo.com


Research Interests: Computational Learning Theory, Data Mining

Biography

Tehseen Zia has been a faculty member in the Department of Computer Science and IT, University of Sargodha, since 2005. He is currently working as an assistant professor. He received a PhD scholarship from the Higher Education Commission in 2007 and completed his PhD at the Institute of Computer Technology, Vienna University of Technology, in November 2010. His research interests are in machine learning and text mining.

Author Articles
Evaluation of Feature Selection Approaches for Urdu Text Categorization

By Tehseen Zia, Qaiser Abbas, Muhammad Pervez Akhtar

DOI: https://doi.org/10.5815/ijisa.2015.06.03, Pub. Date: 8 May 2015

Efficient feature selection is an important phase in designing an effective text categorization system. Various feature selection methods have been proposed for selecting dissimilar feature sets. It is often essential to evaluate which method is more effective for a given task, and what size of feature set is an effective model selection choice. The aim of this paper is to answer these questions for the design of an Urdu text categorization system. Five widely used feature selection methods were examined using six well-known classification algorithms: naive Bayes (NB), k-nearest neighbor (KNN), support vector machines (SVM) with linear, polynomial, and radial basis kernels, and a decision tree (J48). The study was conducted over two test collections: the EMILLE collection and a naive collection. We observed that three of the feature selection methods, i.e. information gain, the chi-square statistic, and symmetric uncertainty, performed uniformly in most if not all cases. Moreover, we found that no single feature selection method is best for all classifiers. While gain ratio outperformed the others for naive Bayes and J48, information gain showed the top performance for KNN and for SVM with polynomial and radial basis kernels. Overall, linear SVM combined with any of the information gain, chi-square, or symmetric uncertainty methods turned out to be the first choice among all combinations of classifiers and feature selection methods on the moderately sized naive collection. On the other hand, naive Bayes with any of the feature selection methods showed an advantage on the small EMILLE corpus.
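The feature-scoring step described in the abstract can be sketched roughly as follows (a minimal, self-contained illustration; the toy corpus, the token sets, and the pure-Python scoring function are assumptions for demonstration, not the authors' implementation or data). It ranks candidate terms by information gain, one of the five methods studied, over a term-presence vs. category contingency and keeps the top-k features:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """Information gain of class labels given presence/absence of `term`.

    docs: list of token sets; labels: parallel list of category names.
    """
    base = entropy(list(Counter(labels).values()))
    present = [l for d, l in zip(docs, labels) if term in d]
    absent = [l for d, l in zip(docs, labels) if term not in d]
    cond = 0.0
    for part in (present, absent):
        if part:
            cond += len(part) / len(docs) * entropy(list(Counter(part).values()))
    return base - cond

# Toy corpus: token sets with category labels (illustrative only).
docs = [{"khel", "match"}, {"khel", "goal"}, {"siyasat", "vote"}, {"siyasat", "hukumat"}]
labels = ["sports", "sports", "politics", "politics"]
vocab = set().union(*docs)
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t), reverse=True)
top_k = ranked[:2]  # keep the k highest-scoring features
```

In the paper's pipeline, a ranking like this would be computed for each candidate method (information gain, chi-square, symmetric uncertainty, gain ratio, etc.), and each resulting feature set would then be fed to the six classifiers for comparison.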
