Arabic Text Categorization Using Mixed Words

Full Text (PDF, 578KB), pp. 74-81


Author(s)

Mahmoud Hussein 1,*, Hamdy M. Mousa 1, Rouhia M. Sallam 2

1. Faculty of Computers and Information, Menoufia University, Egypt

2. Faculty of Applied Sciences, Taiz University, Yemen

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2016.11.09

Received: 6 Oct. 2015 / Revised: 29 Mar. 2016 / Accepted: 12 Jun. 2016 / Published: 8 Nov. 2016

Index Terms

Arabic Text Categorization, Frequency Ratio Accumulation Method, Term and Document Frequency, Feature Selection, Mixed Words

Abstract

The number of Arabic text documents available online is tremendous and grows every day, so categorizing these documents has become very important. This paper proposes an approach to improve the accuracy of Arabic text categorization. It is based on a new feature representation technique that mixes bag-of-words (BOW) features with features made of two adjacent words, in different proportions. It also introduces a new feature selection technique that depends on Term Frequency (TF), and it uses the Frequency Ratio Accumulation Method (FRAM) as the classifier. Experiments are performed with neither normalization nor stemming, with only one of them, and with both. In addition, three data sets with different categories were collected from online Arabic documents to evaluate the proposed approach. The highest accuracy obtained is 98.61%, achieved using normalization.
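The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the code assumes that the mixture is a fixed share of the feature budget given to adjacent-word pairs, that selection keeps the terms with the highest corpus term frequency, and that FRAM scores a category by accumulating each term's frequency ratio across categories. All names (`select_vocabulary`, `train_fram`, `classify`, `bigram_share`) and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def unigrams_and_bigrams(tokens):
    """Return the single words and the adjacent-word pairs of one document."""
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return list(tokens), bigrams


def select_vocabulary(docs, n_features, bigram_share):
    """Keep the highest-TF terms, split between single words and word pairs.

    Assumption: `bigram_share` is the fraction of the feature budget that
    goes to adjacent-word pairs; the rest goes to single words.
    """
    uni_tf, bi_tf = Counter(), Counter()
    for tokens, _label in docs:
        unis, bis = unigrams_and_bigrams(tokens)
        uni_tf.update(unis)
        bi_tf.update(bis)
    n_bi = round(n_features * bigram_share)
    n_uni = n_features - n_bi
    vocab = {t for t, _ in uni_tf.most_common(n_uni)}
    vocab |= {t for t, _ in bi_tf.most_common(n_bi)}
    return vocab


def train_fram(docs, vocab):
    """Compute, for every term, its frequency ratio in each category."""
    freq = defaultdict(Counter)  # category -> term -> count
    for tokens, label in docs:
        unis, bis = unigrams_and_bigrams(tokens)
        freq[label].update(t for t in unis + bis if t in vocab)
    # Relative frequency of each term inside its own category.
    rel = {}
    for cat, counts in freq.items():
        total = sum(counts.values())
        rel[cat] = {t: n / total for t, n in counts.items()} if total else {}
    # Normalize across categories so a term's ratios sum to 1.
    ratios = defaultdict(dict)
    for term in vocab:
        denom = sum(rel[cat].get(term, 0.0) for cat in rel)
        if denom:
            for cat in rel:
                ratios[term][cat] = rel[cat].get(term, 0.0) / denom
    return ratios, sorted(rel)


def classify(tokens, ratios, categories):
    """Accumulate the ratios of the document's terms; pick the top category."""
    unis, bis = unigrams_and_bigrams(tokens)
    scores = dict.fromkeys(categories, 0.0)
    for term in unis + bis:
        for cat, ratio in ratios.get(term, {}).items():
            scores[cat] += ratio
    return max(scores, key=scores.get)


# Toy training data (invented for illustration only).
train = [
    ("كرة القدم مباراة فريق".split(), "sport"),
    ("فريق كرة السلة مباراة".split(), "sport"),
    ("سوق الاسهم بنك اقتصاد".split(), "economy"),
    ("بنك مركزي سوق المال".split(), "economy"),
]
vocab = select_vocabulary(train, n_features=10, bigram_share=0.2)
ratios, cats = train_fram(train, vocab)
print(classify("مباراة كرة القدم".split(), ratios, cats))
```

In this sketch, normalization and stemming would be applied to the token lists before `select_vocabulary` is called, which is where the four experimental settings described above would differ.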

Cite This Paper

Mahmoud Hussein, Hamdy M. Mousa, Rouhia M. Sallam, "Arabic Text Categorization Using Mixed Words", International Journal of Information Technology and Computer Science (IJITCS), Vol. 8, No. 11, pp. 74-81, 2016. DOI: 10.5815/ijitcs.2016.11.09
