Text Classification Using SVM Enhanced by Multithreading and CUDA

Full Text (PDF, 570KB), PP.11-23

Author(s)

Soumick Chatterjee 1,*, Pramod George Jose 2, Debabrata Datta 3

1. Otto von Guericke University, Magdeburg, Germany

2. Department of Cyber Security and Networks, Amrita University, Coimbatore, India

3. Department of Computer Science, St. Xavier’s College (Autonomous), Kolkata, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2019.01.02

Received: 8 Aug. 2018 / Revised: 10 Sep. 2018 / Accepted: 17 Oct. 2018 / Published: 8 Jan. 2019

Index Terms

Stemming, lemmatization, SVM, multithreading, CUDA

Abstract

With the rapid growth of the internet and of the digital documents available on the web, organizing text data has become a major problem. Text classification, which assigns a given piece of text to a predefined class or category, has recently become one of the main techniques for organizing such data. In the present research work, an SVM with a linear kernel has been used with the One-vs-Rest strategy. The SVM is trained on data sets collected from various sources. Some words that were uncommon around 5-6 years ago may now be prevalent because of recent trends, and new discoveries may result in the coinage of new words. The same process can also be applied to text blogs, which can be crawled and then analyzed. In theory, this technique should be able to classify blogs, tweets, or any other document with a significant degree of accuracy. In any text classification process, the preprocessing phase (cleaning, stemming, lemmatization, etc.) takes the most time. Hence, the authors have used a multithreading approach to speed up this phase. The authors further tried to reduce the processing time of the algorithm through GPU parallelism using CUDA.
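The multithreaded preprocessing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the suffix-stripping `toy_stem` function, and the names `preprocess` and `preprocess_corpus` are hypothetical stand-ins chosen for illustration, and a real pipeline would likely use a proper stemmer or lemmatizer. The sketch only shows how per-document cleaning might be spread across a thread pool.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}

# Toy suffix list standing in for a real stemmer or lemmatizer (an assumption).
SUFFIXES = ("ing", "ed", "s")


def toy_stem(token):
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(document):
    """Clean one document: lowercase, keep alphabetic tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def preprocess_corpus(documents, max_workers=4):
    """Preprocess documents concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(preprocess, documents))


if __name__ == "__main__":
    corpus = [
        "The cats are running in the garden.",
        "GPUs accelerated the training of models.",
    ]
    for tokens in preprocess_corpus(corpus):
        print(tokens)
```

Because each document is processed independently, the work maps naturally onto a pool of workers. Note that in standard CPython builds the global interpreter lock limits how much pure-Python, CPU-bound work threads can run in parallel, so any actual speedup would depend on the preprocessing library and runtime used.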

Cite This Paper

Soumick Chatterjee, Pramod George Jose, Debabrata Datta, "Text Classification Using SVM Enhanced by Multithreading and CUDA", International Journal of Modern Education and Computer Science (IJMECS), Vol.11, No.1, pp. 11-23, 2019. DOI: 10.5815/ijmecs.2019.01.02
