An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems

Full Text (PDF, 1125KB), PP.1-10

Author(s)

Yahya Alamlahi 1,* Abdulrahman Muthana 2

1. IT Operations and Systems Manager at CACBank®, Open University Malaysia (OUM) – UST Centre, Sana’a, Yemen

2. Thamar University, Yemen

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2018.12.01

Received: 26 Sep. 2018 / Revised: 7 Oct. 2018 / Accepted: 18 Oct. 2018 / Published: 8 Dec. 2018

Index Terms

Artificial Neural Networks, E-mail classification, Spam filtering, Machine learning, Principal component analysis

Abstract

This research proposes a model for presenting email to an Artificial Neural Network (ANN) to classify spam and legitimate emails. The proposed model is based on a careful selection of 13 fixed features relevant to spam emails, combined with text features.
The experiment tests many scenarios to find the best-suited combination of feature representations. These scenarios show the effect of using term frequency (tf), term frequency-inverse document frequency (tf*idf), L2 normalization, and principal component analysis (PCA) for dimension reduction. Text feature vectors are represented in the principal component space as a reduced form of the original feature vectors. The effect of PCA reduction on ANN performance is also studied.
From these tests, the best-suited model, which improves ANN classification and speeds up training, is identified and recommended. The paper also explains how an ANN anti-spam filter can be integrated into score-based anti-spam systems. The XEAMS email gateway, a commercial anti-spam product, already uses a Naïve Bayes (NB) filter as one of its many techniques for identifying spam email. On real-life emails containing Arabic and English messages, the proposed approach brings filtering results 7.5% closer to those of the XEAMS anti-spam system than the NB filter does.
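
As an illustration of the text-feature representations named in the abstract, the following minimal Python sketch computes tf, tf*idf, and L2-normalized vectors for a toy corpus. It is not the authors' implementation: the function names, the three-message corpus, and the idf variant log(N/df) are assumptions chosen for illustration, and the 13 fixed features, the PCA step, and the ANN itself are omitted.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Raw term counts (tf) for one tokenized message."""
    return Counter(tokens)


def inverse_document_frequencies(corpus):
    """Assumed variant: idf(t) = log(N / df(t)), with df(t) the number of messages containing t."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per message
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


def tfidf_vectors(corpus):
    """Return the sorted vocabulary and one L2-normalized tf*idf vector per message."""
    idf = inverse_document_frequencies(corpus)
    vocabulary = sorted(idf)
    vectors = []
    for tokens in corpus:
        tf = term_frequencies(tokens)
        vectors.append(l2_normalize([tf[t] * idf[t] for t in vocabulary]))
    return vocabulary, vectors


# Hypothetical, already tokenized messages (not taken from the paper's dataset).
corpus = [
    ["free", "offer", "click", "free"],
    ["meeting", "agenda", "attached"],
    ["free", "meeting", "room"],
]
vocabulary, vectors = tfidf_vectors(corpus)
```

Each resulting vector has one entry per vocabulary term and unit length, so messages of different sizes become comparable before any dimension reduction or classification step.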

Cite This Paper

Yahya Alamlahi, Abdulrahman Muthana, "An Email Modelling Approach for Neural Network Spam Filtering to Improve Score-based Anti-spam Systems", International Journal of Computer Network and Information Security (IJCNIS), Vol. 10, No. 12, pp. 1-10, 2018. DOI: 10.5815/ijcnis.2018.12.01
