Estimating the Sample Size for Training Intrusion Detection Systems

Full Text (PDF, 979KB), pp. 1-10


Author(s)

Yasmen Wahba 1,*, Ehab ElSalamouny 1, Ghada ElTaweel 1

1. Faculty of Computers and Informatics, Suez Canal University, Ismailia, Egypt

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2017.12.01

Received: 14 May 2017 / Revised: 5 Aug. 2017 / Accepted: 12 Sep. 2017 / Published: 8 Dec. 2017

Index Terms

Intrusion detection, Nonlinear regression, Naive Bayes, Learning curve, Power law

Abstract

Intrusion detection systems (IDS) are gaining attention as network technologies grow rapidly. Most research in this field focuses on improving the performance of these systems through various feature selection techniques and ensembles of classifiers. An orthogonal problem is to estimate the proper sample size for training those classifiers. While this problem has been considered in other disciplines, mainly medical and biological, to study the relation between the sample size and the classifier's accuracy, to the best of our knowledge it has not received similar attention in the context of intrusion detection.
In this paper we focus on systems based on Naïve Bayes classifiers and investigate the effect of the training sample size on the classification performance for the imbalanced NSL-KDD intrusion dataset. To estimate the sample size needed to achieve a target classification performance, we constructed the learning curve of the classifier for individual classes in the dataset. For this construction we performed nonlinear least squares curve fitting using two different power law models. Results showed that while the shifted power law outperforms the power law model in terms of fitting performance, it exhibited poor prediction performance. The power law, on the other hand, showed a significantly better prediction performance for larger sample sizes.
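
To illustrate the kind of fitting procedure described in the abstract, the following is a minimal sketch, not the authors' code. The exact functional forms of the power law and shifted power law, the per-class accuracy values, and the use of SciPy's curve_fit are assumptions made for the example only.

# Minimal sketch: fitting a learning curve with two candidate power-law models
# via nonlinear least squares, then extrapolating to a larger sample size.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    # assumed form: accuracy approaches 1 as the training size n grows
    return 1.0 - a * np.power(n, -b)

def shifted_power_law(n, a, b, c):
    # assumed form: adds a shift parameter c to the sample size
    return 1.0 - a * np.power(n + c, -b)

# hypothetical measurements: training sizes and observed per-class accuracies
sizes = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
acc   = np.array([0.71, 0.78, 0.83, 0.87, 0.895, 0.91])

p_params, _ = curve_fit(power_law, sizes, acc, p0=[1.0, 0.5], maxfev=10000)
s_params, _ = curve_fit(shifted_power_law, sizes, acc, p0=[1.0, 0.5, 1.0], maxfev=10000)

# predicted accuracy at a larger, unseen training size
n_new = 10000
print("power law        :", power_law(n_new, *p_params))
print("shifted power law:", shifted_power_law(n_new, *s_params))

Comparing the two fitted curves on held-out sample sizes, as in this sketch, is one way to judge which model extrapolates better.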

Cite This Paper

Yasmen Wahba, Ehab ElSalamouny, Ghada ElTaweel, "Estimating the Sample Size for Training Intrusion Detection Systems", International Journal of Computer Network and Information Security (IJCNIS), Vol.9, No.12, pp.1-10, 2017. DOI: 10.5815/ijcnis.2017.12.01

References

[1]R. L. Figueroa, Q. Zeng-Treitler, S. Kandula, and L. H. Ngo, “Predicting sample size required for classification performance,” BMC Medical Informatics and Decision Making, vol. 12, p. 8, Feb 2012.
[2]C. Perlich, Learning Curves in Machine Learning, pp.577-580. Boston, MA: Springer US, 2010.
[3]B. Gu, F. Hu, and H. Liu, Modelling Classification Performance for Large Data Sets, pp. 317-328. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001.
[4]G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R, vol. 103 of Springer Texts in Statistics. Springer New York, 2013.
[5]C. Brezinski and M. Zaglia, Extrapolation Methods: Theory and Practice, vol. 2 of Studies in Computational Mathematics. Elsevier, 2013.
[6]W. Bul’ajoul, A. James, and M. Pannu, “Improving network intrusion detection system performance through quality of service configuration and parallel technology,” Journal of Computer and System Sciences, vol. 81, no. 6, pp. 981-999, 2015. Special Issue on Optimisation, Security, Privacy and Trust in E-business Systems.
[7]N. Khamphakdee, N. Benjamas, and S. Saiyod, “Improving intrusion detection system based on snort rules for network probe attacks detection with association rules technique of data mining,” Journal of ICT Research and Applications, vol. 8, no. 3, pp. 234-250, 2015.
[8]A. Stetsko, T. Smolka, V. Matyáš, and M. Stehlík, Improving Intrusion Detection Systems for Wireless Sensor Networks, pp. 343-360. Cham: Springer International Publishing, 2014.
[9]M. Govindarajan, “Hybrid intrusion detection using ensemble of classification methods,” International Journal of Computer Network and Information Security (IJCNIS), vol. 6, no. 2, pp. 45-53, 2014.
[10]K. Atefi, S. Yahya, A. Y. Dak, and A. Atefi, A hybrid intrusion detection system based on different machine learning algorithms, pp. 312-320. Kedah, Malaysia: Universiti Utara Malaysia, 2013.
[11]Y. Wahba, E. ElSalamouny, and G. Eltaweel, “Improving the performance of multi-class intrusion detection systems using feature reduction,” International Journal of Computer Science Issues (IJCSI), vol. 12, no. 3, pp. 255-262, 2015.
[12]K. Bajaj and A. Arora, “Improving the performance of multi-class intrusion detection systems using feature reduction,” International Journal of Computer Science Issues (IJCSI), vol. 10, no. 4, pp. 324-329, 2013.
[13]V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, “Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset,” Expert Systems with Applications, vol. 38, no. 5, pp. 5947-5957, 2011.
[14]S. Mukherjee and N. Sharma, “Intrusion detection using naive bayes classifier with feature reduction,” Procedia Technology, vol. 4, pp. 119–128, 2012. 2nd International Conference on Computer, Communication, Control and Information Technology (C3IT-2012) on February 25-26, 2012.
[15]J. Song, Z. Zhu, P. Scully, and C. Price, “Modified mutual information-based feature selection for intrusion detection systems in decision tree learning,” Journal of Computers, vol. 9, no. 7, pp. 1542-1546, 2014.
[16]L.-S. Chen and J.-S. Syu, Feature Extraction based Approaches for Improving the Performance of Intrusion Detection Systems, pp. 286-291. International Association of Engineers (IAENG), 2015.
[17]S. Singh, S. Silakari, and R. Patel, An efficient feature reduction technique for intrusion detection system, pp. 147-153. IACSIT Press, Singapore, 2011.
[18]Y. Bhavsar and K. Waghmare, “Improving performance of support vector machine for intrusion detection using discretization,” International Journal of Engineering Research and Technology (IJERT), vol. 2, no. 12, pp. 2990-2994, 2013.
[19]G. M. Foody, A. Mathur, C. Sanchez-Hernandez, and D. S. Boyd, “Training set size requirements for the classification of a specific class,” Remote Sensing of Environment, vol. 104, no. 1, pp. 1-14, 2006.
[20]G. M. Foody and A. Mathur, “Toward intelligent training of supervised image classifications: directing training data acquisition for svm classification,” Remote Sensing of Environment, vol. 93, no. 1, pp. 107-117, 2004.
[21]A. V. Carneiro, “Estimating sample size in clinical studies: Basic methodological principles,” Rev Port Cardiol, vol. 22, no. 12, pp. 1513-1521, 2003.
[22]J. Cohen, Statistical Power Analysis for the Behavioural Sciences (2nd Ed.). Lawrence Erlbaum Associates, 1988.
[23]K. K. Dobbin, Y. Zhao, and R. M. Simon, “How large a training set is needed to develop a classifier for microarray data?,” Clinical Cancer Research, vol. 14, no. 1, pp. 108-114, 2008.
[24]S.-Y. Kim, “Effects of sample size on robustness and prediction accuracy of a prognostic gene signature,” BMC Bioinformatics, vol. 10, no. 1, p. 147, 2009.
[25]V. Popovici, W. Chen, B. D. Gallas, C. Hatzis, W. Shi, F. W. Samuelson, Y. Nikolsky, M. Tsyganova, A. Ishkin, T. Nikolskaya, K. R. Hess, V. Valero, D. Booser, M. Delorenzi, G. N. Hortobagyi, L. Shi, W. F. Symmans, and L. Pusztai, “Effect of training-sample size and classification difficulty on the accuracy of genomic predictors,” Breast Cancer Research, vol. 12, no. 1, p. R5, 2010.
[26]L. Kanaris, A. Kokkinis, G. Fortino, A. Liotta, and S. Stavrou, “Sample size determination algorithm for fingerprint-based indoor localization systems,” Computer Networks, vol. 101, pp. 169-177, 2016. Industrial Technologies and Applications for the Internet of Things.
[27]C. Beleites, U. Neugebauer, T. Bocklitz, C. Krafft, and J. Popp, “Sample size planning for classification models,” Analytica Chimica Acta, vol. 760, pp. 25-33, 2013.
[28]N. B. Amor, S. Benferhat, and Z. Elouedi, “Naive bayesian networks in intrusion detection systems,” in Workshop on Probabilistic Graphical Models for Classification, 14th European Conference on Machine Learning (ECML), p. 11, 2003.
[29]I. Rish, “An empirical study of the naive bayes classifier,” in IJCAI 2001 workshop on empirical methods in artificial intelligence, vol. 3, pp. 41-46, IBM New York, 2001.
[30]S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T. R. Golub, and J. P. Mesirov, “Estimating dataset size requirements for classifying dna microarray data,” Journal of Computational Biology, vol. 10, no. 2, pp. 119-142, 2003.
[31]C. Cortes, L. D. Jackel, S. A. Solla, V. Vapnik, and J. S. Denker, “Learning curves: Asymptotic values and rate of convergence,” in Advances in Neural Information Processing Systems 6 (J. D. Cowan, G. Tesauro, and J. Alspector, eds.), pp. 327-334, Morgan-Kaufmann, 1994.
[32]G. H. John and P. Langley, “Static versus dynamic sampling for data mining,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 367-370, AAAI Press, 1996.
[33]H. Motulsky and A. Christopoulos, Fitting Models to Biological Data Using Linear and Nonlinear Regression: A Practical Guide to Curve Fitting. Oxford University Press, 2004.
[34]H. J. Motulsky and L. A. Ransnas, “Fitting curves to data using nonlinear regression: a practical and nonmathematical review,” FASEB Journal: official publication of the Federation of American Societies for Experimental Biology, vol. 1, no. 5, pp. 365-374, 1987.
[35]“NSL-KDD data set.” https://web.archive.org/web/20150205070216/http://nsl.cs.unb.ca/NSL-KDD/. [Online; accessed 21-Sep-2016].
[36]M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD Cup 99 data set,” in Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications, CISDA’09, (Piscataway, NJ, USA), pp. 53-58, IEEE Press, 2009.
[37]E. Frank, M. Hall, G. Holmes, R. Kirkby, B. Pfahringer, I. H. Witten, and L. Trigg, Weka-A Machine Learning Workbench for Data Mining, pp. 1269-1277. Boston, MA: Springer US, 2010.
[38]K.-C. Khor, C.-Y. Ting, and S. Phon-Amnuaisuk, The Effectiveness of Sampling Methods for the Imbalanced Network Intrusion Detection Data Set, pp. 613-622. Cham: Springer International Publishing, 2014.
[39]E. A. Rodríguez, “Regression and ANOVA under heterogeneity,” Master’s thesis, Southern Illinois University, Carbondale, Illinois, US, 2007.
[40]L. Janson, W. Fithian, and T. J. Hastie, “Effective degrees of freedom: a flawed metaphor,” Biometrika, vol. 102, no. 2, pp. 479-485, 2015.
[41]B. Efron, “How biased is the apparent error rate of a prediction rule?,” Journal of the American Statistical Association, vol. 81, no. 394, pp. 461-470, 1986.
[42]R. C. Quinino, E. A. Reis, and L. F. Bessegato, “Using the coefficient of determination r2 to test the significance of multiple linear regression,” Teaching Statistics, vol. 35, no. 2, pp. 84-88, 2013.