A Novel Big Data Approach to Classify Bank Customers - Solution by Combining PIG, R and Hadoop

Full Text (PDF, 770KB), pp. 81-90


Author(s)

Lija Mohan 1,*, Sudheep Elayidom M. 1

1. Division of Computer Science, Cochin University of Science & Technology, Kochi, Kerala, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2016.09.10

Received: 3 Nov. 2015 / Revised: 23 Feb. 2016 / Accepted: 12 Apr. 2016 / Published: 8 Sep. 2016

Index Terms

BigData Analysis, Bank customer classification, Hadoop, PIG, R

Abstract

A large amount of data characterized by its volume, velocity, veracity, value and variety is termed Big Data. Extracting hidden patterns, customer preferences, market trends, unknown correlations, or other useful business information from a large collection of structured or unstructured data is called Big Data analysis. This article explores the scope of analyzing bank transaction data to categorize customers, which could help a bank achieve more efficient marketing, improved customer service, better operational efficiency, increased profit and other less obvious benefits. Instead of relying on a single technology to process large-scale data, we combine several tools, such as Hadoop, PIG and R, for efficient analysis. RHadoop is an emerging research trend in Big Data analysis, because R is an efficient, easy-to-code data analysis and visualization tool compared to a traditional MapReduce program. K-Means is chosen as the clustering algorithm for classification.
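
The clustering step named in the abstract can be illustrated with a minimal sketch of K-Means (Lloyd's iteration). This is not the authors' implementation, which is built on R, PIG and Hadoop; it is a small single-machine Python example. The two customer features (average balance and monthly transaction count, both pre-scaled), the sample values, and the farthest-point initialization are assumptions made only for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points with Lloyd's K-Means; returns (centroids, labels)."""
    # Deterministic start (an assumption of this sketch): take the first
    # point, then repeatedly add the point farthest from the chosen centroids.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    return centroids, labels


# Hypothetical, pre-scaled features: (average balance, monthly transactions).
customers = [
    (1.0, 2.0), (1.2, 1.8), (0.8, 2.2),     # low-activity customers
    (9.0, 10.0), (9.5, 9.8), (8.7, 10.4),   # high-activity customers
]

centroids, labels = kmeans(customers, k=2)
print(labels)
print(centroids)
```

In practice the features would need scaling to comparable ranges before clustering, since K-Means relies on Euclidean distance, and the number of clusters k has to be chosen by the analyst.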

Cite This Paper

Lija Mohan, Sudheep Elayidom M., "A Novel Big Data Approach to Classify Bank Customers - Solution by Combining PIG, R and Hadoop", International Journal of Information Technology and Computer Science (IJITCS), Vol. 8, No. 9, pp. 81-90, 2016. DOI: 10.5815/ijitcs.2016.09.10
