A New Dynamic Data Cleaning Technique for Improving Incomplete Dataset Consistency

Full Text (PDF, 391KB), pp. 60-68


Author(s)

Sreedhar Kumar S 1,* Meenakshi Sundaram S 2

1. KS School of Engineering and Management / Department of CSE, Bangalore, 560106, India

2. GSSS Institute of Engineering and Technology for Women / Department of CSE, Mysuru, 570016, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2017.09.06

Received: 5 May 2017 / Revised: 11 Jun. 2017 / Accepted: 3 Jul. 2017 / Published: 8 Sep. 2017

Index Terms

Dataset Quality Measure, Identify Normal Object, Missing Attributes, Object Consistency, Object Inconsistency, Outlier, Reconstruct Normal Object

Abstract

This paper presents a new approach named Dynamic Data Cleaning (DDC) that aims to improve the consistency of incomplete datasets by identifying, reconstructing, and removing inconsistent data objects before further data analysis. The proposed DDC approach consists of three methods: Identify Normal Object (INO), Reconstruct Normal Object (RNO), and Dataset Quality Measure (DQM). First, the INO method divides the incomplete dataset into normal objects and abnormal objects (outliers) based on the degree of missing attribute values in each individual object. Second, the RNO method reconstructs the missing attribute values of each normal object from its closest object, based on a distance metric, and removes inconsistent data objects (outliers) with a higher degree of missing data. Finally, the DQM method measures the consistency and inconsistency among the objects in the improved dataset with and without outliers. Experimental results show that the proposed DDC approach is suitable for identifying and reconstructing incomplete data objects, improving dataset consistency from a lower to a higher level without user knowledge.
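The three-step pipeline described in the abstract (split objects by their degree of missing values, impute each normal object from its closest object, then score the result) can be illustrated with a short sketch. The code below is a minimal, hypothetical interpretation only, assuming a numeric pandas DataFrame, a made-up missing-ratio threshold max_missing_ratio for INO, Euclidean distance for RNO, and a simple complete-object ratio for DQM; it is not the authors' reference implementation.

import pandas as pd

def dynamic_data_clean(df, max_missing_ratio=0.5):
    """Illustrative DDC-style pipeline: INO -> RNO -> DQM.
    max_missing_ratio is a hypothetical threshold on the fraction of
    missing attributes an object may have before it is treated as an outlier."""
    # INO: split objects by their degree of missing attribute values.
    missing_ratio = df.isna().mean(axis=1)
    normal = df[missing_ratio <= max_missing_ratio].copy()
    outliers = df[missing_ratio > max_missing_ratio].copy()

    # RNO: fill each missing value in a normal object from its closest
    # fully observed object (Euclidean distance over the shared numeric attributes).
    complete = normal.dropna()
    if not complete.empty:
        for idx, row in normal[normal.isna().any(axis=1)].iterrows():
            shared = row.notna()
            dists = ((complete.loc[:, shared] - row[shared]) ** 2).sum(axis=1) ** 0.5
            nearest = complete.loc[dists.idxmin()]
            normal.loc[idx] = row.fillna(nearest)

    # DQM: a simple consistency score, here the share of fully specified objects.
    consistency = normal.notna().all(axis=1).mean()
    return normal, outliers, consistency

In this sketch the threshold, the distance metric, and the consistency score are placeholders for whatever criteria the paper actually defines; swapping in a different distance measure only changes the line that computes dists.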

Cite This Paper

Sreedhar Kumar S, Meenakshi Sundaram S, "A New Dynamic Data Cleaning Technique for Improving Incomplete Dataset Consistency", International Journal of Information Technology and Computer Science (IJITCS), Vol. 9, No. 9, pp. 60-68, 2017. DOI: 10.5815/ijitcs.2017.09.06
