String Variant Alias Extraction Method using Ensemble Learner

Full Text (PDF, 321KB), PP.59-65


Author(s)

P. Selvaperumal 1,*, A. Suruliandi 1

1. Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2016.02.08

Received: 10 May 2015 / Revised: 5 Sep. 2015 / Accepted: 17 Nov. 2015 / Published: 8 Feb. 2016

Index Terms

String variant alias, name disambiguation, entity disambiguation, information extraction

Abstract

String variant alias names are surnames that are string variant forms of the primary name. Extracting string variant aliases is important in tasks such as information retrieval, information extraction, and name resolution. String variant alias extraction involves two steps: candidate alias name extraction and string variant alias validation. In this paper, candidate string variant aliases are first extracted from the web and then validated with the random forest ensemble classifier, using seven different string similarity metrics as features. Experiments were conducted on a string variant name-alias dataset covering 15 persons and containing 30 name-alias pairs. Experimental results show that the proposed method outperforms other similar methods in terms of accuracy.
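
To make the validation step concrete, the following minimal sketch shows how string similarity scores for a name and a candidate alias might be assembled into a feature vector. The abstract does not list the seven metrics used, so the three metrics here (normalized Levenshtein similarity, Dice coefficient over character bigrams, and the `difflib` sequence ratio) and all function names are illustrative assumptions rather than the authors' feature set.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over the two strings' character bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def feature_vector(name: str, candidate: str) -> list:
    """Hypothetical feature vector for one (name, candidate alias) pair."""
    a, b = name.lower(), candidate.lower()
    return [
        levenshtein_similarity(a, b),
        dice_bigram_similarity(a, b),
        SequenceMatcher(None, a, b).ratio(),
    ]


if __name__ == "__main__":
    # Made-up pair for illustration; not taken from the paper's dataset.
    print(feature_vector("Williams", "Willems"))
```

Vectors of this kind, labeled as alias or non-alias, could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`; that pairing is an assumption about tooling, not something the abstract specifies.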

Cite This Paper

P. Selvaperumal, A. Suruliandi, "String Variant Alias Extraction Method using Ensemble Learner", International Journal of Intelligent Systems and Applications (IJISA), Vol.8, No.2, pp.59-65, 2016. DOI:10.5815/ijisa.2016.02.08
