Retrieval of Complex Named Entities on the Web: Proposals for Similarity Computation

Full Text (PDF, 979KB), PP.1-14


Author(s)

Armel Fotsoh 1,*, Christian Sallaberry 2, Annig Le Parc Lacayrelle 2

1. RECITAL, 34 Boulevard de Bonne Nouvelle, 75010 Paris, France

2. Laboratoire d'Informatique de l'Université de Pau et des Pays de l'Adour, EA 3000, 64000 Pau, France

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2019.11.01

Received: 25 Jul. 2019 / Revised: 17 Aug. 2019 / Accepted: 25 Aug. 2019 / Published: 8 Nov. 2019

Index Terms

Complex Named Entities, Similarity Computation, Machine Learning, Web Mining

Abstract

As part of the Cognisearch project, we developed a general architecture for extracting, indexing and searching complex Named Entities (NEs) in webpages. We define a complex NE as an NE represented by a list of properties, each of which can be a single value (text, number, etc.), an "elementary" NE, or another complex NE. Before indexing a newly extracted complex NE, we must make sure that it is not already indexed, since the same NE may be referenced on several different web platforms. We therefore need to establish similarity between complex NEs in order to consolidate information about similar ones. This is the focus of this paper. Two main issues arise when computing similarity between complex NEs: (i) the same property may be expressed differently in the compared NEs; (ii) some properties may be missing. We propose several generic similarity computation approaches that apply to any type of complex NE and address both issues. We evaluate these approaches experimentally on two examples of complex NEs from the domain of social events.
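To make the two issues concrete, here is a minimal Python sketch of one possible way to compare two complex NEs property by property. It is an illustration only, not the approach proposed in the paper: the flat dictionary representation, the names `ComplexNE`, `text_similarity` and `ne_similarity`, the property names, and the weights are all assumptions made for this example. Differently expressed values are handled by a normalized string similarity from Python's standard `difflib`, and missing properties are skipped, with the remaining weights renormalized.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional

# Hypothetical representation: a complex NE as a flat mapping from
# property name to a text value, with None marking a missing property.
ComplexNE = Dict[str, Optional[str]]


def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after lowercasing and trimming (difflib ratio)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def ne_similarity(
    ne1: ComplexNE,
    ne2: ComplexNE,
    weights: Dict[str, float],
    sim: Callable[[str, str], float] = text_similarity,
) -> Optional[float]:
    """Weighted mean of per-property similarities.

    Only properties present in both NEs contribute; their weights are
    renormalized. Returns None if no weighted property is shared.
    """
    total = 0.0
    weight_sum = 0.0
    for prop, weight in weights.items():
        v1, v2 = ne1.get(prop), ne2.get(prop)
        if v1 is None or v2 is None:
            continue  # issue (ii): skip properties missing on either side
        total += weight * sim(v1, v2)  # issue (i): tolerant comparison
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else None


# Hypothetical social-event NEs, as if extracted from two different websites.
event_a = {"title": "Jazz Festival", "venue": "Main Hall", "date": "2019-07-12"}
event_b = {"title": "jazz festival ", "venue": None, "date": "2019-07-12"}
weights = {"title": 0.5, "venue": 0.2, "date": 0.3}  # illustrative weights

score = ne_similarity(event_a, event_b, weights)
print(score)
```

In this sketch a missing property neither rewards nor penalizes the pair. That is a design choice of the example; a real system might instead lower its confidence when few properties are shared, or use property-specific measures (for instance for dates or locations) in place of plain string similarity.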

Cite This Paper

Armel Fotsoh, Christian Sallaberry, Annig Le Parc Lacayrelle, "Retrieval of Complex Named Entities on the Web: Proposals for Similarity Computation", International Journal of Information Technology and Computer Science (IJITCS), Vol. 11, No. 11, pp. 1-14, 2019. DOI: 10.5815/ijitcs.2019.11.01
