Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms

Full Text (PDF, 571KB), PP.16-25


Author(s)

Abdeslem DENNAI 1,* Mohammed Yacine DENNAI 1 Sidi Mohammed BENSLIMANE 2

1. University of BECHAR, ALGERIA

2. LabRI Laboratory Higher School of Computer, SIDI BEL ABBES, ALGERIA

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2016.11.03

Received: 7 Feb. 2016 / Revised: 10 May 2016 / Accepted: 2 Jul. 2016 / Published: 8 Nov. 2016

Index Terms

Semi-structured web document, term weighting, term frequency, TF-IDF and logic frequency

Abstract

Three classes of documents circulate on the web, distinguished by how their data is organized: unstructured documents (.doc, .html, .pdf, ...), semi-structured documents (.xml, .owl, ...) and structured documents (database tables, for example). A semi-structured document is organized around tags that are either predefined or defined by its author.
However, many studies classify documents by their textual content alone and underestimate their structure. In this paper we propose a representation of semi-structured web documents based on weighted vectors, which allows their content to be exploited for later processing. The weight of each term is calculated using the normal frequency for a document, TF-IDF (Term Frequency - Inverse Document Frequency), and logic (Boolean) frequency for a set of documents. To assess and demonstrate the relevance of the proposed approach, we carry out several experiments on different corpora.
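
The three weighting schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the function names (`xml_terms`, `term_frequency`, `boolean_vectors`, `tf_idf`), the sample documents, and the choice of the common `tf * log(N / df)` form of TF-IDF are all assumptions made for illustration, and the sketch weights only the text content of the XML elements, not their tag structure.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample corpus, used only to illustrate the calls below.
SAMPLE_DOCS = [
    "<doc><title>xml index</title><body>xml term</body></doc>",
    "<doc><body>term weight</body></doc>",
]

def xml_terms(xml_text):
    """Return lower-cased word tokens from all text nodes of an XML document."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(tokens):
    """Normal frequency: occurrences of a term divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

def boolean_vectors(docs_tokens):
    """Logic (Boolean) frequency: 1 if the term occurs in the document, else 0."""
    vocabulary = sorted({t for doc in docs_tokens for t in doc})
    vectors = [[1 if t in set(doc) else 0 for t in vocabulary] for doc in docs_tokens]
    return vocabulary, vectors

def tf_idf(docs_tokens):
    """TF-IDF per document, assuming the form tf * log(N / df)."""
    n_docs = len(docs_tokens)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for doc in docs_tokens for t in set(doc))
    return [
        {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in term_frequency(doc).items()}
        for doc in docs_tokens
    ]

if __name__ == "__main__":
    tokens = [xml_terms(d) for d in SAMPLE_DOCS]
    print(term_frequency(tokens[0]))
    print(boolean_vectors(tokens))
    print(tf_idf(tokens))
```

With this form of TF-IDF, a term that appears in every document of the corpus receives a weight of zero, which is why it is computed over a set of documents rather than a single one.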

Cite This Paper

Abdeslem DENNAI, Mohammed Yacine DENNAI, Sidi Mohammed BENSLIMANE, "Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms", International Journal of Information Technology and Computer Science (IJITCS), Vol. 8, No. 11, pp. 16-25, 2016. DOI: 10.5815/ijitcs.2016.11.03
