Sentence Clustering Using Parts-of-Speech



Author(s)

Richard Khoury 1,*

1. Department of Software Engineering, Lakehead University, Thunder Bay (ON), Canada

* Corresponding author.

DOI: https://doi.org/10.5815/ijieeb.2012.01.01

Received: 2 Nov. 2011 / Revised: 8 Dec. 2011 / Accepted: 3 Jan. 2012 / Published: 8 Feb. 2012

Index Terms

Natural language processing, Part-of-speech, clustering

Abstract

Clustering algorithms are used in many Natural Language Processing (NLP) tasks, where they have proven to be popular and effective tools for discovering groups of similar linguistic items. In this exploratory paper, we propose a new clustering algorithm that automatically groups similar sentences based on their part-of-speech syntax. The algorithm generates and merges clusters using a syntactic similarity metric based on a hierarchical organization of the parts-of-speech. We demonstrate the features of this algorithm by implementing it in a question type classification system, which we use to determine whether different changes to the algorithm have a positive or negative impact.
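
As an illustration only, the general idea of merging sentences by part-of-speech similarity can be sketched as follows. This is not the algorithm of the paper: the two-level tag hierarchy `PARENT`, the scores 1.0 / 0.5 / 0.0, the position-by-position comparison, the average-linkage merging, and the threshold of 0.7 are all assumptions chosen to keep the example small. Sentences are assumed to be already tagged with Penn Treebank tags [15].

```python
from itertools import combinations

# Hypothetical two-level hierarchy: Penn Treebank tag -> coarse parent category.
PARENT = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "VB": "verb", "VBD": "verb", "VBZ": "verb", "VBP": "verb",
    "JJ": "adj", "JJR": "adj", "JJS": "adj",
    "WP": "wh", "WRB": "wh", "WDT": "wh",
    "DT": "det", "IN": "prep",
}


def tag_similarity(a, b):
    """Assumed scores: 1.0 for identical tags, 0.5 for tags sharing a parent, else 0.0."""
    if a == b:
        return 1.0
    parent_a, parent_b = PARENT.get(a), PARENT.get(b)
    if parent_a is not None and parent_a == parent_b:
        return 0.5
    return 0.0


def sentence_similarity(s, t):
    """Compare tags position by position, normalised by the longer sentence."""
    if not s and not t:
        return 1.0
    longest = max(len(s), len(t))
    return sum(tag_similarity(a, b) for a, b in zip(s, t)) / longest


def cluster_similarity(c1, c2):
    """Average pairwise sentence similarity between two clusters."""
    scores = [sentence_similarity(s, t) for s in c1 for t in c2]
    return sum(scores) / len(scores)


def cluster_sentences(tagged_sentences, threshold=0.7):
    """Start with one cluster per sentence; repeatedly merge the most similar
    pair of clusters until no pair reaches the (assumed) threshold."""
    clusters = [[s] for s in tagged_sentences]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        if cluster_similarity(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Toy input: tag sequences only, with the words shown in comments.
questions = [
    ("WP", "VBZ", "DT", "NN"),   # Who is the president
    ("WP", "VBD", "DT", "NNS"),  # Who were the winners
    ("WRB", "VBZ", "NNP"),       # Where is Paris
    ("DT", "NN", "VBZ", "JJ"),   # The sky is blue
]
clusters = cluster_sentences(questions)
```

In this toy input the two "Who" questions differ only in tags that share a parent category, so they are the only pair that reaches the threshold and are merged; the other two sentences remain in clusters of their own.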

Cite This Paper

Richard Khoury, "Sentence Clustering Using Parts-of-Speech", International Journal of Information Engineering and Electronic Business (IJIEEB), vol. 4, no. 1, pp. 1-9, 2012. DOI: 10.5815/ijieeb.2012.01.01
