Authorship Attribution for Bengali Language Using the Fusion of N-Gram and Naive Bayes Algorithms

pp. 11-21

Author(s)

D. M. Anisuzzaman 1,* Abdus Salam 2

1. Ahsanullah University of Science and Technology, Department of Computer Science and Engineering, Dhaka, Bangladesh

2. American International University-Bangladesh, Department of Computer Science, Dhaka, Bangladesh

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2018.10.02

Received: 8 Feb. 2018 / Revised: 11 May 2018 / Accepted: 12 Aug. 2018 / Published: 8 Oct. 2018

Index Terms

Naive Bayes, n-gram, authorship attribution, Bengali language, natural language processing

Abstract

This research performs authorship attribution for three Bengali writers using both the Naïve Bayes method and a new method that we propose, which outperforms Naïve Bayes on this task. Although a large body of work on authorship attribution exists for other languages (especially English), very little has been done for the Bengali language. For this experiment, we built our own dataset containing 107,380 words, of which 21,198 are unique. For both methods, we pre-process the dataset so that it is compatible with the experiments. On our dataset, Naïve Bayes achieves an accuracy of 86%, while our method achieves 95%. The main intuition behind our method is that every author tends to repeat certain sequences of adjacent words and certain single words.
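The intuition stated in the abstract, that an author repeats both single words and runs of adjacent words, can be illustrated with a small sketch. The abstract does not specify how the proposed method fuses n-grams with Naïve Bayes, so the code below is not the authors' algorithm. It is a plain multinomial Naïve Bayes classifier with add-one smoothing whose features are word unigrams plus adjacent-word bigrams. The names `extract_features` and `NGramNaiveBayes`, the author labels, and the toy English sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, max_n=2):
    """Return word unigrams and adjacent-word n-grams (up to max_n words)."""
    tokens = text.split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


class NGramNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over n-gram features.

    Illustrative only: a generic baseline, not the method from the paper.
    """

    def fit(self, texts, authors):
        self.feature_counts = defaultdict(Counter)  # author -> feature counts
        self.doc_counts = Counter(authors)          # author -> number of texts
        self.vocabulary = set()
        for text, author in zip(texts, authors):
            feats = extract_features(text)
            self.feature_counts[author].update(feats)
            self.vocabulary.update(feats)
        self.total_docs = len(texts)
        return self

    def predict(self, text):
        vocab_size = len(self.vocabulary)
        feats = extract_features(text)
        best_author, best_score = None, -math.inf
        for author, counts in self.feature_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[author] / self.total_docs)
            for feat in feats:
                score += math.log((counts[feat] + 1) / (total + vocab_size))
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Hypothetical toy corpus (invented sentences, two invented authors).
train_texts = [
    "the river flows quietly at night",
    "quietly at night the boat drifts",
    "markets open early every morning",
    "every morning the traders arrive",
]
train_authors = ["author_a", "author_a", "author_b", "author_b"]

model = NGramNaiveBayes().fit(train_texts, train_authors)
print(model.predict("the boat drifts quietly at night"))
```

Treating bigrams as ordinary features alongside unigrams is the simplest way to let repeated adjacent-word pairs contribute evidence. Because overlapping n-grams are not independent, the resulting scores are useful for ranking authors rather than as calibrated probabilities.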

Cite This Paper

D. M. Anisuzzaman, Abdus Salam, "Authorship Attribution for Bengali Language Using the Fusion of N-Gram and Naive Bayes Algorithms", International Journal of Information Technology and Computer Science (IJITCS), Vol. 10, No. 10, pp. 11-21, 2018. DOI: 10.5815/ijitcs.2018.10.02
