A Domain Specific Key Phrase Extraction Framework for Email Corpuses




I V S Venugopal 1,* D Lalitha Bhaskari 2 M N Seetaramanath 1

1. Department of IT, G V P College of Engineering (A), Andhra Pradesh, 530048, India

2. Department of CS&SE, AUCE(A), Andhra Pradesh, Visakhapatnam, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2018.07.06

Received: 8 Mar. 2018 / Revised: 11 Apr. 2018 / Accepted: 23 May 2018 / Published: 8 Jul. 2018

Index Terms

Email Corpus, Key Phrase Extraction, Domain Specific Extraction, Modified Term Frequency, Modified Inverse Document Frequency


Despite the growth of Internet communication through short messages, messaging services, and chat, email remains the most preferred communication method. Thousands of emails are exchanged every day across different service providers. Because email is such an effective communication channel, it also attracts a large volume of spam and irrelevant information. Spam emails are annoying and take considerable time to filter. Needless to say, they also consume allocated inbox space and generate heavy network traffic. Existing filtration methods are far from perfect: most filters depend on standard rules, which causes valid emails to be marked as spam. The first step of any email filtration should be to extract the key phrases from the emails, and the filters should then be activated based on those key phrases or most frequently used phrases. A number of parallel research efforts have demonstrated key phrase extraction policies. However, these methods focus on domain-specific corpuses and have not addressed email corpuses. This work therefore demonstrates a key phrase extraction process specifically for email corpuses. The extracted key phrases reflect the frequency of the words used in each email. This analysis can simplify further analysis such as sentiment analysis or spam detection, and it can also serve the needs of text summarization. The proposed component-based framework demonstrates nearly 95% accuracy.
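As a rough illustration of frequency-based key phrase extraction over a set of emails, the following Python sketch ranks candidate phrases with a standard smoothed TF-IDF score. It is not the modified term frequency or modified inverse document frequency of the proposed framework, whose definitions are not given here. The stop-word list, the function names `candidate_phrases` and `extract_key_phrases`, and the choice of one- and two-word phrases are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {
    "a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "on",
    "at", "your", "you", "we", "this", "that", "with", "be", "will", "it",
}


def candidate_phrases(text, max_len=2):
    """Return runs of 1..max_len consecutive tokens that contain no stop word."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    phrases = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOP_WORDS for tok in gram):
                phrases.append(" ".join(gram))
    return phrases


def extract_key_phrases(emails, top_k=3):
    """Rank candidate phrases in each email by a standard smoothed TF-IDF score."""
    per_email = [Counter(candidate_phrases(body)) for body in emails]

    # Document frequency: in how many emails does each phrase occur?
    doc_freq = Counter()
    for counts in per_email:
        doc_freq.update(counts.keys())

    n_docs = len(emails)
    results = []
    for counts in per_email:
        total = sum(counts.values()) or 1  # avoid division by zero for empty emails
        scores = {
            phrase: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[phrase])) + 1.0)
            for phrase, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        results.append(ranked[:top_k])
    return results
```

Calling `extract_key_phrases` on a list of email bodies returns one ranked list of phrases per email. Phrases that are frequent within one email but rare across the rest score highest, which is the kind of signal a downstream spam filter or summarizer could consume.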

Cite This Paper

I V S Venugopal, D Lalitha Bhaskari, M N Seetaramanath, "A Domain Specific Key Phrase Extraction Framework for Email Corpuses", International Journal of Information Technology and Computer Science (IJITCS), Vol.10, No.7, pp.53-60, 2018. DOI: 10.5815/ijitcs.2018.07.06

