Finding Longest Common Substrings in Documents

Full Text (PDF, 620KB), PP.27-33

Views: 0 Downloads: 0

Author(s)

M.I.Khalil 1,* M.A.Hadi 2

1. Nuclear Research Center, Atomic Energy Authority, Cairo, Egypt

2. College of Computer Science and Information’s- Princes Nourah University-Riyadh – KSA Networking and Communication Systems

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2015.09.04

Received: 20 Mar. 2015 / Revised: 5 May 2015 / Accepted: 26 Jun. 2015 / Published: 8 Aug. 2015

Index Terms

Longest Common Substring, Data structures and Algorithms, Documents, Document Similarity

Abstract

This paper introduces an algorithm to address the problem of finding the longest common substring between two documents. This problem is known as the longest common substring (LCS) problem. The proposed algorithm is based on the convolution between the two sequences (named major sequence (X) which is represented as array and the minor one (Y) which is represented as circular linked list. An array of linked lists is established and a new node is created for each match between two substrings. If two or more matches in different locations in string Y share the same location in string X, the corresponding nodes will construct a unique linked-list. Accordingly, by the end of processing, we obtain a group of linked-lists containing nodes arranged in certain manner representing all possible matches between sequences X and Y. The algorithm has been implemented and tested in C# language under Windows platform. The obtained results presented a very good speedups and indicated that impressive improvements had been achieved.

Cite This Paper

M.I.Khalil, M.A.Hadi,"Finding Longest Common Substrings in Documents", IJIGSP, vol.7, no.9, pp.27-33, 2015. DOI: 10.5815/ijigsp.2015.09.04

Reference

[1]Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997).

[2]Hui, L.C.K.: Color Set Size Problem with Applications to String Matching. In: Proc. 3rd CPM (LNCS 644). pp. 230–243 (1992).

[3]Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. 14th FOCS (SWAT). pp. 1–11 (1973).

[4]J. A. W. Faidhi and S. K. Robinson, "An empirical approach for detecting program similarity within a university programming environment", Computers & Education, vol. 11, no. 1, (1987), pp. 11-19.

[5]U. Manber, "Finding similar files in a large file system[C/OL]", In: Proceedings of the Winter USENIX Conference, (1994), pp. 1-10.

[6]BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi, (2011) September.

[7]Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. 14th FOCS (SWAT). pp. 1–11 (1973).

[8]Hui, L.C.K.: Color Set Size Problem with Applications to String Matching. In: Proc. 3rd CPM (LNCS 644). pp. 230–243 (1992).

[9]S. Grier, "A tool that detects plagiarism in Pascal programs", ACM SIGCSE Bulletin, vol. 13, no. 1, (1981), pp. 15-20.

[10]J. K. Me, M. R. Panigrahi, G. N. Dash and P. K. Meher, Wavelet Based Lossless DNA Sequence Compression for Faster Detection of Eukaryotic Protein Coding Regions, IJIGSP Vol.4, No.7, July 2012.

[11]Kamta Nath Mishra, P. C. Srivastava, Anupam Agrawal, Vivek Tripathi, Rishu Garg, Minutiae Distances and Orientation Fields Based Thumbprint Identification of Identical Twins, IJIGSP Vol.5, No.2, February 2013.

[12]Murugan. A. and Lavanya. B. DNA algorithmic approach to solve GCS problem. Journal of Computational Intelligence in Bioinformatics,3(2):239-247, 2010. 

[13]Murugan. A., Lavanya. B., and Shyamala. K.A novel programming approach for DNA computing. International Journal of Computational Intelligence Research, 7(2):199-209, 2011. 

[14]Lavanya. B. and Murugan. A. Discovering sequence motifs of different patterns parallelly using DNA operations. International Journal of Computer Applications, 3(1):18-24, Nov 2011. 

[15]Lavanya. B. and Murugan. A. A DNA based approach to _nd closed repetitive gapped subsequence from a sequence database. International Journal of Computer Applications, 29(5):45-49, Sep 2011. 

[16]http://en.wikipedia.org/wiki/Longest_common_substring_problem