Locating All Common Subsequences in Two DNA Sequences

Full Text (PDF, 540KB), PP.81-87

Views: 0 Downloads: 0

Author(s)

M. I. Khalil 1,2,*

1. Reactor Physics Dept., Nuclear Research Center, Atomic Energy Authority, Cairo, Egypt

2. Currently in a sabbatical leave as an Associate Prof. at Princess Nora Bent Abdurrahman University, Faculty of Computer and Information Sciences, Networking and Communication Dept., Riyadh, Kingdom of Saudi Arabia

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2016.05.09

Received: 20 Apr. 2015 / Revised: 3 Sep. 2015 / Accepted: 19 Nov. 2015 / Published: 8 May 2016

Index Terms

DNA similarity algorithms, DNA sequence comparison, DNA analysis, pattern recognition, Longest Common Sequence, Longest Common Subsequence

Abstract

Biological sequence comparison is one of the most important and basic problems in computational biology. Due to its high demands for computational power and memory, it is a very challenging task. The well-known algorithm proposed by Smith-Waterman obtains the best local alignments at the expense of very high computing power and huge memory requirements. This paper introduces a new efficient algorithm to locate the longest common subsequences (LCS) in two different DNA sequences. It is based on the convolution between the two DNA sequences: The major sequence is represented in the linked-list X while the minor one is represented in circular linked-list Y. An array of linked lists is established where each linked list is corresponding to an element of the linked-list X and a new node is added to it for each match between the two sequences. If two or more matches in different locations in string Y share the same location in string X, the corresponding nodes will construct a unique linked-list. Accordingly, by the end of processing, we obtain a group of linked-lists containing nodes that reflect all possible matches between the two sequences X and Y. The proposed algorithm has been implemented and tested using C# language. The benchmark test shows very good speedups and indicated that impressive improvements has been achieved.

Cite This Paper

M. I. Khalil, "Locating All Common Subsequences in Two DNA Sequences", International Journal of Information Technology and Computer Science(IJITCS), Vol.8, No.5, pp.81-87, 2016. DOI:10.5815/ijitcs.2016.05.09

Reference

[1]O. Gotoh, “An Improved Algorithm for Matching Biological Sequences,” Journal of Molecular Biology, 162, pp:705~708, 1982.J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd ed., vol. 2. Oxford: Clarendon, 1892, pp.68-73.

[2]S. Grier, “A tool that detects plagiarism in Pascal programs”, ACM SIGCSE Bulletin, vol. 13, no. 1, (1981), pp. 15-20.

[3]J. A. W. Faidhi and S. K. Robinson, “An empirical approach for detecting program similarity within a university programming environment”, Computers & Education, vol. 11, no. 1, (1987), pp. 11-19.

[4]U. Manber, “Finding similar files in a large file system[C/OL]”, In: Proceedings of the Winter USENIX Conference, (1994), pp. 1-10.

[5]BLAST, http://blast.ncbi.nlm.nih.gov/Blast.cgi, (2011) September.

[6]C. Yu, S.-Y. Cheng, R. L. He and S. S. -T. Yau, “Protein map: An alignment-free sequence comparison method based on various properties of amino acids”, Gene, vol. 486, (2011), pp. 110-118.

[7]Y. Guo and T. -m. Wang, “A new method to analyze the similarity of the DNA sequences”, Journal of Molecular Structure: THEOCHEM, vol. 853, (2008), pp. 62–67.

[8]S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, Journal of Molecular Biology, vol. 48, no. 3, (1970), pp. 443–53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325, http://linkinghub.elsevier.com/retrieve/pii/0022-2836(70)90057-4.

[9]C. S. Iliopoulos and M. S. Rahman, “Algorithms for Computing Variants of the Longest Common Subsequence Problem”, Theoretical Computer Science archive Journal, vol. 395, no. 2-3, (2008), pp. 255-267.

[10]C. -P. P. Wu, N. -F. Law and W. -C. Siu, “Cross chromosomal similarity for DNA sequence compression”, Bioinformation, vol. 2, no. 9, (2008), pp. 412-416.

[11]C. S. Iliopoulos and M. S. Rahman, “New Efficient Algorithms for LCS and Constrained LCS Problem”, In Proceedings of the Third ACiD Workshop Durham, UK, vol. 9 of Texts in Algorithmics. King's College London, (2007), pp. 83-94.

[12]A. F. Klaib, Z. Zainol, N. H. Ahamed, R. Ahmad and W. Hussin, “Application of Exact String Matching Algorithms towards SMILES Representation of Chemical Structure”, International journal of computer and information science and engineering, (2007), pp. 497-501.

[13]K. Rieck, P. Laskov and K. -R. M¨uller, “Efficient Algorithms for Similarity Measures over Sequential Data: A Look Beyond Kernels”, DAGM 2006, LNCS 4174, (2006), pp. 374–383.

[14]G. Fox, X. Qiu, S. Beason, J. Y. Choi, M. Rho, H. Tang, N. Devadasan and G. Liu, “Case Studies in Data Intensive Computing”, Large Scale DNA Sequence Analysis as the Million Sequence Challenge and Biomedical Computing, (2009).

[15]K. Derouiche and D. A. Nicole, “Semantically Resolving Type Mismatches in Scientific Workflows”, OTM 2007 Workshops, Part I, LNCS 4805, (2007), pp. 125–135, Springer-Verlag Berlin Heidelberg 2007.

[16]DIALIGN, http://dialign.gobics.de/, (2011) September.

[17]D. Rose, J. Hertel, K. Reiche, P. F. Stadler and J. Hackermüller, “NcDNAlign: Plausible multiple alignments of non-protein-coding genomic Sequences”, Genomics, vol. 92, no. 1, (2008), pp. 65-74.

[18]E. Dong, J. Smith, S. Heinze, N. Alexander and J. Meiler, “BCL::Align—Sequence alignment and fold recognition with a custom scoring function online”, Gene, vol. 422, no. 1-2, (2008), pp. 41-46.

[19]B. Vishnepolsky and M. Pirtskhalava, “ALIGN MTX—An optimal pairwise textual sequence alignment program, adapted for using in sequence-structure alignment”, Computational Biology and Chemistry, vol. 33, no. 3, (2009), pp. 235-238.

[20]P. Kalsi, H. Peltola and J. Tarhio, “Comparison of Exact String Matching Algorithms for Biological Sequences”, In: Proc. BIRD ’08, 2nd International Conference on Bioinformatics Research and Development (ed. M. Elloumi et al.). Communications in Computer and Information Science 13, Springer (2008), pp. 417-426.

[21]G. Huang, H. Zhou, Y. Li and L. Xu, “Alignment-free comparison of genome sequences by a new numerical characterization”, Journal of Theoretical Biology, vol. 281, no. 1, (2011), pp. 107-112.

[22]Izzat Alsmadi, Maryam Nuser, “String Matching Evaluation Methods for DNA Comparison,” International Journal of Advanced Science and Technology Vol. 47, October, 2012, p. 13-32.   

[23]J. K. Me, M. R. Panigrahi, G. N. Dash and P. K. Meher,  Wavelet Based Lossless DNA Sequence Compression for Faster Detection of Eukaryotic Protein Coding Regions, IJIGSP Vol.4, No.7, July 2012.

[24]Mohammed Abo-Zahhad, Sabah M. Ahmed and Shimaa A. Abd-Elrahman. A Novel Circular Mapping Technique for Spectral Classification of Exons and Introns in Human DNA Sequences, IJITCS Vol. 6, No. 4, March 2014, PP.19-29.

[25]G. Sethuraman,Kavitha Joseph, Star Coloring Problem: The DNA Solution, IJITCS Vol. 4, No. 3, April 2012, PP.31-37.

[26]M.I.Khalil, M.A.Hadi, Finding Longest Common Substrings in Documents, IJIGSP Vol. 7, No. 9, 2015, 9, 27-33.