Adaptive Dictionary-based Compression of Protein Sequences

Full Text (PDF, 209KB), PP.1-6

Views: 0 Downloads: 0

Author(s)

Akash Nag 1 Sunil Karforma 1

1. Dept. of Computer Science, The University of Burdwan, Rajbati, Burdwan 713104, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijeme.2017.05.01​

Received: 3 Jan. 2017 / Revised: 1 Feb. 2017 / Accepted: 14 Feb. 2017 / Published: 8 Sep. 2017

Index Terms

Protein sequence compression, dictionary based compression, huffman encoding

Abstract

This paper introduces a simple and fast lossless compression algorithm, called CAD, for the compression of protein sequences. The proposed algorithm is specially suited for compressing proteomes, which are the collection of all proteins expressed by an organism. Maintaining a changing dictionary of actively used amino-acid residues, the algorithm uses the adaptive dictionary together with Huffman coding to achieve an average compression rate of 3.25 bits per symbol, better than most other existing protein-compression and general-purpose compression algorithms known to us. With an average compression ratio of 2.46:1 and an average compression rate of 1.32M residues/sec, our algorithm outperforms every other compression algorithm for compressing protein sequences in terms of the balance in compression-time and compression rate.

Cite This Paper

Akash Nag, Sunil Karforma,"Adaptive Dictionary-based Compression of Protein Sequences", International Journal of Education and Management Engineering(IJEME), Vol.7, No.5, pp.1-6, 2017. DOI: 10.5815/ijeme.2017.05.01

Reference

[1]Nevill-Manning, Craig G., and Ian H. Witten. "Protein is incompressible." Data Compression Conference, 1999. Proceedings. DCC'99. IEEE, 1999.

[2]Hategan, Andrea, and Ioan Tabus. "Protein is compressible." Proceedings of the 6th Nordic Signal Processing Symposium, NORSIG. 2004.

[3]Cao, Minh Duc, et al. "A simple statistical algorithm for biological sequence compression." Data Compression Conference, 2007. DCC'07. IEEE, 2007.

[4]Adjeroh, Donald, and Fei Nan. "On compressibility of protein sequences." Data Compression Conference (DCC'06). IEEE, 2006.

[5]Daniels, Noah M., et al. "Compressive genomics for protein databases." Bioinformatics 29.13 (2013): pp:283-290.

[6]Matsumoto, Toshiko, Kunihiko Sadakane, and Hiroshi Imai. "Biological sequence compression algorithms." Genome informatics 11 (2000): 43-52.

[7]Huffman, David A. "A method for the construction of minimum-redundancy codes." Proceedings of the IRE 40.9 (1952): 1098-1101.

[8]Computing, WinZip. "WinZip. How Do You Encrypt Files in a Zip File with WinZip?” (2004)  [http://kb. winzip. com/kb/entry/78/].

[9]Roshal, A. "WinRar archiver, a powerful tool to process RAR and ZIP files." WinRAR GmbH (2009). [http://www.rarlab.com/]

[10]Pavlov, Igor. "7-zip." (2014). [http://www.7-zip.org/]

[11]Gailly J, Adler M: gzip (GNU zip) compression utility. [http://www.gnu.org/software/gzip/].

[12]Burrows, Michael, and David J. Wheeler. "A block-sorting lossless data compression algorithm." (1994).

[13]Engelson, Vadim, Dag Fritzson, and Peter Fritzson. "Lossless Compression of High-volume Numerical Data from Simulations." Data Compression Conference. 2000.

[14]Wu, Cathy H., et al. "The Universal Protein Resource (UniProt): an expanding universe of protein information." Nucleic acids research 34.suppl 1 (2006): D187-D191.

[15]Altschul, Stephen F., et al. "Basic local alignment search tool." Journal of molecular biology 215.3 (1990): 403-410.