Implementation of Parallel Web Crawler through .NET Technology

Full Text (PDF, 754KB), PP.59-65


Author(s)

Md. Abu Kausar 1,*, V. S. Dhaka 1, Sanjeev Kumar Singh 2

1. Dept. of Computer & System Sciences, Jaipur National University, Jaipur, India

2. Dept. of Mathematics, Galgotias University, Gr. Noida, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2014.08.07

Received: 13 May 2014 / Revised: 20 Jun. 2014 / Accepted: 18 Jul. 2014 / Published: 8 Aug. 2014

Index Terms

World Wide Web, Web Crawler, Multiple HTTP Connections, Multithreading, URL, Database

Abstract

The World Wide Web is growing at a very fast rate, and the information available on it changes very frequently. Because the web is so dynamic, it is difficult to obtain relevant and fresh information. In this paper we design and develop a web crawler that uses multiple HTTP connections to crawl the web. The multiple HTTP connections are implemented with multiple threads, which reduces the overall download time. The system is built on .NET technology: the proposed approach is implemented in VB.NET with multithreading to crawl web pages in parallel, and the crawled data is stored in a central database (SQL Server). Duplicate records are detected by a stored procedure, which is precompiled and therefore performs the check quickly. The proposed architecture is fast and allows many crawlers to crawl data in parallel.
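The idea in the abstract (several threads fetching pages in parallel, with a central store that rejects duplicate URLs) can be illustrated with a minimal sketch. This is not the authors' VB.NET and SQL Server implementation: it is written in Python, an in-memory SQLite table stands in for the central database, a locked check-then-insert method stands in for the stored procedure, and the names `FAKE_WEB`, `fetch_links`, `UrlStore`, `crawl_page`, and `crawl` are hypothetical. The HTTP download is replaced by a lookup in a small dictionary so the example runs without network access.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

# Hypothetical in-memory "web": URL -> outgoing links (replaces real HTTP).
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


class UrlStore:
    """Central URL table shared by all crawler threads (assumed design)."""

    def __init__(self):
        # One shared connection; the lock serializes all access to it.
        self._conn = sqlite3.connect(":memory:", check_same_thread=False)
        self._conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY)")
        self._lock = threading.Lock()

    def add_if_new(self, url):
        """Check-then-insert, playing the role of the duplicate-check procedure."""
        with self._lock:
            row = self._conn.execute(
                "SELECT 1 FROM pages WHERE url = ?", (url,)
            ).fetchone()
            if row is not None:
                return False  # duplicate: already recorded
            self._conn.execute("INSERT INTO pages (url) VALUES (?)", (url,))
            self._conn.commit()
            return True

    def all_urls(self):
        with self._lock:
            rows = self._conn.execute("SELECT url FROM pages").fetchall()
        return sorted(row[0] for row in rows)


def crawl_page(url, store):
    """Worker task: fetch one page and return only links not seen before."""
    return [link for link in fetch_links(url) if store.add_if_new(link)]


def crawl(seed, workers=4):
    """Crawl from the seed with a pool of worker threads; return stored URLs."""
    store = UrlStore()
    store.add_if_new(seed)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(crawl_page, seed, store)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                # Schedule a new task for every URL that was newly recorded.
                for link in future.result():
                    pending.add(pool.submit(crawl_page, link, store))
    return store.all_urls()


if __name__ == "__main__":
    for url in crawl("http://example.com/"):
        print(url)
```

Because each URL is recorded at most once, every page is fetched by exactly one worker regardless of how the threads interleave. In a real deployment the duplicate check would live in the database itself, as the abstract describes, so that crawlers on different machines could share it.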

Cite This Paper

Md. Abu Kausar, V. S. Dhaka, Sanjeev Kumar Singh, "Implementation of Parallel Web Crawler through .NET Technology", International Journal of Modern Education and Computer Science (IJMECS), vol.6, no.8, pp.59-65, 2014. DOI:10.5815/ijmecs.2014.08.07
