Distributed web crawling

From Free net encyclopedia


Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. The idea is to spread out the required resources of computation and bandwidth to many computers and networks.
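One common way to spread the work, sketched below for illustration (the article does not specify any particular scheme), is to partition URLs across crawler nodes by hashing the hostname, so that every page from a given site is handled by the same machine. The node count and function names here are hypothetical.

```python
# Illustrative sketch: assign each URL to a crawler node by hashing its
# hostname. Keeping a whole site on one node also keeps per-site
# politeness (crawl-delay) bookkeeping local to that machine.
import hashlib
from urllib.parse import urlparse

NUM_NODES = 4  # hypothetical number of crawler machines


def assign_node(url: str) -> int:
    """Map a URL to a crawler node index by hashing its hostname."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES


# Two URLs on the same host always map to the same node.
assert assign_node("http://example.com/a") == assign_node("http://example.com/b")
```

Because the assignment is a pure function of the hostname, any node (or the central server) can compute it independently, with no coordination traffic needed to decide who owns a URL.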


Implementations

As of 2003, most commercial search engines used this technique. Google, for example, uses thousands of individual computers in multiple locations to crawl the Web.

Newer projects are attempting a less structured, more ad hoc form of collaboration by enlisting volunteers, who in many cases contribute their home or personal computers. LookSmart, the largest search engine to use this approach, runs the Grub distributed web-crawling project.

This solution uses computers connected to the Internet to crawl Internet addresses in the background. After downloading a page, the client compresses it and sends it back, together with a status flag (e.g. changed, new, down, redirected), to the powerful central servers. The servers, which manage a large database, send out new URLs to clients for testing.
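The client's side of such a protocol might look like the following sketch. The status flags come from the description above; the function name, report format, and change-detection-by-checksum logic are assumptions for illustration, not Grub's actual wire format.

```python
# Hypothetical client-side sketch: classify one crawl result with a
# status flag, compress the body, and build a report for upload to the
# central server.
import zlib
from typing import Optional


def build_report(url: str, http_status: int, body: bytes,
                 previous_checksum: Optional[int]) -> dict:
    """Summarize one crawl result for upload to the central server."""
    checksum = zlib.crc32(body)
    if http_status >= 400:
        flag = "down"
    elif 300 <= http_status < 400:
        flag = "redirected"
    elif previous_checksum is None:
        flag = "new"          # server had no prior record of this URL
    elif checksum != previous_checksum:
        flag = "changed"
    else:
        flag = "unchanged"
    return {
        "url": url,
        "flag": flag,
        "checksum": checksum,
        # Compress the page before upload to save client bandwidth;
        # only new or changed pages need their body sent at all.
        "payload": zlib.compress(body) if flag in ("new", "changed") else b"",
    }


report = build_report("http://example.com/", 200, b"<html>hello</html>", None)
# A never-before-seen page is flagged "new" with its compressed body attached.
```

Sending only a checksum for unchanged pages is one plausible way the central servers could track freshness without receiving every page body on every crawl.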

It appears that many of the people behind Grub, including founding members, have left the project. As a result, bugs are not being fixed, and even after four years the project does not offer a way to search the crawled results.

By contrast, a newer project launched in January 2005 has shown promise: it had already passed 1.2 billion crawled web pages by September 2005, and it offers an alpha version of a search engine over the crawled results in which users can create their own ranking formulas. It has a very active community, which can be found in its forum.

This project, the Majestic-12 Distributed Search Engine, now crawls more than 15 million URLs per day on average, and its index may reach the size of a Tier-1 search engine's within a short time.

Drawbacks

According to the FAQ of Nutch, an open-source search engine, the bandwidth savings from distributed web crawling are not significant, since "A successful search engine requires more bandwidth to upload query result pages than its crawler needs to download pages...".
