[Search-l] Grub Update
peter burden
peter.burden at gmail.com
Thu Aug 2 23:33:29 UTC 2007
John McCormac wrote:
> jer wrote:
>
>> Where I really start to believe in the distributed crawling is when the
>> clients get more intelligent, recognizing 404 pages, junk pages, spider
>> traps, common patterns (parked pages), and so on.
>>
>
> This is the danger of confusing the function of crawlers with that of
> the search backend. The key to a fast and efficient crawl is that the
> crawler is streamlined and handles as many pages as possible in as short
> a time as possible. Breaking out to parse html is processor intensive
> and slows down crawling considerably.
>
This may be the key to fast crawling, but I don't think it is the key to
efficient crawling. Efficient crawling requires attention to all the points
Jer mentions and many more, most of which require parsing the HTML and
performing various analyses. I'd rather have an efficient, polite crawler
than a fast crawler - I'd really like a crawler that was both, of course ;-)

Fetching and parsing both have to be done. They could be done on
separate machines, but this does impose an overhead of extra
network traffic. I suppose it ultimately depends on how distribution
is actually organised.

In my (limited) experience, parsing HTML is not particularly CPU
intensive. What is a pain is checking for duplicate pages, resolving alias
host names, detecting various spider traps, deciding whether a
URL represents a page already fetched, and maintaining
the various data structures.
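
To make the bookkeeping concrete, here is a minimal sketch in Python of the
"already fetched?" and duplicate-page checks. It is only an illustration
under stated assumptions, not how any particular crawler does it: the names
HOST_ALIASES, canonical_url and SeenTracker are hypothetical, the alias
table is hand-written, and duplicates are detected by exact content hash
only (near-duplicates and spider traps are not handled).

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: alias host name -> canonical host name.
# A real crawler would have to discover these; here they are assumed.
HOST_ALIASES = {"example.com": "www.example.com"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Reduce a URL to a canonical form for the 'already fetched?' test."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)  # fold known alias host names
    port = parts.port
    # Drop the port when it is absent or the default for the scheme.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenTracker:
    """Remembers canonical URLs and page-content digests seen so far."""

    def __init__(self):
        self.urls = set()
        self.digests = set()

    def should_fetch(self, url):
        """Return True the first time a canonical URL is offered."""
        key = canonical_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_duplicate(self, body):
        """Return True if an identical page body (bytes) was seen before."""
        digest = hashlib.sha1(body).hexdigest()
        if digest in self.digests:
            return True
        self.digests.add(digest)
        return False
```

Even this toy version shows where the cost lies: the sets grow with the
crawl and have to be kept somewhere every fetcher can consult, which is
the data-structure maintenance problem rather than a parsing problem.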
> Regards...jmcc
>