[Search-l] Grub Update

John McCormac jmcc at hackwatch.com
Fri Aug 3 00:41:45 UTC 2007


peter burden wrote:
> and many more, most of them require parsing the HTML and performing
> various analyses. I'd rather have an efficient polite crawler than a
> fast crawler - I'd really like a crawler that was both of course ;-)

Naturally. :) But there are so many elements in parsing a PPC or holding 
page or other junk that it does slow things down if done as part of the 
crawl.

> Fetching and parsing both have to be done, they could be done on
> separate machines but this does impose an overhead of extra
> network traffic. I suppose it ultimately depends on how distribution
> is actually organised.

There hasn't been much talk on the list about this. The data still has 
to be transferred to a central site and processed to make it searchable. 
Not having the raw data can be a problem when it comes to fault finding.

> In my (limited) experience parsing HTML is not particularly CPU
> intensive - what is a pain is checking for duplicate pages, alias
> host names, detecting various spider traps, deciding whether a
> URL represents a page already fetched and maintaining
> the various data structures.

A lot of that can be handled at the pre-Index stage. There are other 
tricks that allow pages to be compared. Parsing HTML is almost 
straightforward as it does not really vary from page to page. (Though 
people do tend to break it in interesting ways.) Looking for a series of 
specific strings in the HTML can slow things down. The more I look at 
the problem, the more I wonder if it might be better just to use some 
kind of wget-like program that respects robots.txt and to concentrate 
the parsing on the filesystem. The organisation of a distributed crawl 
is the key. Letting the spiders loose and hoping for the best is not the 
way to go about things.
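
To make the fetch-then-parse-later idea concrete, here is a minimal 
sketch in Python. It is only an illustration under stated assumptions, 
not a description of any existing crawler: the names USER_AGENT, 
load_robots and store_raw are hypothetical, the robots.txt text is 
assumed to have been fetched already, and using a SHA-256 hash of the 
raw bytes is just one simple way of comparing pages (it only catches 
exact duplicates).

```python
import hashlib
import tempfile
from pathlib import Path
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical crawler name


def load_robots(robots_txt: str) -> RobotFileParser:
    """Build a robots.txt checker from text that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def store_raw(root: Path, url: str, body: bytes) -> tuple[Path, bool]:
    """Store the raw page under its content hash.

    Returns the file path and whether this content was new. Identical
    bodies fetched from different URLs (for example alias host names)
    end up in the same file, so exact duplicates are found without
    parsing any HTML during the crawl.
    """
    root.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(body).hexdigest()
    path = root / digest[:2] / digest
    is_new = not path.exists()
    if is_new:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)
    # Keep a URL-to-content log so the raw data can be traced later.
    with open(root / "urls.log", "a", encoding="utf-8") as log:
        log.write(f"{digest}\t{url}\n")
    return path, is_new


if __name__ == "__main__":
    robots = load_robots("User-agent: *\nDisallow: /private/\n")
    print(robots.can_fetch(USER_AGENT, "http://example.com/index.html"))
    print(robots.can_fetch(USER_AGENT, "http://example.com/private/a.html"))

    with tempfile.TemporaryDirectory() as tmp:
        page = b"<html><body>holding page</body></html>"
        print(store_raw(Path(tmp), "http://example.com/", page)[1])
        print(store_raw(Path(tmp), "http://www.example.com/", page)[1])
```

The actual fetching is left out on purpose. Any polite downloader could 
call store_raw, and the parsing and analysis could then run over the 
stored files on whatever machine is convenient.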

Regards...jmcc
-- 
******************************************************
John McCormac  *  e-mail: jmcc at whoisireland.com
MC2            *  voice:  +353-51-873640
22 Viewmount   *  web:  http://www.whoisireland.com/
Waterford      *  blog: http://blog.whoisireland.com
Ireland        *  Irish Domain Stats & Market Research
******************************************************


