[Search-l] Grub Update
John McCormac
jmcc at hackwatch.com
Fri Aug 3 00:41:45 UTC 2007
peter burden wrote:
> and many more, most of them require parsing the HTML and performing
> various analyses. I'd rather have an efficient polite crawler than a
> fast crawler - I'd really like a crawler that was both of course ;-)
Naturally. :) But there are so many elements in parsing a PPC or holding
page or other junk that it does slow things down if done as part of the
crawl.
> Fetching and parsing both have to be done, they could be done on
> separate machines but this does impose an overhead of extra
> network traffic. I suppose it ultimately depends on how distribution
> is actually organised.
There hasn't been much talk on the list about this. The data still has
to be transferred to a central site and processed to make it searchable.
Not having the raw data can be a problem when it comes to fault finding.
> In my (limited) experience parsing HTML is not particularly CPU
> intensive - what is a pain is checking for duplicate pages, alias
> host names, detecting various spider traps, deciding whether a
> URL represents a page already fetched and maintaining
> the various data structures.
A lot of that can be handled at the pre-Index stage. There are other
tricks that allow pages to be compared. Parsing HTML is almost
straightforward as it does not really vary from page to page. (Though
people do tend to break it in interesting ways.) Looking for a series of
specific strings in the HTML can slow things down. The more I look at
the problem, the more I wonder if it might be better just to use some
kind of wget-like program that respects robots.txt and to concentrate
the parsing on the files it leaves on the filesystem.
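To make the wget-like idea a little more concrete, here is a minimal
Python sketch. It is not anything from Grub, just one way it might look.
It assumes the robots.txt text and the page body have already been
downloaded by some fetcher, checks the URL with the standard library's
urllib.robotparser, and writes the raw bytes to disk so that parsing can
happen later as a separate pass. The names USER_AGENT, load_robots,
store_raw and content_fingerprint are made up for illustration, and the
whitespace-normalised hash is only a crude stand-in for the page
comparison tricks mentioned above.

```python
import hashlib
from pathlib import Path
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical crawler name


def load_robots(robots_txt: str) -> RobotFileParser:
    """Build a robots.txt checker from text that was already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def store_raw(root: Path, url: str, body: bytes) -> Path:
    """Save the untouched response body; the file name is a hash of the URL."""
    root.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".raw"
    path = root / name
    path.write_bytes(body)
    return path


def content_fingerprint(body: bytes) -> str:
    """Hash of whitespace-normalised content, for exact-duplicate checks."""
    return hashlib.sha1(b" ".join(body.split())).hexdigest()
```

A crawler built this way would call robots.can_fetch(USER_AGENT, url)
before each request and store_raw() after it, leaving all the string
matching and duplicate checks to a later job that walks the directory.
Keeping the raw files around also helps with the fault-finding point
made earlier.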
The organisation of a distributed crawl is the key. Letting the spiders
loose and hoping for the best is not the way to go about things.
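As one illustration of what organising the crawl could mean, a common
approach is to give every host name to exactly one crawler, so that
politeness delays and robots.txt handling for a site live in one place.
The sketch below is an assumption on my part, not how Grub or any
particular project does it; assign_worker and num_workers are
hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL to a worker index so that one host always gets one worker."""
    # urlsplit().hostname is already lowercased; it is None for odd URLs.
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_workers
```

Every URL on the same host lands on the same worker regardless of path,
which keeps per-site rate limiting simple. The cost is that a few very
large sites can overload a single worker.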
Regards...jmcc
--
******************************************************
John McCormac * e-mail: jmcc at whoisireland.com
MC2 * voice: +353-51-873640
22 Viewmount * web: http://www.whoisireland.com/
Waterford * blog: http://blog.whoisireland.com
Ireland * Irish Domain Stats & Market Research
******************************************************