[Search-l] Grub Update

Jimmy Wales jwales at wikia.com
Fri Aug 3 00:28:26 UTC 2007


peter burden wrote:
> This may be the key to fast crawling but I don't think it is the key to
> efficient crawling. Efficient crawling requires attention to all the
> points Jer mentions and many more, most of them require parsing the HTML
> and performing various analyses. I'd rather have an efficient polite
> crawler than a fast crawler - I'd really like a crawler that was both of
> course ;-)

And this is the real potential strength of a distributed approach, I 
think.  With a small number of crawling machines, you perhaps have to 
fetch fetch fetch fetch without a lot of "thinking".  But with 10,000 or
100,000 or 1,000,000 machines pitching in?

I am not sure what the best architecture for that will end up being -- 
that's an empirical question and I don't think we have enough experience 
yet, any of us, to really know the answer.  So, we move forward and 
learn. :)

> In my (limited) experience parsing HTML is not particularly CPU
> intensive - what is a pain is checking for duplicate pages, alias
> host names, detecting various spider traps, deciding whether a
> URL represents a page already fetched and maintaining
> the various data structures.

*nod*
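
To make two of those bookkeeping chores concrete ("has this URL already
been fetched?" and "is this page a duplicate?"), here is a rough Python
sketch. Everything in it is an assumption for illustration only: the names
canonical_url, content_fingerprint and SeenFilter are made up, and nothing
here describes how Grub actually works. It does not attempt alias host
names or spider traps, which need more than string normalization.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Normalize a URL so trivially different spellings compare equal.

    Hypothetical helper: lowercases scheme and host, drops default
    ports and fragments, and treats an empty path as "/".
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def content_fingerprint(body):
    """Hash a page body (bytes) after collapsing whitespace runs."""
    return hashlib.sha1(b" ".join(body.split())).hexdigest()


class SeenFilter:
    """In-memory record of URLs and page bodies already encountered."""

    def __init__(self):
        self.urls = set()
        self.fingerprints = set()

    def should_fetch(self, url):
        """Return True the first time a canonical URL is offered."""
        key = canonical_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_duplicate(self, body):
        """Return True if an equivalent body was recorded before."""
        fp = content_fingerprint(body)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

A crawler client would call should_fetch() before requesting a URL and
is_duplicate() after downloading the body. Exact hashing only catches
byte-identical pages (modulo whitespace); near-duplicates would need a
different technique, and a real distributed setup would have to share
these sets between machines rather than keep them in one process.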

--Jimbo


