[Search-l] Grub Update
Jimmy Wales
jwales at wikia.com
Fri Aug 3 00:28:26 UTC 2007
peter burden wrote:
> This may be the key to fast crawling but I don't think it is the key to
> efficient crawling. Efficient crawling requires attention to all the
> points Jer mentions and many more, most of them require parsing the HTML
> and performing various analyses. I'd rather have an efficient polite
> crawler than a fast crawler - I'd really like a crawler that was both
> of course ;-)
And this is the real potential strength of a distributed approach, I
think. With a small number of crawling machines, you perhaps have to
fetch fetch fetch fetch without a lot of "thinking". But with 10,000 or
100,000 or 1,000,000 machines pitching in?
I am not sure what the best architecture for that will end up being --
that's an empirical question and I don't think we have enough experience
yet, any of us, to really know the answer. So, we move forward and
learn. :)
> In my (limited) experience parsing HTML is not particularly CPU
> intensive - what is a pain is checking for duplicate pages, alias
> host names, detecting various spider traps, deciding whether a
> URL represents a page already fetched and maintaining
> the various data structures.
*nod*
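To make the bookkeeping in that list a bit more concrete, here is a rough sketch in Python of three of the checks: collapsing alias host names, remembering which URLs have already been fetched, and spotting duplicate page bodies. Everything in it is an assumption for illustration only: the HOST_ALIASES table, the names canonical_url and SeenTracker, and the path-depth limit used as a very crude spider-trap guard are all hypothetical, and a real crawler would need far more than this.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table: alternate host names mapped to one canonical host.
HOST_ALIASES = {"www.example.com": "example.com"}

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url):
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    host = HOST_ALIASES.get(host, host)
    # Keep the port only when it differs from the scheme's default.
    if parts.port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment never reaches the server, so drop it.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


class SeenTracker:
    """In-memory record of fetched URLs and page-body digests."""

    def __init__(self, max_depth=20):
        self.seen_urls = set()
        self.seen_digests = set()
        self.max_depth = max_depth  # assumed limit, not a tuned value

    def should_fetch(self, url):
        key = canonical_url(url)
        # Crude trap guard: refuse suspiciously deep paths.
        if urlsplit(key).path.count("/") > self.max_depth:
            return False
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_duplicate_body(self, body):
        """Return True if an identical byte-for-byte body was seen before."""
        digest = hashlib.sha1(body).hexdigest()
        if digest in self.seen_digests:
            return True
        self.seen_digests.add(digest)
        return False
```

This only catches exact duplicates and the simplest aliases. Near-duplicate pages, hosts that are aliases without being listed anywhere, and cleverer traps all need more analysis, and a plain in-memory set would presumably have to become some shared or partitioned structure once many machines are pitching in.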
--Jimbo