[Search-l] Grub Update
jer
jeremie at jabber.org
Thu Aug 2 18:39:41 UTC 2007
> In which case 'grub' has a long way to go. As far as I can tell from
> the Microsoft Visual C++ code there is no support for robot exclusion.
> "robots.txt" is mentioned in a "todo" list and there's a function that
> recognises "robots.txt" in a URL, but the function doesn't appear to
> be called anywhere.
I believe there's some server side function involved in managing the
robots processing too, but I'm still trying to learn the entire thing
and can't really be authoritative yet... maybe I can see if Igor will
jump in on this thread :)
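
To make the robots.txt point concrete, here is a minimal Python sketch of
a client-side exclusion check using the standard library's
urllib.robotparser. It is not taken from the Grub client; the agent name
"examplebot" and the sample rules are made up for illustration, and a
real crawler would fetch the rules from the site instead of embedding
them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download this
# from http://<host>/robots.txt before fetching anything else.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_body):
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def allowed(parser, agent, url):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
```

A crawler would call allowed() before every fetch and skip any URL for
which it returns False.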
> There's no mention of exclusion via the <meta> tag attributes. The use
> of the "if-modified-since" HTTP request is hinted at in a "todo" list
> but the code doesn't seem to take advantage of this. I've no idea how
> it controls per-server traffic, possibly it relies on the "random"
> selection of sites to "spread the load".
I believe you're right. It also doesn't take advantage of keepalives.
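
For reference, the two missing pieces discussed here (conditional
requests and per-server rate control) can be sketched in a few lines of
Python. This is only an illustration under assumptions, not how Grub
works: HostThrottle, conditional_request and the 5 second delay are
hypothetical names and values, and keepalive handling is left out.

```python
import calendar
from email.utils import formatdate
from urllib.parse import urlsplit
from urllib.request import Request


def conditional_request(url, last_fetch_epoch):
    """Build (but do not send) a GET with an If-Modified-Since header."""
    req = Request(url)
    req.add_header("If-Modified-Since",
                   formatdate(last_fetch_epoch, usegmt=True))
    return req


class HostThrottle:
    """Track the last fetch time per host and report how long to wait."""

    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay  # assumed politeness delay in seconds
        self._last = {}

    def wait_time(self, url, now):
        last = self._last.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, url, now):
        self._last[urlsplit(url).hostname] = now


# Example: the page was last fetched on 2 Aug 2007 at 18:39:41 UTC.
last_seen = calendar.timegm((2007, 8, 2, 18, 39, 41, 0, 0, 0))
req = conditional_request("http://example.com/", last_seen)
```

A server that honours the header can answer 304 Not Modified with no
body, and the throttle keeps two requests to the same host at least
min_delay seconds apart regardless of the order URLs are handed out.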
> Crawling "at random" seems to me a bad idea for a variety of reasons.
> If the randomness implies random URLs on random sites then, in order
> to be well behaved, the crawler needs to fetch the "robots.txt" file
> for each site prior to fetching the actual URL, creating a significant
> extra network and server overhead.
> There will also be overheads from DNS lookups. [Those who have written
> crawlers that I have looked at seem to have found that DNS can
> represent a significant bottleneck.]
>
> Randomness also prevents the use of cookies as a strategy to crawl
> dynamic sites.
*nod*, all agreed.
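
One common way around the per-URL overhead described above is to hand
out work in per-host batches, so that robots.txt, the DNS lookup and any
cookies are dealt with once per host instead of once per URL. A minimal
Python sketch, assuming a hypothetical group_by_host helper (nothing
here comes from the Grub code):

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Bucket URLs by hostname so each host is set up only once."""
    batches = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip anything without a usable hostname
            batches[host].append(url)
    return dict(batches)


# Illustrative input only.
urls = [
    "http://example.com/a",
    "http://example.org/x",
    "http://example.com/b",
    "http://EXAMPLE.com/c",
]
batches = group_by_host(urls)
```

With four URLs on two hosts, a purely random order could need up to four
robots.txt fetches and DNS lookups; the batched form needs two of each.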
> I don't think this theory is necessarily right. If the crawl targets
> have to be distributed from a central controlling server and the
> results sent back then the traffic level on the central server is
> going to be of the same order of magnitude as if the central machine
> crawled directly. The total network traffic will be greater as the
> crawled data has to make two network journeys (one from remote site to
> crawler, one from crawler to central machine).
While I don't want to start a whole thread about this particular
point, I mildly disagree on the most basic part: a bunch of pages
packed together and compressed is a lot easier to just stream/dump
onto a completely dumb file server, so the model isn't complete
duplication even at the purest level.
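
As a rough illustration of "packed together and compressed", the Python
sketch below bundles several crawled pages into one gzip blob that could
be uploaded in a single transfer. The line-delimited JSON layout and the
pack_pages/unpack_pages names are assumptions made for this example, not
Grub's actual wire format.

```python
import gzip
import json


def pack_pages(pages):
    """Serialize {url: body} as one JSON record per line, then gzip it."""
    lines = [json.dumps({"url": url, "body": body})
             for url, body in pages.items()]
    return gzip.compress("\n".join(lines).encode("utf-8"))


def unpack_pages(blob):
    """Inverse of pack_pages: recover the {url: body} mapping."""
    text = gzip.decompress(blob).decode("utf-8")
    records = (json.loads(line) for line in text.splitlines())
    return {rec["url"]: rec["body"] for rec in records}


# Illustrative pages with repetitive markup, which compresses well.
pages = {
    "http://example.com/a": "<html>" + "<p>hello</p>" * 200 + "</html>",
    "http://example.com/b": "<html>" + "<p>world</p>" * 200 + "</html>",
}
blob = pack_pages(pages)
```

The receiving side only has to store the blob; unpacking can happen
later, on whatever machine does the indexing.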
Where I really start to believe in the distributed crawling is when
the clients get more intelligent, recognizing 404 pages, junk pages,
spider traps, common patterns (parked pages), and so on.
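
A hedged Python sketch of what such client-side filtering might look
like follows. Everything in it is an assumption for illustration: the
marker phrases, the depth and repetition limits, and the classify and
looks_like_trap names. Real filters need far more tuning than this.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical phrases; real parked-page detection needs a larger list.
PARKED_MARKERS = ("this domain is for sale", "domain is parked")
SOFT_404_MARKERS = ("page not found",)


def looks_like_trap(url, max_depth=10, max_repeat=3):
    """Flag very deep paths or paths that repeat the same segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(n >= max_repeat for n in Counter(segments).values())


def classify(status, url, body):
    """Return a coarse label for a fetched page."""
    if status == 404:
        return "not_found"
    if looks_like_trap(url):
        return "trap"
    text = body.lower()
    if any(marker in text for marker in PARKED_MARKERS):
        return "parked"
    if any(marker in text for marker in SOFT_404_MARKERS):
        return "soft_404"
    return "ok"
```

Only pages labelled "ok" would need to be shipped back in full; the rest
could be reported as a URL plus label.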
Secondarily, they can do some "lightweight" indexing, breaking out
the links, titles, etc. and providing them in a structured form along
with the compressed package back to the big storage area. IMO it
would be nice if they could perform normalization on the content too,
but that's much more questionable and won't become a thread until
there's a big repository and people trying to work with it.
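
The "lightweight" indexing step could be as small as the Python sketch
below, which pulls the title and absolute link targets out of a page
using the standard library's html.parser. The LinkTitleExtractor class
and the shape of the returned record are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTitleExtractor(HTMLParser):
    """Collect the <title> text and resolved <a href> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(base_url, html):
    """Return a small structured record for one page."""
    parser = LinkTitleExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"url": base_url,
            "title": parser.title.strip(),
            "links": parser.links}
```

A record like this could travel next to the compressed page data, so the
central store gets a link graph without having to parse anything itself.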
> In my experience an equally significant effort is required for setting
> up and tweaking filters to reject unwanted and irrelevant documents
> and avoiding any one of several >>interesting<< spider traps.
Agreed, I'd love to see this knowledge be part of the public commons.
Jer