[Grub-dev] problem of understanding "crawl-corruption" - anyone care to explain?
Balinny
balinny at gmail.com
Mon Feb 4 18:29:52 UTC 2008
ab wrote:
> thanks for your analysis.
>
> some questions: does it make sense to deny results even if they are
> only partly incomplete or missing the answers for certain urls?
>
If there are no results for some URLs, a 500 answer is expected to be
returned for them.
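
A rough illustration of that convention (a sketch only, not the actual
Grub client code): assuming the client keeps the list of URLs it was
assigned and a dict of the results it actually obtained, it could fill
each gap with a placeholder 500 entry before uploading. The helper name
`fill_missing` and the result layout are hypothetical.

```python
def fill_missing(assigned_urls, results):
    """Return one result per assigned URL, in assignment order.

    `results` maps url -> (status, body). URLs with no result get a
    placeholder (500, b"") entry. The layout and this helper are
    assumptions for illustration, not the real Grub protocol.
    """
    complete = []
    for url in assigned_urls:
        status, body = results.get(url, (500, b""))
        complete.append((url, status, body))
    return complete


if __name__ == "__main__":
    assigned = ["http://example.com/a", "http://example.com/b"]
    fetched = {"http://example.com/a": (200, b"<html>ok</html>")}
    for url, status, _body in fill_missing(assigned, fetched):
        print(url, status)
```

This way the uploaded file always covers every assigned URL, and a
missing page shows up as an explicit failure rather than a hole.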
> why not make use of the results that are inside the file, no matter if
> url #x is not in there.
>
Because then the server cannot tell whether it sent you those URLs. That
is a limitation of the method.
> every server replies with a http-header and then the content. why not
> using these answers?
>
They are used :?
> this needs to be a lot more foolproof and failsafe i think. make use of
> every reply/data the servers receive, as much as possible.
>
I don't really understand the question/problem.
> don't waste crawling time/resources, or people will not be willing to
> help, and sites will start banning constantly recrawling
> clients/agent-strings (grub) if the clients are acting sub-intelligently
> and keep recrawling whole sets of urls instead of just recrawling the
> urls that are missing from the replyfile, that went wrong in the first
> place.
>
That's a problem with the clients, not the server. They are still quite
new. I'm sure they'll support resuming in the future. However, I still
haven't found a case where it is that important. It should be right the
first time. And in the cases where there are problems, a clean run may
be needed.
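
For what resuming could look like on the client side, here is a minimal
sketch. It assumes the client remembers which URLs it was assigned and
which results it already holds. The names `urls_to_retry`,
`resume_crawl` and the `fetch` callable are hypothetical and not part of
any existing Grub client.

```python
def urls_to_retry(assigned_urls, results):
    """URLs that have no result yet, in assignment order (hypothetical)."""
    return [url for url in assigned_urls if url not in results]


def resume_crawl(assigned_urls, results, fetch):
    """Fetch only the missing URLs and merge them into `results`.

    `fetch` is any callable url -> (status, body); it stands in for the
    client's real download routine, which is not shown here.
    """
    for url in urls_to_retry(assigned_urls, results):
        results[url] = fetch(url)
    return results
```

Only the URLs without a stored result are requested again, so pages
that were already crawled successfully are not hit a second time.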
> the crawling->resultsaving needs to be much more robust, but also the
> soapserver should make use of the valid data it gets and should not reject
> the whole rest of like 240 or 24x valid urls/replies just because of a
> very few missing.
>
See above.