[Grub-dev] problem of understanding "crawl-corruption" - anyone care to explain?

Balinny balinny at gmail.com
Mon Feb 4 18:29:52 UTC 2008


ab wrote:
> thanks for your analysis.
>
> some questions: does it make sense to deny results even if they are
> partly incomplete or missing the answers for certain urls?
>   
If there are no results for some URLs, it is expected to return a 500
answer for them.
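
A minimal sketch of that idea in Python, assuming the side that builds the
result file knows the list of URLs it was handed and has a dict of what it
actually fetched. The function name `fill_missing` and the entry layout
(`status`, `body`) are hypothetical, chosen for illustration only; they are
not Grub's real result format.

```python
def fill_missing(assigned_urls, results, missing_status=500):
    """Return one (url, entry) pair per assigned URL, in the assigned order.

    URLs with no fetched result get a placeholder entry carrying a 500
    status, so no URL is silently absent from the result set.
    """
    complete = []
    for url in assigned_urls:
        entry = results.get(url)
        if entry is None:
            # Hypothetical placeholder: status 500, empty body.
            entry = {"status": missing_status, "body": b""}
        complete.append((url, entry))
    return complete
```

With something like this, a work unit where one fetch failed still yields a
result set covering every URL, with the failed one marked as 500.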

> why not make use of the results that are inside the file, no matter if
> url #x is not in there.
>   
Because then the server is unable to know if it sent you those urls. A 
limitation of the method.

> every server replies with an http-header and then the content. why not
> use these answers?
>   
They are used :?

> this needs to be a lot more foolproof and failsafe i think. make use of
> every reply/data the servers receive, as much as possible.
>   
I don't really understand the question/problem.

> don't waste crawling time/resources, or people will not be willing to
> help, and sites will start banning constantly recrawling
> clients/agent-strings (grub) if the clients are acting sub-intelligently
> and keep recrawling whole sets of urls instead of just recrawling the
> urls that are missing from the replyfile, that went wrong in the first
> place.
>   
That's a problem with the clients, not the server. They are still quite
new. I'm sure they'll support resuming in the future. However, I still
haven't found a case where it is so important. It should be right the
first time. And in the cases where there are problems, a clean run may be
needed.
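
For what it's worth, the client-side resume being asked for could be as
simple as the following Python sketch. It assumes the client keeps the list
of URLs in its work unit and can list the URLs already present in its reply
file; `missing_urls` is a hypothetical helper, not something the current
clients have.

```python
def missing_urls(assigned_urls, reply_urls):
    """Return the assigned URLs that have no entry in the reply file yet.

    Order follows assigned_urls, so a resumed crawl visits the remaining
    URLs in the same order as the original work unit.
    """
    seen = set(reply_urls)
    return [url for url in assigned_urls if url not in seen]
```

A resuming client would then recrawl only that list instead of the whole
work unit.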

> the crawling->resultsaving needs to be much more robust, but also the
> soapserver should make use of the valid data it gets and should not reject
> the whole rest of like 240 or 24x valid urls/replies just because of a
> very few missing.
>   
See above.


