[Grub-dev] problem of understanding "crawl-corruption" - anyone care to explain?
ab
spam at abittner.de
Mon Feb 4 17:52:48 UTC 2008
Balinny wrote:
> Seems like errors on first results aren't written to the arc.
> I also miss a final \n on all archives, but it isn't on correctly
> uploaded either so the server isn't complaining about it.
thanks for your analysis.
some questions: does it make sense to deny results even if they are
only partly incomplete or missing the answers for certain urls?
why not make use of the results that are inside the file, no matter if
url #x is not in there.
every server replies with an http header and then the content. why not
use these answers?
this needs to be a lot more foolproof and failsafe i think. make use of
every reply/data the servers receive, as much as possible.
don't waste crawling time/resources, or people will not be willing to
help, and sites will start banning constantly recrawling
clients/agent-strings (grub) if the clients act sub-intelligently
and keep recrawling whole sets of urls instead of just recrawling the
urls that are missing from the replyfile, i.e. the ones that went wrong
in the first place.
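
a minimal sketch of the "recrawl only what is missing" idea, in python.
this is not grub's actual code: the function name and the plain url
lists are assumptions for illustration, and how the urls get read out
of the replyfile is left open.

```python
def urls_to_recrawl(assigned_urls, crawled_urls):
    """Return the assigned urls that have no reply yet, in original order.

    assigned_urls: urls handed to the client in one work unit (hypothetical).
    crawled_urls:  urls for which the replyfile already holds a reply.
    """
    seen = set(crawled_urls)
    return [url for url in assigned_urls if url not in seen]


assigned = ["http://a.example/", "http://b.example/", "http://c.example/"]
crawled = ["http://a.example/", "http://c.example/"]

# only the url without a reply is scheduled again
print(urls_to_recrawl(assigned, crawled))
```

with something like this the client would hit each site once more only
for the urls that actually failed, not for the whole set.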
the crawling->resultsaving needs to be much more robust, but the
soapserver should also make use of the valid data it gets and should not
reject the whole rest of like 240 or 24x valid urls/replies just because
of a very few missing.
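
on the server side the same idea could look roughly like the sketch
below. it is only an illustration under assumptions: `split_work_unit`
and `looks_like_http_reply` are made-up names, the replies are assumed
to be available as raw bytes per url, and the validity check (reply
starts with an http status line) is a stand-in for whatever the
soapserver really checks.

```python
def looks_like_http_reply(raw):
    """Very rough validity check: the reply starts with an http status line."""
    return raw.startswith(b"HTTP/")


def split_work_unit(assigned_urls, replies):
    """Keep every usable reply and list the urls that still need a crawl.

    replies: dict mapping url -> raw reply bytes (header + content).
    Returns (accepted, missing), so nothing valid gets thrown away.
    """
    accepted = {}
    missing = []
    for url in assigned_urls:
        raw = replies.get(url)
        if raw and looks_like_http_reply(raw):
            accepted[url] = raw
        else:
            missing.append(url)
    return accepted, missing


assigned = ["http://a.example/", "http://b.example/", "http://c.example/"]
replies = {
    "http://a.example/": b"HTTP/1.1 200 OK\r\n\r\nhello",
    "http://c.example/": b"",  # empty reply, treated as missing
}
accepted, missing = split_work_unit(assigned, replies)
print(sorted(accepted), missing)
```

the server would store `accepted` and hand out only `missing` again,
so a few bad urls do not cost the whole upload.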
:/