[Grub-dev] I would like to help with the grub python client
Bartek Jasicki
thindil2 at gmail.com
Mon Jan 12 09:48:54 UTC 2009
On 2009-01-11, at 16:53:22
Giorgos Logiotatidis <seadog at sealabs.net> wrote:
>
> Yes and no. The client tries to upload the arc file. If a 500 error
> (internal server error) is returned, it tries again, for a total of 3
> times.
>
> If it gets a 401 error, it just fails and deletes the arc file, as you
> point out. The thing is that I don't really know why we get a 401
> error in the first place. I think that the server returns a 401 if
> there is something wrong with the arc file (so we had better delete
> it), but again I'm not sure about that; maybe someone else can shed
> some light on the server part.
>
> I've noticed that almost 10% of the arc files get rejected with a
> 401, and yes, it's a waste. Maybe we can try to implement a re-upload
> functionality in case of a 401, but as far as I know the server part
> is kind of obsolete and is currently being recoded, so maybe it's
> just a server error.
>
OK, a short explanation of why the server throws a 401 error ;)
First: how the server validates .arc.gz files:
1) It starts with a password hashed with the SHA1 algorithm (the key).
2) The server gets a crawled URL from the .arc.gz file and generates a
new SHA1 key from the string "old_key_value host path".
3) It goes back to #2 until all crawled URLs have been checked.
4) It generates a new SHA1 key from the string "old_key_value username".
The username is taken from the .arc.gz filename.
5) It compares the result with the file key (taken from the filename).
If there are any differences, the server throws a 401 error. If
everything is OK, you get a 200 response.
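The five steps above can be sketched in Python roughly like this. It is
only an illustration, not the real server code: the function names, the
hex digest form of the key, the UTF-8 encoding and the single space used
as separator are all my assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def sha1_hex(text):
    # Assumption: keys are passed around as lowercase hex digests.
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def chain_key(password, urls, username):
    """Rebuild the file key from the password, crawled URLs and username."""
    key = sha1_hex(password)                      # step 1
    for url in urls:                              # steps 2 and 3
        parts = urlsplit(url)
        # Assumption: "host" and "path" are the URL's netloc and path.
        key = sha1_hex(f"{key} {parts.netloc} {parts.path}")
    return sha1_hex(f"{key} {username}")          # step 4


def validate(file_key, password, urls, username):
    # Step 5: 200 if the recomputed key matches the one in the filename.
    return 200 if chain_key(password, urls, username) == file_key else 401
```

Because every key depends on the previous one, a single wrong or
misplaced URL changes the final key, which is why the order of the URLs
matters.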
Second: what can cause problems in the client:
- If the client miscounts the length of one crawled page, then when the
server validates the .arc file it does not read the correct crawled URL
but, for example, part of a page, so it generates a bad key. Proposed
solution: have the client validate the .arc file before sending it to
the server. In the client you can only check the order of the URLs and
that the client correctly counts the length of each "record" (crawled
page): read the URLs from the workunit and compare them with the URLs
from the .arc file.
- If the username contains whitespace: almost all clients (except the
C# client in version 0.8.3) don't URL-encode the filename that is sent
to the server. So if your username is (for example) "to. delete", the
server can throw a 401 error too (this is weird ;) ). During tests in
this situation I sometimes got a 401 error and sometimes the correct
400 error (Bad Request).
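A minimal sketch of both client-side checks, in Python. The function
names are made up for illustration, and reading the URLs out of the
workunit and the .arc file is left out because it depends on the
client; here they are simply passed in as lists.

```python
from urllib.parse import quote


def urls_in_same_order(workunit_urls, arc_urls):
    # The .arc file should contain exactly the workunit URLs, in order.
    # If a record length was miscounted, the URLs read back from the
    # .arc file will usually not match here.
    return list(workunit_urls) == list(arc_urls)


def encoded_upload_name(filename):
    # Percent-encode the filename before it goes into the request, so a
    # space in the username is sent as %20 instead of a raw space.
    return quote(filename, safe="")
```

For example, encoded_upload_name("to. delete.key.arc.gz") gives
"to.%20delete.key.arc.gz".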
> I believe that the C# client also fails to upload some arcs, right?
>
Yes, but that is a problem on the server side ;)
Third: what causes problems on the server:
- If the username contains dots, the server reads the wrong username
and key from the sent .arc file. That is because the server splits the
filename on dots to retrieve the username and key. So if you have the
file "thindil.key.arc.gz", all is OK (username: "thindil", key: "key").
But when you have the file "to. delete.key.arc.gz", the whole
validation process fails (username: "to", key: " delete"). Possible
resolution: at this moment only the test server has this bug fixed. If
you want to take part in testing it, you must change the address of the
dispatch server (the server with workunits) from:
http://dispatch.grub.swlabs.org/do/workunit
to
http://dispatch.grub.org/do/workunit/dev
But the test server may have a few other hidden bugs, so please be
careful ;)
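To show the dot problem in a few lines of Python: the first function
mimics the splitting described above, the second is one possible way to
avoid it by parsing from the right. I don't claim the test server does
it this way; the function names are hypothetical, and the second one
assumes the key itself never contains a dot.

```python
def parse_naive(filename):
    # Split on every dot and take the first two fields.
    parts = filename.split(".")
    return parts[0], parts[1]


def parse_from_right(filename, suffix=".arc.gz"):
    # Strip the known suffix, then split only on the LAST dot, so dots
    # inside the username are kept.
    if not filename.endswith(suffix):
        raise ValueError("not an .arc.gz filename: %r" % filename)
    stem = filename[:-len(suffix)]
    username, sep, key = stem.rpartition(".")
    if not sep:
        raise ValueError("no key field in filename: %r" % filename)
    return username, key
```

With "to. delete.key.arc.gz", parse_naive returns ("to", " delete"),
while parse_from_right returns ("to. delete", "key").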
Bartek
--
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net
Jabber: thindil at jabberpl.org