[Grub-dev] LZMA support in upload server
Balinny
balinny at gmail.com
Wed Feb 25 21:27:35 UTC 2009
Bartek Jasicki wrote:
> On 2009-02-25, at 18:57:10
> Balinny <balinny at gmail.com> wrote:
>
>
>>>> It's a two-minute change. Especially given that the last entry
>>>> must be generated on-the-fly anyway.
>>>>
>>>
>>>
>>> That's the same as running new workunits again with the Accept
>>> header ;) If I count correctly, those 2 minutes have now taken
>>> around 1 month ;)
>>>
>> Workunits are pre-generated, but I think the last entry (containing
>> the user name) is made on request?
>>
>>
>
> AFAIK, yes - http://svn.swlabs.org/grubng/trunk/perl/dispatch.cgi
>
See? Two minutes :)
@@ -41,1 +41,1 @@
-print "PUT /arcs/$ENV{REMOTE_USER}.$key.arc.gz HTTP/1.0\r\nHost: soap.grub.org\r\n\r\n";
+print "PUT /arcs/$ENV{REMOTE_USER}.$key.arc HTTP/1.0\r\nHost: soap.grub.org\r\nAccept-Encoding: gzip,lzma\r\n\r\n";
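To make the shape of that last entry concrete, here is a rough sketch of the same change, written in Python rather than the Perl of dispatch.cgi. The function name and its parameters are made up for illustration; only the request line, the Host value and the Accept-Encoding list come from the diff above.

```python
def last_workunit_entry(remote_user, key, encodings=("gzip", "lzma"),
                        host="soap.grub.org"):
    """Build the final workunit entry: the PUT template the client replays.

    Hypothetical helper; dispatch.cgi itself just prints the string.
    """
    lines = [
        # No .gz suffix any more: the encoding is advertised separately.
        f"PUT /arcs/{remote_user}.{key}.arc HTTP/1.0",
        f"Host: {host}",
    ]
    if encodings:
        # Comma-separated list, same spelling as in the diff.
        lines.append("Accept-Encoding: " + ",".join(encodings))
    # A blank line ends the header block.
    return "\r\n".join(lines) + "\r\n\r\n"


if __name__ == "__main__":
    print(repr(last_workunit_entry("someuser", "somekey")))
```

Passing an empty tuple for encodings gives back the old two-line entry (minus the .gz suffix), which is why the list is kept as a parameter here.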
>>> Unfortunately, Jeremie
>>> doesn't have free time to make any changes to the current dispatch
>>> server. So this 2-minute task will probably have to wait a few months
>>> until I start work on a new dispatch server (at the moment I don't
>>> have access to the dispatch server).
>>>
>>>
>> What does the dispatch server really do? Which interface does it use
>> to grab the URLs?
>>
>
> Dispatch server: for me, this is the server which sends workunits to
> users + workunit generator + robots.txt checker (I put all these
> things in one bag).
>
I wasn't taking robots.txt into account. Although if it's not crawling,
there's not much need to check robots.txt.
> About URLs: AFAIK (or again something was changed without a public
> announcement ;) Jeremie, please correct me if I'm wrong), Grub has its
> own database of URLs from which the workunit generator gets the URLs
> to fetch.
>
So, there's a database with URLs. How do I get to the database? There's
workunit.pl, but it just reads URLs from stdin...
> And the only way to get new URLs into the system is via sitemaps
> generated by clients or sent by users (when creating workunits, Grub
> does not connect to Nutch).
>
How sad. Is it really not taking into account the page contents?
>>>> As a bonus, also add Accept-Encoding: gzip, lzma
>>>>
>>>>
>>> 1. Where to add it?
>>>
>>>
>> To the last workunit entry.
>>
>
> But Accept-Encoding only applies to the response from the server, so
> if you add this, the server will send you compressed information about
> the .arc file ;)
>
Good point. It would be a header from the server, but inserted in the
middle of what the client should send to the server :/
Any suggestion on a better way to express that?
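For illustration only, this is how a client could interpret the line as it stands: treat Accept-Encoding in the entry as a hint from the dispatcher, strip it, compress the .arc with the first encoding it supports, and say so with Content-Encoding in the real request. It is a Python sketch, not the Grub client. The function names and the preference order are invented here, "lzma" is not a registered HTTP content coding, and I am assuming it means the legacy .lzma container rather than .xz.

```python
import gzip
import lzma


def _lzma_alone(data):
    # Assumption: "lzma" means the legacy .lzma container, not .xz.
    return lzma.compress(data, format=lzma.FORMAT_ALONE)


# Hypothetical client preference order: lzma first, then gzip.
COMPRESSORS = [("lzma", _lzma_alone), ("gzip", gzip.compress)]


def build_upload(entry, arc):
    """Turn the last workunit entry plus raw .arc bytes into a PUT request."""
    head = entry.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")

    offered, kept = [], []
    for line in header_lines:
        name, _, value = line.partition(":")
        if name.strip().lower() == "accept-encoding":
            # The hint is consumed here and not sent back to the server.
            offered = [t.strip().lower() for t in value.split(",") if t.strip()]
        else:
            kept.append(line)

    body = arc  # fall back to an uncompressed upload
    for coding, compress in COMPRESSORS:
        if coding in offered:
            body = compress(arc)
            kept.append(f"Content-Encoding: {coding}")
            break

    kept.append(f"Content-Length: {len(body)}")
    return "\r\n".join([request_line, *kept]).encode("ascii") + b"\r\n\r\n" + body
```

An entry without the hint simply produces an uncompressed PUT, so clients that know nothing about lzma would keep working with a gzip-only or empty list.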