[Grub-dev] back working on some grub stuff :)

Jeremie Miller jeremie at jabber.org
Mon May 12 03:05:04 UTC 2008


Good suggestions on the headers. I also think we need to send exactly
the same Accept-Encoding as IE, since any Squid proxies will then serve
the response from their cache.
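
For illustration, a minimal Python sketch of the default headers under
discussion. The function name default_headers is hypothetical. The
Accept value is the one Bartek proposed in the quoted thread below, and
the Accept-Encoding string "gzip, deflate" is only an assumption about
what IE sends; it would have to be checked against a real IE request,
since a cache that varies on Accept-Encoding may treat a differently
formatted value as a different variant.

```python
# Hypothetical helper: builds the default request headers for a crawl.
DEFAULT_HEADERS = {
    # Value proposed in the quoted thread.
    "Accept": "application/xhtml+xml, text/html, text/*, application/pdf",
    # Assumption: must be byte-for-byte what IE sends to share cache entries.
    "Accept-Encoding": "gzip, deflate",
}


def default_headers(extra=None):
    """Return a fresh copy of the defaults, optionally with extra headers."""
    headers = dict(DEFAULT_HEADERS)
    if extra:
        headers.update(extra)
    return headers


if __name__ == "__main__":
    for name, value in default_headers().items():
        print(f"{name}: {value}")
```

Returning a copy keeps per-request additions (an If-Modified-Since
value, say) from leaking into the shared defaults.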

We'll definitely work on another, more advanced alternative format to
the workunit one, after the back-end is running a little more
smoothly :)

Jer

On May 11, 2008, at 10:27 AM, Bartek Jasicki wrote:

> On 2008-05-11 at 15:33:02
> Balinny <balinny at gmail.com> wrote:
>
>> Bartek Jasicki wrote:
>>> On 2008-05-10 at 18:14:38
>>> Jeremie Miller <jeremie at jabber.org> wrote:
>>>
>>>
>>>> I've been toying with re-generating a bunch of workunits, and I'd
>>>> like to include this header:
>>>>
>>>> 	Accept: text/html
>>>>
>>>> Are there any other headers that we should include by default?
>>>>
>>>
>>> Maybe change this to:
>>>
>>> 	Accept: text/html, text/*
>>>
>>> Then the crawler can fetch all text types but prefer HTML. Later the
>>> client or server can convert, for example, .pdf, .doc or .odf files
>>> to .html files. Plus there is still the problem of servers which do
>>> not send any content type headers.
>>>
>> Better, but keep in mind that those formats wouldn't be delivered:
>> pdf is application/pdf, doc is application/vnd.ms-word and odf is
>> vnd.oasis.opendocument
>> <http://www.iana.org/assignments/media-types/application/vnd.oasis.opendocument.text-web>.
>>
>
> You are right, my mistake, sorry.
>
>> I think this would be preferred:
>> Accept: application/xhtml+xml, text/html, text/*,
>> application/vnd.oasis.opendocument.*, application/pdf,
>> application/vnd.ms-word
>>
>> Problem: a partial wildcard doesn't seem to be in the standards,
>> though it could be sensible to use if understood by some major
>> servers. And there are a lot of opendocument MIME types...
>>
>
> Then maybe for now we use only pdf?
>
> Accept: application/xhtml+xml, text/html, text/*, application/pdf
>
> I think this may be enough for now.
>
>>> Maybe add:
>>>
>>> 	Accept-Encoding: gzip,deflate
>>>
>>> too? For now the C# client has it by default. This can save
>>> bandwidth and, if I saw correctly, some servers only give the
>>> content when it is set (example: Wikipedia returns one page on an
>>> uncompressed connection and a different one on a compressed one).
>>>
>> Then the arc could contain the page body compressed. If the arc is
>> supposed to contain the page uncompressed, it shouldn't appear on
>> the workunit (but could be added by the client).
>> deflate is probably not needed, as very few servers support it.
>>
>>
>
> Or the .arc file can contain the page body uncompressed too. I based
> this on the same option in any modern web browser: a compressed
> connection to the server, with the response decompressed after it is
> received and before it is shown to the user. The arc file is
> compressed before it is sent to the server, so IMHO compressing the
> page body and later the whole .arc file would be a little weird and
> make more work for the server.
> Plus this proposal is entirely optional; I don't want to force it
> into the basic client options ;)
>
>
> And one other thing, about the If-Modified-Since header. If we want
> to talk about modifying the workunit format (which would be needed
> for the If-Modified-Since header), how about changing it from plain
> text to an XML file? I think this would be easier for clients to
> parse and more flexible.
>
> Bartek
> _______________________________________________
> Grub-dev mailing list
> Grub-dev at wikia.com
> http://lists.wikia.com/mailman/listinfo/grub-dev
>
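
To illustrate the Accept-Encoding discussion quoted above, here is a
rough Python sketch of a client decompressing a response body before it
goes into the arc, so the arc holds the page uncompressed. The function
name decode_body is hypothetical, and none of this is taken from the
actual Grub client. The raw-deflate fallback is there because some
servers are known to send "deflate" bodies without the zlib wrapper.

```python
import gzip
import zlib
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the Content-Encoding of a response body, if there is one."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # Standard form: deflate data inside a zlib wrapper.
            return zlib.decompress(body)
        except zlib.error:
            # Fallback: raw deflate stream with no zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not handle: pass through as-is.
    return body


if __name__ == "__main__":
    page = b"<html><body>hello</body></html>"
    assert decode_body(gzip.compress(page), "gzip") == page
    assert decode_body(page, None) == page
```

Decoding on the client side means the workunit does not need to say
anything about compression; the header stays a transport detail.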


