[Grub-dev] back working on some grub stuff :) - New workunit format

Bartek Jasicki thindil2 at gmail.com
Mon May 12 12:46:13 UTC 2008


On 2008-05-12 at 12:57:15
Balinny <balinny at gmail.com> wrote:

> Jeremie Miller wrote:
> > Good suggestions on the headers, I also think we need to use the
> > exact same accept encoding as IE since any squid proxies will serve
> > it from the cache then.
> >   
> I don't see why a squid would refuse to give out a gzipped content to
> a user asking for gzip
> content even if the original query was from a user which also
> accepted deflate...
> It's a http-proxy, so i'd expect it to understand compressions and
> even compressing or
> decompressing the cached content by itself (unless disabled by 
> configuration).
> 

After adding this option to the C# client, I ran it in test mode to see
how much of the page content arrives compressed. After crawling around
50k URLs I did not see any deflate compression (everything was gzip).
So I think deflate support can be optional.
If I saw correctly, IE uses different settings: often the same as
Firefox (only gzip and deflate), sometimes additional ones (bzip, bzip2,
compress). I think gzip and deflate alone are enough.
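
To make the gzip/deflate point concrete, here is a minimal sketch (in
Python, not the actual C# client code) of how a client could decode a
response body from its Content-Encoding value. The function name
decode_body is hypothetical. The raw-deflate fallback is an assumption
based on the fact that some servers send "deflate" without the zlib
wrapper.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Return the decoded body for a gzip, deflate, or identity response."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        return body
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # Conforming servers send a zlib-wrapped stream.
            return zlib.decompress(body)
        except zlib.error:
            # Assumption: fall back to a raw deflate stream (no zlib header).
            return zlib.decompress(body, -zlib.MAX_WBITS)
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")
```

If deflate is made optional, the "deflate" branch can simply be dropped
and only gzip advertised in Accept-Encoding.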

> > We'll definitely work on another more advanced alternative format
> > to the workunit one, after the back-end is running a little more  
> > smoothly :)
> >
> > Jer
> Ok, i bite :-)
> After discussing with Bartek how to improve the output ARC format i
> have been also thinking about the input format.
> What do you think about the following one?
> 
> NGW/0.2 100 Go crawling!
>  max-download=1073741824; output-format=arc,zip required;
> User-Agent: GrubNG 20080128
> Accept: text/*
> Accept-Charset: utf-8
> 
> GET http://homepage3.nifty.com/naonaorin/
> If-Modified-since: Wed, 3 Nov 2004 17:21:05 GMT
> 
> GET http://www.miastoplusa.pl/ output-content: headers-only
> 
> GET http://www.vinolentus.nl:1234/
> 
> ...
> 
> PUT http://soap.grub.org:57/arcs/Balinny.bbdacb62d1c82d8f114d71d79f954caea120861c.arc.gz
> Cookie: workunitID=7
> 
> 
> 

[I cut the mail here - for details about the proposed workunit format
please see the earlier mail]

> 
> Opinions?
> 

Overall it looks better, but I'm stubborn and propose doing the same
in XML format ;) It could look like this:

<?xml version="1.0" standalone="no"?>
<workunit version="NGW/0.2">
	<head status="100">
		<option status="optional">max-download=1073741824</option>
		<option status="required">output-format=arc,zip required</option>
		<header name="User-Agent">GrubNG 20080128</header>
	</head>
	<urls>
		<link method="GET">
			<url>http://homepage3.nifty.com/naonaorin/</url>
			<header name="If-Modified-since">Wed, 3 Nov 2004 17:21:05 GMT</header>
		</link>
	</urls>
	<upload method="PUT">
		<url>http://soap.grub.org:57/arcs/Balinny.bbdacb62d1c82d8f114d71d79f954caea120861c.arc.gz</url>
		<header name="Cookie">workunitID=7</header>
	</upload>
</workunit>
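
As a rough illustration of how a client might read such a workunit,
here is a sketch using Python's standard xml.etree.ElementTree. The
element and attribute names follow the example above. The function
name read_workunit, the returned dictionary layout, and the second
link in the sample are assumptions made only for this sketch.

```python
import xml.etree.ElementTree as ET

# Sample mirroring the proposed layout (XML declaration omitted).
XML_SAMPLE = """<workunit version="NGW/0.2">
  <head status="100">
    <option status="optional">max-download=1073741824</option>
    <header name="User-Agent">GrubNG 20080128</header>
  </head>
  <urls>
    <link method="GET">
      <url>http://homepage3.nifty.com/naonaorin/</url>
      <header name="If-Modified-since">Wed, 3 Nov 2004 17:21:05 GMT</header>
    </link>
    <link method="GET">
      <url>http://www.miastoplusa.pl/</url>
    </link>
  </urls>
  <upload method="PUT">
    <url>http://soap.grub.org:57/arcs/Balinny.bbdacb62d1c82d8f114d71d79f954caea120861c.arc.gz</url>
    <header name="Cookie">workunitID=7</header>
  </upload>
</workunit>"""


def read_workunit(xml_text):
    """Parse a workunit and return its version, links, and upload URL."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.findall("./urls/link"):
        # Per-link headers are optional; collect whatever is present.
        headers = {h.get("name"): (h.text or "").strip()
                   for h in link.findall("header")}
        links.append({
            "method": link.get("method"),
            "url": link.findtext("url", "").strip(),
            "headers": headers,
        })
    return {
        "version": root.get("version"),
        "links": links,
        "upload_url": root.findtext("./upload/url", "").strip(),
    }


workunit = read_workunit(XML_SAMPLE)
print(len(workunit["links"]))  # number of links to crawl
```

The number of links falls out of the parsed structure, so nothing
about the workunit size has to be written into the client code.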

And now a little explanation:
Plain text still has the same problem as the old workunit. You must
either hard-code the number of links per workunit (and change the code
every time that number changes) or read the whole file to count the
links. In the new version this can be a little harder than in the old
one, because each block describing a link to crawl can have a different
number of lines. So to count the links to crawl you must scan all the
text. This is the only disadvantage I found in this proposal.
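
For comparison, counting links in the plain-text proposal means
scanning every line. A minimal sketch, assuming every link block
starts with a line beginning with "GET " (as in the quoted example;
the complete format is in the earlier mail and may allow other
methods). The name count_links is hypothetical.

```python
def count_links(workunit_text: str) -> int:
    """Count request blocks by scanning every line of a plain-text workunit."""
    # splitlines() accepts LF, CRLF, and CR line endings alike.
    return sum(1 for line in workunit_text.splitlines()
               if line.startswith("GET "))


PLAIN_SAMPLE = (
    "NGW/0.2 100 Go crawling!\r\n"
    "User-Agent: GrubNG 20080128\r\n"
    "\r\n"
    "GET http://homepage3.nifty.com/naonaorin/\r\n"
    "If-Modified-since: Wed, 3 Nov 2004 17:21:05 GMT\r\n"
    "\r\n"
    "GET http://www.miastoplusa.pl/ output-content: headers-only\r\n"
    "\r\n"
)

print(count_links(PLAIN_SAMPLE))  # prints 2
```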

Making the workunit XML has advantages:
- simpler to parse (most parsers can count elements in an XML file, so
counting the links becomes simpler)
- human readable: with well-named elements the workunit can be easily
understood by everyone
- looks the same on all operating systems (every system uses a different
newline sequence, so CRLF looks good only on Windows; on other systems
the output in ordinary text editors can be very interesting)
- simpler to create: in plain text you still must put the options in
some fixed order. In XML this is not necessary.

Of course, the XML version has disadvantages too:
- More data must be sent to the client: not only whitespace but
element names too
- Slower to parse than plain text (this depends on the library used to
parse the XML file, from a little slower to unusable)
- add your own ;)
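
The size overhead can be estimated by encoding the same link both
ways and comparing byte counts. This is only a sketch: the two
strings are hand-written from the examples in this thread, the XML
is written without indentation (the most favorable case for XML),
and overhead_ratio is a hypothetical helper.

```python
# One link from the plain-text proposal, including the blank separator line.
PLAIN_LINK = (
    "GET http://homepage3.nifty.com/naonaorin/\n"
    "If-Modified-since: Wed, 3 Nov 2004 17:21:05 GMT\n"
    "\n"
)

# The same link in the XML layout, with no whitespace between elements.
XML_LINK = (
    '<link method="GET">'
    "<url>http://homepage3.nifty.com/naonaorin/</url>"
    '<header name="If-Modified-since">Wed, 3 Nov 2004 17:21:05 GMT</header>'
    "</link>"
)


def overhead_ratio(plain: str, xml: str) -> float:
    """Return the XML size divided by the plain-text size, in UTF-8 bytes."""
    return len(xml.encode("utf-8")) / len(plain.encode("utf-8"))


print(f"XML is {overhead_ratio(PLAIN_LINK, XML_LINK):.2f}x the plain-text size")
```

If the workunit is sent gzip-compressed, the repeated element names
should compress well, so the difference on the wire is probably
smaller than this ratio suggests.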

Bartek

