[Grub-dev] back working on some grub stuff :) - New workunit format

Jeremie Miller jeremie at jabber.org
Mon May 12 17:36:23 UTC 2008


We'll be keeping the existing workunit format around, so the
text-based format will always be available for clients where that is
easier to implement.

As for more advanced formats that have the same advantages as XML
(libraries that do all the parsing work in a few function calls), I am
a much bigger fan of JSON right now :)

For a newer format, I don't want to concentrate on a replacement
workunit format, but instead look at a complementary one.  We need a
format where we can hand out a hostname and some seed paths, suggested  
headers, probably already parsed robots rules (so we can add  
additional ones), any known if-modified-since and etags, a crawl rate,  
etc.  The grub clients using this format would actually "crawl" the  
site(s) then.  You could probably convert this format into something
you could feed into Heritrix, for instance.
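
A minimal sketch of what such a complementary workunit might look like
as JSON, written in Python. Every field name here (host, seeds, headers,
robots, known, crawl_delay_seconds) and both helper functions are
assumptions made for illustration; nothing below is an agreed Grub
format.

```python
import json

# Hypothetical crawl workunit; all field names are illustrative only.
workunit = {
    "host": "example.org",
    "seeds": ["/", "/news/"],
    "headers": {"User-Agent": "grub-client-sketch/0.1"},
    # Already-parsed robots rules, so extra ones can simply be appended.
    "robots": {"disallow": ["/private/"]},
    # Validators known from earlier fetches, keyed by path.
    "known": {
        "/": {
            "etag": "\"abc123\"",
            "if_modified_since": "Mon, 12 May 2008 17:36:23 GMT",
        },
    },
    "crawl_delay_seconds": 2.0,
}


def allowed(path, robots):
    """Return True unless the path starts with a disallowed prefix."""
    return not any(path.startswith(p) for p in robots.get("disallow", []))


def conditional_headers(unit, path):
    """Build request headers for a path, adding validators when known."""
    headers = dict(unit.get("headers", {}))
    known = unit.get("known", {}).get(path, {})
    if "etag" in known:
        headers["If-None-Match"] = known["etag"]
    if "if_modified_since" in known:
        headers["If-Modified-Since"] = known["if_modified_since"]
    return headers


# Serializing and parsing each take one library call.
text = json.dumps(workunit, indent=2)
parsed = json.loads(text)
```

A client would walk the seeds, skip paths rejected by allowed(), send
the headers from conditional_headers(), and wait crawl_delay_seconds
between requests.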

I'd also like to spend some time thinking about involving the user:
what if we can identify search forms, or let the user select the most
important links to follow? I've not thought much about this, but there
could be some useful things here to take into account.

Just some ideas... :)

Jer

On May 12, 2008, at 11:04 AM, Bartek Jasicki wrote:

> On 2008-05-12 at. 17:32:42
> Balinny <balinny at gmail.com> wrote:
>
>> Bartek Jasicki wrote:
>>> And now a little explanation:
>>> Plain text still has the same problem as the old workunit. You must
>>> either hard-code the number of links in one workunit (and change the
>>> code every time that number changes) or read the whole file to count
>>> the links. In the new version this can be a little harder than in
>>> the old one, because every block with a link to crawl can have a
>>> different number of lines. Thus, to count the links to crawl, you
>>> must check all of the text. This is the one disadvantage I find in
>>> this proposal.
>> Just count the number of lines beginning with GET.
>> $ grep "^GET " workunit.1 | wc -l
>> Should give you the count, should you need it.
>>
>> I wonder how you are going to count it with XML without reading
>> the entire file ;)
>>
>> I don't see that having a different number of lines is such a
>> problem. Currently it is one, even for my code, which will need to
>> be changed when Jer adds more headers. By removing the headers with
>> magic meanings like host:, it's much simpler to treat them all the
>> same in a loop until the line is empty.
>>
>
> Yes, but it is simpler when you use an XML parser (especially in
> higher-level languages). There is not much difference between plain
> text and XML in C, but in C#, Java or Python it makes a big difference
> when you can read values from a file with 2 functions rather than 50 ;)
>
>>> Making the workunit XML has advantages:
>>> - simpler to parse (most parsers can count elements in an XML file,
>>> so counting the links becomes simpler)
>>>
>> Only if you have a ready-to-use XML library. It's slower because it's
>> more complex.
>> I look forward to your shell script client for XML workunits.
>>
>
> Making a shell script can be "a little" harder with XML, but as I wrote
> earlier, creating new clients in high-level languages can be easier.
>
>
>>> - human readable - with well-named elements, a workunit can be easily
>>> understood by everyone
>>>
>> I find it quite readable, perhaps a bit harder to fully understand
>> and write. But anyone wishing to mess with workunits should be able
>> to understand that http://homepage3.nifty.com/naonaorin/ is a URL
>> and User-Agent: a header.
>> Plus, there's documentation ;)
>>
>
> Documentation for XML files can be smaller and simpler ;)
>
>>> - looks the same on all operating systems (every system uses a
>>> different newline sequence, so CRLF looks good only on Windows; on
>>> other systems the output in normal text editors can be very interesting)
>>>
>> Not so much. Most Linux editors accept CRLF perfectly. Perhaps too
>> happily. If I want to show the CR, I usually go to old vi.
>
> And I'm still nagging ;) Of course I'm not forcing anyone to use XML,
> but I think it can be a better way for more complex workunits, and
> more flexible than plain text.
>
> Bartek
> _______________________________________________
> Grub-dev mailing list
> Grub-dev at wikia.com
> http://lists.wikia.com/mailman/listinfo/grub-dev
>
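
The plain-text approach from the quoted thread (count the lines
beginning with GET, and read each block's headers in a loop until an
empty line) could be sketched in Python as below. The exact block
layout, the parse_workunit name, and the sample content are assumptions
for illustration, not the real Grub workunit format.

```python
def parse_workunit(text):
    """Split a text workunit into entries: one GET line plus its headers."""
    entries = []
    current = None
    for line in text.splitlines():  # splitlines() handles both LF and CRLF
        if line.startswith("GET "):
            current = {"url": line[4:].strip(), "headers": {}}
            entries.append(current)
        elif not line.strip():
            current = None  # an empty line closes the current block
        elif current is not None and ":" in line:
            # Every header is treated the same way; none has a magic meaning.
            name, _, value = line.partition(":")
            current["headers"][name.strip()] = value.strip()
    return entries


# Assumed sample layout: GET line, header lines, then an empty line.
sample = (
    "GET http://homepage3.nifty.com/naonaorin/\r\n"
    "User-Agent: grub-client-sketch/0.1\r\n"
    "\r\n"
    "GET http://example.org/\r\n"
    "\r\n"
)

entries = parse_workunit(sample)
link_count = len(entries)  # same count as: grep "^GET " workunit.1 | wc -l
```

Either way the whole file is read once; the link count falls out of the
parse instead of being hard-coded in the client.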


