[Grub-dev] back working on some grub stuff :) - New workunit format

Balinny balinny at gmail.com
Mon May 12 22:14:19 UTC 2008


Bartek Jasicki wrote:

> Yes, but it is simpler to do with an XML parser (especially in
> higher-level languages). There is not much difference between plain
> text and XML in C
Plain text is easier :)
> but in C#, Java or Python it makes a big difference
> when you can read values from a file with 2 functions rather than 50 ;)
It's not my fault that Java needs a dozen BufferedInputStreamFileReaders 
to read a file ;)
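
To make the comparison concrete, here is a minimal Python sketch that
reads the same two values from a plain-text and from an XML workunit.
The field names (host, rate) and both layouts are assumptions made up
for illustration; they are not the actual Grub workunit format.

```python
import xml.etree.ElementTree as ET

# Hypothetical workunit contents; field names are illustrative only.
PLAIN = "host=example.org\nrate=2\n"
XML = "<workunit><host>example.org</host><rate>2</rate></workunit>"


def parse_plain(text):
    """Parse 'key=value' lines into a dict, skipping blank lines."""
    return dict(
        line.split("=", 1) for line in text.splitlines() if line.strip()
    )


def parse_xml(text):
    """Parse a flat XML document into a dict of tag -> text."""
    return {child.tag: child.text for child in ET.fromstring(text)}


plain_fields = parse_plain(PLAIN)
xml_fields = parse_xml(XML)
```

For a flat list of values both parsers come out at about the same
length in Python, so the difference would mostly show up with nested
or repeated data.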



Jeremie Miller wrote:
> For a newer format I don't want to concentrate on a replacement  
> workunit format, but instead look at a complementary one.  We need a  
> format where we can hand out a hostname and some seed paths, suggested  
> headers, probably already parsed robots rules (so we can add  
> additional ones), any known if-modified-since and etags, a crawl rate,  
> etc.  The grub clients using this format would actually "crawl" the  
> site(s) then.  You could probably convert this format into something  
> you could feed into heritrix for instance.
>   
Having the client perform the crawling would be an easy extension of my
proposed format. Parsed robots rules would be harder to achieve.
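
As a rough sketch of what such a crawl-oriented workunit could carry,
here is one possible in-memory shape in Python. Everything in it is an
assumption for illustration: the class name CrawlUnit, its field names,
and the representation of parsed robots rules as ordered
(allow, path-prefix) pairs with first-match-wins semantics. It is not
taken from either proposed format.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlUnit:
    """Hypothetical crawl workunit; all field names are illustrative."""
    host: str
    seed_paths: list
    headers: dict = field(default_factory=dict)       # suggested headers
    robots_rules: list = field(default_factory=list)  # (allow, prefix)
    validators: dict = field(default_factory=dict)    # path -> (etag, date)
    crawl_delay: float = 1.0                          # seconds per request

    def allowed(self, path):
        """First matching prefix rule wins; no match means allowed."""
        for allow, prefix in self.robots_rules:
            if path.startswith(prefix):
                return allow
        return True

    def request_headers(self, path):
        """Suggested headers plus any known validators for this path."""
        headers = dict(self.headers)
        etag, last_modified = self.validators.get(path, (None, None))
        if etag:
            headers["If-None-Match"] = etag
        if last_modified:
            headers["If-Modified-Since"] = last_modified
        return headers


unit = CrawlUnit(
    host="example.org",
    seed_paths=["/", "/private/x"],
    robots_rules=[(False, "/private/")],
    validators={"/": ('"abc"', "Mon, 12 May 2008 22:14:19 GMT")},
)
to_fetch = [p for p in unit.seed_paths if unit.allowed(p)]
```

Keeping the rules as plain data means whoever hands out the unit can
append extra rules before sending it, and the client only has to do
prefix checks.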



