[Grub-dev] back working on some grub stuff :) - New workunit format
Balinny
balinny at gmail.com
Mon May 12 22:14:19 UTC 2008
Bartek Jasicki wrote:
> Yes, but it's simpler to do when you use an XML parser (especially in
> higher-level languages). There isn't much difference between plain
> text and XML in C
Plain text is easier :)
> but in C#, Java or Python this makes a big difference
> when you can read values from a file with 2 functions rather than 50 ;)
It's not my fault that Java needs a dozen BufferedInputStreamFileReaders
to read a file ;)
Jeremie Miller wrote:
> For a newer format I don't want to concentrate on a replacement
> workunit format, but instead look at a complementary one. We need a
> format where we can hand out a hostname and some seed paths, suggested
> headers, probably already parsed robots rules (so we can add
> additional ones), any known if-modified-since and etags, a crawl rate,
> etc. The grub clients using this format would actually "crawl" the
> site(s) then. You could probably convert this format into something
> you could feed into Heritrix, for instance.
>
Having the client perform the crawl itself would be an easy extension
to my proposed format. Parsed robots rules would be harder to achieve.
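
To make the idea concrete, here is a rough sketch of what a crawl-style
workunit could look like as plain "Key: value" text, with a small Python
reader. Everything in it is an assumption made up for illustration: the
key names (Host, Seed, Header, Disallow, ETag, Crawl-Delay), the
CrawlWorkunit class and the parse_workunit/is_allowed helpers are not
part of any existing grub format. The robots rules are reduced to simple
Disallow path prefixes, which is far less than a real robots.txt parser
would handle.

```python
from dataclasses import dataclass, field

# Hypothetical plain-text crawl workunit; all key names are assumptions.
SAMPLE = """\
# crawl workunit (illustrative only)
Host: example.org
Seed: /
Seed: /news/
Header: User-Agent=grub-client
Disallow: /private/
ETag: /news/ "abc123"
Crawl-Delay: 2.5
"""

@dataclass
class CrawlWorkunit:
    host: str = ""
    seeds: list = field(default_factory=list)      # seed paths to start from
    headers: dict = field(default_factory=dict)    # suggested request headers
    disallow: list = field(default_factory=list)   # pre-parsed robots prefixes
    etags: dict = field(default_factory=dict)      # path -> last known ETag
    crawl_delay: float = 1.0                       # seconds between requests

def parse_workunit(text):
    """Read 'Key: value' lines into a CrawlWorkunit, skipping comments."""
    wu = CrawlWorkunit()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "host":
            wu.host = value
        elif key == "seed":
            wu.seeds.append(value)
        elif key == "header":
            name, _, hval = value.partition("=")
            wu.headers[name.strip()] = hval.strip()
        elif key == "disallow":
            wu.disallow.append(value)
        elif key == "etag":
            path, _, tag = value.partition(" ")
            wu.etags[path] = tag.strip()
        elif key == "crawl-delay":
            wu.crawl_delay = float(value)
    return wu

def is_allowed(wu, path):
    """Prefix check against the Disallow list (a simplification)."""
    return not any(path.startswith(prefix) for prefix in wu.disallow)

if __name__ == "__main__":
    wu = parse_workunit(SAMPLE)
    print(wu.host, wu.seeds, wu.crawl_delay)
    print(is_allowed(wu, "/news/today.html"), is_allowed(wu, "/private/x"))
```

A client could send the stored ETag back in an If-None-Match header when
it refetches a path, and add its own Disallow entries to the list as it
discovers them. Wildcards and per-agent sections are left out here, and
they are exactly the part that makes shipping parsed robots rules harder.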