[Grub-dev] grub dev status update

jer jeremie at jabber.org
Fri Sep 28 07:45:03 UTC 2007


> Great news! Thank you for all the work :)

Thanks!  I hope soon everyone will benefit from it :)

> . If ticket #1 (http://dev.grub.org/cgi-bin/trac.cgi/ticket/1) were to
> get fixed, I plan on developing a crawler that can handle many of these
> issues (some are still dependent on the server providing things like
> the ETag and last modified).

Yeah, #1 has to be found before it can get fixed, and the ancient rev
of gsoap isn't helping things either.

And absolutely, ETag and Last-Modified (LM) need immediate attention.
I'm very unhappy that they weren't baked in already...
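
A rough sketch of what tracking those two validators could look like on
the crawler side. This is purely illustrative and not Grub code: the
function names are made up, and it assumes the ETag and Last-Modified
values from the previous crawl of a URL were stored somewhere. It sends
them back as If-None-Match / If-Modified-Since and treats a 304 as
"unchanged, nothing to download":

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build validator headers from values saved on the previous crawl."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    req = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (
                resp.read(),
                resp.headers.get("ETag"),
                resp.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Server confirmed our stored copy is still current.
            return None, etag, last_modified
        raise
```

As noted in the quoted text, this only helps when the server actually
provides those headers.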

>
>> 	* it's not recursive (doesn't automatically discover/inject new urls)
>> 	* it is capable of obeying robots when injected with them
>> 	* only grabs text/html right now
>> 	* uses a simple checksum to look for changes
>> 	* doesn't track ETag or Last-Modified (pretty major flaws IMO)
>> 	* was over-engineered for modularity
>> 	* uses a rather obtuse SOAP encoding
>> 	* stores crawl results in its own also obtuse encoding
>
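
For reference, checksum-based change detection of the kind listed above
can be sketched in a few lines. This is a hypothetical illustration, not
the current client's code: the algorithm Grub actually uses isn't stated
here, so SHA-256 and both function names are assumptions. It is still the
fallback when a server sends neither ETag nor Last-Modified, but note
that the page has to be downloaded in full before it can be compared:

```python
import hashlib


def content_checksum(body: bytes) -> str:
    """Hex digest of the fetched page body (SHA-256 is an assumption)."""
    return hashlib.sha256(body).hexdigest()


def has_changed(body: bytes, previous_checksum=None) -> bool:
    """True if the page is new or its checksum differs from the stored one."""
    if previous_checksum is None:
        return True
    return content_checksum(body) != previous_checksum
```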
> I had originally considered making my own crawler, but I was just
> noticing that archive.org already has a crawler that speaks Arc
> (http://crawler.archive.org/). "Heritrix is the Internet Archive's
> open-source, extensible, web-scale, archival-quality web crawler."
> I think it would be trivial to make this work for Grub.
> ...

Agreed, I actually have Heritrix running somewhere on the swlabs
servers already, so I'm pretty familiar with it.

One of the important discussions that needs to happen soon here is  
how to get a parallel API that is both smarter and simpler up and  
running as soon as possible.  It's kind of like getting back to the  
roots of what Grub was when it began, just a distributed crawler,  
pure and simple, and cleaning up some of the overhead that it  
collected after it went closed-source.

As soon as we know we can get output smoothly from the current  
system, we'll be talking a lot more about this next step :)

Jer

