[Grub-dev] grub dev status update

jer jeremie at jabber.org
Tue Sep 25 21:36:52 UTC 2007


I've been meaning to send out the low-down on all the Grubbing going
on the past month or so, and some ideas for where it's all going.
Feel free to ask if I don't answer anything anyone might want to know
here :)

First, most everyone should have noticed that the global stats are
finally working, and with them up you can see the service still goes
down semi-frequently.  We've got the entire thing "throttled down" as
far as it will go and it's still crawling millions of URLs daily and
filling up the 30GB partition it's caged in for testing :)

So, some things learned about the current Grub system:
	* it's not recursive (doesn't automatically discover/inject new urls)
	* it is capable of obeying robots when injected with them
	* only grabs text/html right now
	* uses a simple checksum to look for changes
	* doesn't track ETag or Last-Modified (pretty major flaws IMO)
	* was over-engineered for modularity
	* uses a rather obtuse SOAP encoding
	* stores crawl results in its own, also obtuse, encoding
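
To make the checksum vs. ETag/Last-Modified point concrete, here is a
minimal sketch in Python. It is not Grub's code: the function names
are made up for illustration, and CRC32 is only a stand-in for
whatever "simple checksum" Grub actually uses. The header names
(If-None-Match, If-Modified-Since) are the standard HTTP conditional
request headers.

```python
import zlib


def content_checksum(body: bytes) -> int:
    """Checksum of a fetched page body (CRC32 is an assumed stand-in)."""
    return zlib.crc32(body)


def has_changed(previous_checksum: int, body: bytes) -> bool:
    """Checksum approach: the whole page must be downloaded before comparing."""
    return previous_checksum != content_checksum(body)


def conditional_headers(etag=None, last_modified=None) -> dict:
    """Validator approach: build headers that let the server answer
    304 Not Modified, so an unchanged page is never downloaded again."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

The difference is where the cost lands: the checksum only detects a
change after the full fetch, while remembering ETag/Last-Modified lets
the server skip sending the body at all.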

Hmm, that's enough pain to start with... so the very first goal was
to get the crawl output into a more usable format, the Internet
Archive ARC format
(http://www.archive.org/web/researcher/ArcFileFormat.php).  That
happened this week, and now the work-unit binary blobs are being
converted into much more useful ARC files automatically, yay!
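
For a rough idea of what one record in such a file looks like, here is
a minimal Python sketch. It assumes the version 1 URL-record layout
from the ARC spec (one space-separated header line of URL, IP address,
14-digit date, content type and length, followed by the raw response).
It is not the actual Grub converter, and the function name and
arguments are hypothetical.

```python
from datetime import datetime


def arc_record(url: str, ip: str, fetched_at: datetime,
               content_type: str, payload: bytes) -> bytes:
    """Build one ARC v1 style record (assumed layout, see lead-in).

    payload is the raw server response (HTTP headers plus body).
    """
    date = fetched_at.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss
    header = f"{url} {ip} {date} {content_type} {len(payload)}\n"
    # A newline separates this record from the next one in the file.
    return header.encode("ascii") + payload + b"\n"
```

A real ARC file also begins with a file-description record, which this
sketch leaves out.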

The next step is to get a lot more URLs loaded.  There are about a
million total right now, basically a random sample, and we're
churning through those a few times a day.  I have extracted over
16 million more URLs from a Wikipedia snapshot, and before they get
loaded they have to go through a robots check/import; that's the goal
for this week.  Once there's a solid base of URLs, the hope is to
then start extracting new/discovered ones from the resulting ARC
files on the output, so the crawl keeps building on itself.
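
The discover-then-robots-check loop could look roughly like the Python
sketch below. Everything here is an assumption for illustration, not
how Grub does it: the names (LinkCollector, discover_urls), the
"grub-sketch" agent string, and the simplification that only links on
the same host as the page are considered, since other hosts would need
their own robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def discover_urls(base_url, html, robots_txt, known, agent="grub-sketch"):
    """Return new same-host URLs found in html that robots_txt allows."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    fresh = []
    for url in collector.links:
        if urlparse(url).netloc != host:
            continue  # other hosts need their own robots.txt
        if url in known or url in fresh:
            continue  # already in the URL base or already discovered
        if robots.can_fetch(agent, url):
            fresh.append(url)
    return fresh
```

Feeding the returned list back in as the next round's input is what
makes the crawl recursive.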

Moving up to the big picture, the overall goal here is to focus Grub
on being completely open on both the input and the output, a shared
crawling resource for use by anyone.  More specifically, to turn the
administration into an open wiki where anyone can suggest new URLs,
review existing URLs, create site policies, and view crawl stats and
samples for any set.  On the output, anyone will be able to grab the
latest cached copies of individual URLs, get entire snapshots/sets as
they happen, or even build custom jobs to filter through and grab
copies of just what they need.  I'll take some time to get all this
together of course :)

Jer


