[Grub-dev] next-gen design

jer jeremie at jabber.org
Sat Nov 24 21:16:05 UTC 2007


I'd like to propose a greatly simplified design for the next  
generation of Grub.  I've refined and simplified this about as far as  
I can, but if there are any major missing components do point them  
out :)

Those of you who are familiar with the basic architecture of Grub
know that it's a central server handing out "work units" (lists of
URLs to be crawled), which the clients then check, compress, and send
back to the central server.

Currently this process is mediated through SOAP and carries a lot of
complexity meant to handle the unknown future that Grub faced years
ago.  Given what we want to do with it now, the purpose has become
ultra simple, and so should the protocol.

I'm hoping to start producing sample next-generation "work units",
each a single gzipped text file structured as a series of HTTP
requests.  All a Grub client has to do is fetch a work unit from
grub.org (a simple authenticated HTTP GET) and decompress it.  Then,
for each HTTP request, it looks for the Host: header, extracts the
hostname (and optionally the port), connects, and sends that entire
request header from the work unit.  All of the responses should be
stored exactly as they are (or an error response generated if the
connect failed) in a compressed ARC file
(http://www.archive.org/web/researcher/ArcFileFormat.php).
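
To make the client loop concrete, here is a rough Python sketch of the
steps described above.  It is only an illustration under assumptions
that the proposal does not spell out: that each request header in the
work unit ends with a blank line (CRLF CRLF) as in ordinary HTTP, that
the default port is 80, and that the remote server closes the
connection after responding.  The function names are made up for this
example, and IPv6 literals in Host: are not handled.

```python
import gzip
import socket


def split_requests(work_unit_bytes):
    """Decompress a gzipped work unit and split it into raw request headers.

    Assumption: every request header block is terminated by a blank line.
    """
    text = gzip.decompress(work_unit_bytes)
    blocks = [b for b in text.split(b"\r\n\r\n") if b.strip()]
    # Restore the blank line so each block can be sent as-is.
    return [b + b"\r\n\r\n" for b in blocks]


def host_and_port(request):
    """Return (hostname, port) taken from the Host: header of one request."""
    for line in request.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"host":
            host, _, port = value.strip().decode("ascii").partition(":")
            return host, int(port) if port else 80  # assumed default port
    raise ValueError("request has no Host: header")


def fetch(request, timeout=30):
    """Send one request header verbatim and return the raw response bytes.

    Returns None if the connection or read fails; how an error record
    should look in the ARC file is not defined here.
    """
    host, port = host_and_port(request)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            chunks = []
            while True:
                data = sock.recv(65536)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
            return b"".join(chunks)
    except OSError:
        return None
```

The responses returned by fetch() would then be appended, unmodified,
to the ARC file; writing the ARC records themselves is left out of
this sketch.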

The only special case is the last entry in the work unit: a PUT with
a special pathname that points back to grub.org.  The body of this
request is the ARC file that was created from the responses to all
the previous requests.
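
A minimal sketch of how a client might treat that final entry, again
in Python.  The helper names are hypothetical, and it assumes the PUT
header in the work unit does not already carry a Content-Length, so
the client adds one before attaching the ARC file as the body; the
real work unit format may differ.

```python
def split_upload(requests):
    """Separate the crawl requests from the trailing PUT entry.

    `requests` is a list of raw request headers (bytes), in work unit order.
    """
    if not requests:
        raise ValueError("empty work unit")
    *crawl, last = requests
    method = last.split(b" ", 1)[0]
    if method != b"PUT":
        raise ValueError("last entry of the work unit is not a PUT")
    return crawl, last


def build_upload(put_header, arc_bytes):
    """Attach the ARC file as the body of the final PUT request.

    Assumption: the header has no Content-Length yet, so one is added.
    """
    head = put_header.rstrip(b"\r\n")
    return head + b"\r\nContent-Length: %d\r\n\r\n" % len(arc_bytes) + arc_bytes
```

The client would crawl everything returned in the first list, write
the responses into the ARC file, and then send the output of
build_upload() to the host named in the PUT's Host: header.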

It's important to make this process easier, as very soon we're going
to open up the download area for anyone who wants to grab bulk copies
(ARC format) of the crawled data, as well as storing it in a big open
hbase and hadoop cluster.  The goal is ultimately to take quality
crawling out of the equation of building a search engine, so that
anyone can come and benefit from a shared common crawler.

Jer


