[Grub-dev] next-gen design
jer
jeremie at jabber.org
Sat Nov 24 21:16:05 UTC 2007
I'd like to propose a greatly simplified design for the next
generation of Grub. I've refined and simplified this about as far as
I can, but if there are any major missing components do point them
out :)
Those of you who are familiar with the basic architecture of Grub
know that it's a central server handing out "work units", or lists of
URLs to be crawled, which the clients then fetch, compressing the
results and sending them back to the central server.
Currently this process is mediated through SOAP and carries a lot of
complexity that was meant to handle the unknown future Grub faced
years ago. Given what we want to do with it today, the purpose has
become ultra simple, and so should the protocol.
I'm hoping to start producing sample next-generation "work units"
that are a single gzipped text file structured as a series of HTTP
requests. All a Grub client has to do is fetch the work units from
grub.org (simple authenticated HTTP GET) and decompress, then take
each HTTP request, look for the Host: header and extract the hostname
(and optionally port), connect, and send that entire header from the
work unit. All of the responses should be stored exactly as they are
(or an error response generated if the connect failed) in a
compressed ARC file
(http://www.archive.org/web/researcher/ArcFileFormat.php).
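
The client steps above could be sketched roughly as follows. This is
only an illustration under assumptions the proposal does not spell
out: that each request in the work unit is header-only and terminated
by a blank line (CRLF CRLF), that the default port is 80, that the
requests ask the server to close the connection when done, and that a
failed connect is reported as None. The function names are made up for
the example, and IPv6 literals in Host: are not handled.

```python
import gzip
import socket


def split_requests(work_unit_gz):
    """Decompress a work unit and split it into individual requests.

    Assumption: requests are header-only and end with CRLF CRLF.
    """
    raw = gzip.decompress(work_unit_gz)
    parts = raw.split(b"\r\n\r\n")
    # Re-attach the blank line so each entry is a complete header block.
    return [p + b"\r\n\r\n" for p in parts if p.strip()]


def host_and_port(request, default_port=80):
    """Extract the hostname (and optional port) from the Host: header."""
    for line in request.split(b"\r\n")[1:]:  # skip the request line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host, _, port = value.strip().decode("ascii").partition(":")
            return host, int(port) if port else default_port
    raise ValueError("request has no Host header")


def send_raw(request, timeout=30.0):
    """Connect, send the header as-is, and return the raw response bytes.

    Returns None if the connection fails or times out (assumption: the
    caller turns that into whatever error record is wanted).
    """
    host, port = host_and_port(request)
    chunks = []
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            while True:
                data = sock.recv(65536)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
    except OSError:
        return None
    return b"".join(chunks)
```

A client loop would call split_requests() once per work unit and
send_raw() on every entry except the last, keeping each response
byte for byte for the ARC file.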
The only special case is the last entry in the work unit: a PUT with
a special pathname that points back to grub.org. The body of this
request is the ARC file that was created from the responses to all
the previous requests.
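
For illustration, handling that final entry might look something like
the sketch below. It assumes the PUT header in the work unit carries
no Content-Length of its own, so the client appends one before the
body; the proposal does not say either way. The function names are
hypothetical, and the ARC data is treated as opaque bytes.

```python
def split_upload(requests):
    """Separate the crawl requests from the trailing PUT entry."""
    if not requests:
        raise ValueError("empty work unit")
    *crawl, last = requests
    if not last.startswith(b"PUT "):
        raise ValueError("last entry is not a PUT")
    return crawl, last


def build_upload(put_header, arc_data):
    """Attach the ARC file as the body of the PUT request.

    Assumption: the client adds Content-Length itself.
    """
    head = put_header.rstrip(b"\r\n")  # drop the terminating blank line
    length = b"\r\nContent-Length: %d\r\n\r\n" % len(arc_data)
    return head + length + arc_data
```

The bytes returned by build_upload() can then be sent to the host
named in the PUT's Host: header, the same way as any other entry.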
It's important to make this process easier because very soon we're
going to open up the download area for anyone who wants to grab bulk
copies (ARC format) of the crawled data, as well as storing it in a
big open HBase and Hadoop cluster. The goal is ultimately to take
quality crawling out of the equation of building a search engine, so
that anyone can come and benefit from a shared common crawler.
Jer