[Grub-dev] The Open Loop

Jeremie Miller jeremie at jabber.org
Wed May 14 02:25:52 UTC 2008


Right now GrubNG is running in a very simple mode: it takes a list of
URLs and stores a list of responses. There's no intelligence to close
the "loop" and process the results back into better/more URLs.

I've thought about how best to do this for GrubNG at great length,
and while I can imagine solutions of many sorts (and have attempted a
few), I can't come up with any that I'm really happy with :)

So I have a very simple idea that I'd like everyone here, and anyone
interested in a truly open, community-run web crawler, to think about:
leave it OPEN.

It's very simple: anyone can generate a list of URLs to be crawled and
upload it. We'll use a simple Perl script to convert each list of URLs
into our current workunits, which are then handed out. The resulting
ARC files will then be posted publicly for anyone to grab and process
for whatever purpose they want.
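
As a rough illustration of the conversion step, here is a minimal sketch
in Python rather than Perl. It assumes a workunit is nothing more than a
fixed-size batch of URLs; the real workunit format is not described
here, and the names clean_urls and make_workunits and the batch size are
hypothetical. Accepting https as well as http is also an assumption.

```python
from urllib.parse import urlsplit


def clean_urls(lines):
    """Keep well-formed HTTP(S) URLs, dropping blanks, junk and duplicates."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if not url or url in seen:
            continue
        try:
            parts = urlsplit(url)
        except ValueError:
            # urlsplit rejects some malformed hosts; skip those lines.
            continue
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        seen.add(url)
        urls.append(url)
    return urls


def make_workunits(urls, size=1000):
    """Split the cleaned list into batches of at most `size` URLs.

    Assumption: a workunit is simply a batch of URLs. The actual
    GrubNG workunit format is not specified in this post.
    """
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Each batch could then be written out in whatever form the clients
expect. Filtering before batching keeps bad input out of every
workunit instead of leaving it to each client to handle.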

There will be numerous and obvious issues we'll have to sort through  
such as: approving the submitted URLs, making sure nobody is abusing  
the downloading of the bulk ARCs, prioritization of different sets of  
URLs, separating static from dynamic URLs, detecting injections,  
stats, and so on. I really think we should focus on doing just the
one thing well first (distributing the work of turning URLs into
ARCs), and deal with the rest of these after that.

To take this first step and fully expose what we currently have as an
Open Loop, I'll soon be posting two things:
	- A URL where anyone w/ a user+pass can PUT a file that contains a
flat list of HTTP URLs (one per line), plus the resulting directory of
any files that have been uploaded.
	- A URL to a directory structure that will contain one "index" file
per uploaded ARC (each of which contains the URL to the actual
contributed ARC; format TBD, but it will be minimal to start).
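
For the upload side, a client might look something like the sketch
below. It assumes the user+pass is checked with HTTP Basic
authentication, which is a guess; the endpoint is not posted yet, so
any URL passed in is a placeholder, and build_upload_request is a
hypothetical helper name.

```python
import base64
import urllib.request


def build_upload_request(endpoint, url_list, user, password):
    """Build (but do not send) a PUT request carrying one URL per line.

    Assumptions: HTTP Basic auth and a plain-text body. Neither is
    confirmed by this post.
    """
    body = "\n".join(url_list).encode("utf-8") + b"\n"
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="PUT",
        headers={
            "Authorization": "Basic " + token,
            "Content-Type": "text/plain; charset=utf-8",
        },
    )
```

Sending it would then be a matter of calling urllib.request.urlopen on
the returned request once a real endpoint exists. Building the request
separately from sending it makes it easy to inspect before anything
goes over the wire.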

From these two points, *anyone* can write a script to process the
data and discover new URLs, build up statistics, determine
prioritization/spam, look for injections, etc.
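
One such script, for the "discover new URLs" case, could be as small
as the sketch below. It assumes the HTML body of a fetched page has
already been pulled out of an ARC file (ARC parsing is not shown), and
the names LinkCollector and discover_new_urls are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def discover_new_urls(base_url, html, known):
    """Return HTTP(S) links in `html` that are not yet in `known`.

    `known` is a set of already-seen URLs and is updated in place, so
    the same set can be reused across many pages.
    """
    parser = LinkCollector(base_url)
    parser.feed(html)
    fresh = []
    for link in parser.links:
        if link.startswith(("http://", "https://")) and link not in known:
            known.add(link)
            fresh.append(link)
    return fresh
```

The returned list is already in the one-URL-per-line shape that the
upload side expects, which is what would let a script like this feed
its output back in as the next round of input.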

From this simple input and output, an open loop, anyone can help
close it and explore all the different potential ways of improving the
process.  GrubNG will finally be moving to the next stage :)

Jer


