[Grub-dev] The Open Loop
Jeremie Miller
jeremie at jabber.org
Wed May 14 02:25:52 UTC 2008
Right now GrubNG is running in a very simple mode: it takes a list of
URLs and stores a list of responses. There's no intelligence to close
the "loop" and process the results back into better/more URLs.
I've thought about how best to do this for GrubNG at great length,
and while I can imagine many sorts of solutions (and have attempted a
few), I can't come up with any that I'm really happy with :)
So I have a very simple idea that I'd like everyone here and anyone
interested in a truly open, community-run web crawler to think about:
leave it OPEN.
It's very simple: anyone can generate a list of URLs to be crawled and
upload it. We'll use a simple Perl script to convert a list of URLs
into our current workunits, which are then handed out. The resulting
ARC files will then be posted publicly for anyone to grab and process
for whatever purpose they want.
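
As a rough sketch of the first half of that conversion, here is one way
to clean a flat URL list and group it into batches. It is written in
Python rather than Perl purely for illustration. The names clean_urls
and make_batches and the batch size are hypothetical, and the actual
workunit format is not described here, so the sketch stops at plain
lists of URLs. Accepting https:// as well as http:// is also an
assumption.

```python
from typing import Iterable, Iterator, List


def clean_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield unique HTTP(S) URLs from a one-per-line list, in input order."""
    seen = set()
    for line in lines:
        url = line.strip()
        # Skip blank lines, non-HTTP entries, and repeats.
        if not url.lower().startswith(("http://", "https://")) or url in seen:
            continue
        seen.add(url)
        yield url


def make_batches(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group cleaned URLs into lists of at most batch_size entries.

    Each list stands in for one workunit; serializing it into the real
    workunit format is left out because that format is not specified here.
    """
    batch: List[str] = []
    for url in clean_urls(lines):
        batch.append(url)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly short, batch.
        yield batch
```

Calling make_batches(open("urls.txt")) would then yield one list per
batch, each ready to be written out in whatever form the workunits take.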
There will be numerous and obvious issues we'll have to sort through
such as: approving the submitted URLs, making sure nobody is abusing
the downloading of the bulk ARCs, prioritization of different sets of
URLs, separating static from dynamic URLs, detecting injections,
stats, and so on. I really think we should focus on doing just the
one thing well first (distributing the work of turning URLs into
ARCs), and deal with the rest of these after that.
To take this first step and fully expose what we currently have as an
Open Loop, I'll soon be posting two things:
- A URL where anyone with a user+pass can PUT a file that contains a
flat list of HTTP URLs (one per line), along with a directory listing
of any files that have been uploaded.
- A URL to a directory structure that will contain one "index" file
per uploaded ARC (which itself contains the URL to the actual
contributed ARC, format TBD but will be minimal to start).
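
For the upload side, a client could be as small as the sketch below.
The upload URL has not been posted yet, so the URL is a parameter, and
HTTP Basic authentication and the text/plain content type are
assumptions about how the user+pass will be checked, not a description
of the real endpoint. The function names are hypothetical too.

```python
import base64
import urllib.request


def build_put_request(url: str, user: str, password: str,
                      url_list: bytes) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying a flat URL list."""
    # Assumption: the server accepts HTTP Basic credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=url_list,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "text/plain",
        },
    )


def upload_url_list(url: str, user: str, password: str, path: str) -> int:
    """PUT the file at `path` to `url` and return the HTTP status code."""
    with open(path, "rb") as fh:
        request = build_put_request(url, user, password, fh.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the request construction separate from the network call makes
it easy to inspect what would be sent before the real URL exists.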
From these two points, *anyone* can write a script to process the
data and discover new URLs, build up statistics, determine
prioritization/spam, look for injections, etc.
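
To make the "discover new URLs" part concrete, here is a minimal
sketch of one such script. It assumes the HTML body of a page has
already been pulled out of an ARC record (reading the ARC and index
formats is not shown, since the index format is still TBD), and the
names LinkCollector and discover_urls are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute HTTP(S) link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def discover_urls(base_url, html, known):
    """Return links found in `html` that are not in `known`, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    seen = set(known)
    fresh = []
    for link in parser.links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh
```

The list it returns is already in the one-URL-per-line shape needed
for upload, which is the whole point of closing the loop this way.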
With this simple input and output forming an open loop, anyone can
help close it and explore all the different potential ways of
improving the process. GrubNG will finally be moving to the next
stage :)
Jer