[Grub-dev] The Open Loop
Balinny
balinny at gmail.com
Wed May 14 13:32:57 UTC 2008
Jeremie Miller wrote:
> So I have a very simple idea that I'd like everyone here and anyone
> interested in a truly open community run web crawler to think about:
> leave it OPEN.
>
> It's very simple, anyone can generate a list of URLs to be crawled and
> upload them. We'll use a simple perl script to convert a list of URLs
> into our current workunits which are then handed out. The resulting
> ARC files will then be posted publicly for anyone to grab and process
> for whatever purpose they want.
>
I'm quite concerned about the quality, as I expect 90% of submitted
sites to be spammy. But it's such a crazy idea that it could work,
just as Wikipedia works.
> There will be numerous and obvious issues we'll have to sort through
> such as: approving the submitted URLs, making sure nobody is abusing
> the downloading of the bulk ARCs, prioritization of different sets of
> URLs, separating static from dynamic URLs, detecting injections,
> stats, and so on. I really think we should focus on doing just the
> one thing well first (distribute turning URLs into ARCs), and deal
> with the rest of these after that.
>
> To take this first step and fully expose what we currently have as an
> Open Loop, I'll be posting soon two things:
> - A URL where anyone w/ a user+pass can PUT a file that contains a
> flat list of HTTP URLs (one per line) and the resulting directory of
> any that have been uploaded.
> - A URL to a directory structure that will contain one "index" file
> per uploaded ARC (which itself contains the URL to the actual
> contributed ARC, format TBD but will be minimal to start).
>
I think it would be beneficial if that index format matched the one
used by the newer workunit output. This is a problematic aspect of the
ARC format: you need the whole content before you can write it, so it
would be better to have an index file that records the position of
each entry in the file.
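
As a rough sketch of the idea (not the actual Grub workunit output or
the real ARC layout), the Python below writes records whose header
line carries the body length, which is why the whole body has to be
available before writing, and keeps a side index of byte offsets so a
single record can be read back without scanning the file. The record
layout and the names write_record, build_index and read_record are
assumptions made up for illustration.

```python
import io


def write_record(out, url, body):
    """Append one record and return its (offset, size) in the stream.

    The header line ends with the body length, so the complete body
    must be in hand before anything is written (simplified layout,
    assumed here for illustration only).
    """
    offset = out.tell()
    header = f"{url} {len(body)}\n".encode("ascii")
    out.write(header)
    out.write(body)
    out.write(b"\n")  # record separator
    return offset, len(header) + len(body) + 1


def build_index(records):
    """Write all (url, body) pairs and return (data, index).

    index is a list of (url, offset, size) tuples, one per record.
    """
    out = io.BytesIO()
    index = []
    for url, body in records:
        offset, size = write_record(out, url, body)
        index.append((url, offset, size))
    return out.getvalue(), index


def read_record(data, offset, size):
    """Read one record back using only its index entry."""
    chunk = data[offset:offset + size]
    header, _, rest = chunk.partition(b"\n")
    url, length = header.decode("ascii").rsplit(" ", 1)
    return url, rest[:int(length)]


if __name__ == "__main__":
    data, index = build_index([
        ("http://a.example/", b"hello"),
        ("http://b.example/", b"line1\nline2"),
    ])
    for url, offset, size in index:
        print(url, offset, size, read_record(data, offset, size)[1])
```

With an index like this, a consumer only has to fetch the index file
first and can then seek straight to the records it cares about.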