[Grub-dev] the second part of the open loop!
Jeremie Miller
jeremie at jabber.org
Wed May 21 00:25:04 UTC 2008
Not only can you upload urls, but probably more importantly, anyone
can grab/peruse the resulting arcs as well!
Top level directory, subdirs broken down by UTC /year/month/day/hour/:
http://soap.grub.org/arcs/
Some example urls:
http://soap.grub.org/arcs/2008/04/21/00/BenediktWildenhain.f1f6166f83d299146353c34b72130fec9148691e.arc.gz.idx
http://soap.grub.org/arcs/2008/04/21/00/BenediktWildenhain.f1f6166f83d299146353c34b72130fec9148691e.arc.gz
http://soap.grub.org/arcs/2008/04/21/00/jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz.idx
http://soap.grub.org/arcs/2008/04/21/00/jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz
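As a rough illustration of the directory layout described above, the
hourly directory for a given moment could be built like this. The
function name is made up for this sketch, and it assumes the
/year/month/day/hour/ layout stays as shown (see the caveats further
down):

```python
from datetime import datetime, timezone

BASE_URL = "http://soap.grub.org/arcs/"


def hour_directory_url(moment, base_url=BASE_URL):
    """Return the hourly arc directory URL for a timezone-aware datetime.

    Hypothetical helper: it only mirrors the UTC /year/month/day/hour/
    layout shown in the example urls.
    """
    if moment.tzinfo is None:
        # A naive datetime would be silently treated as local time,
        # so refuse it instead of guessing.
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    return f"{base_url}{utc:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    print(hour_directory_url(datetime(2008, 4, 21, 0, 30, tzinfo=timezone.utc)))
```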
The .idx is a text file of the format:
URL byteoffset bytelength
Each line refers to the matching *.arc.gz, and gives the exact (gzip
compressed) envelope information for that URL in that arc. The arcs
have been checked and "repacked" to make sure they are consistently
encoded and valid (they could still have issues, no guarantees).
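A minimal sketch of how the .idx could be used to pull a single record
out of an arc. It assumes the three fields are whitespace separated,
that the offset and length are decimal byte counts into the compressed
*.arc.gz, and that the slice they describe decompresses on its own;
none of the function names come from the Grub code. Whether the server
honors HTTP Range requests is also an assumption, so range_header()
only builds the header value:

```python
import gzip


def parse_idx_line(line):
    """Split one .idx line into (url, byteoffset, bytelength)."""
    # Split from the right so only the last two fields are read as numbers.
    url, offset, length = line.strip().rsplit(None, 2)
    return url, int(offset), int(length)


def extract_record(arc_bytes, offset, length):
    """Decompress the slice of a *.arc.gz that one .idx line points at."""
    return gzip.decompress(arc_bytes[offset:offset + length])


def range_header(offset, length):
    """Value for an HTTP Range header covering exactly that slice."""
    # Range end positions are inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


if __name__ == "__main__":
    # Synthetic stand-in for an arc: two gzip members back to back.
    first = gzip.compress(b"first record\n")
    second = gzip.compress(b"second record\n")
    arc = first + second
    idx_line = f"http://example.com/second {len(first)} {len(second)}"

    url, offset, length = parse_idx_line(idx_line)
    print(url, range_header(offset, length))
    print(extract_record(arc, offset, length))
```

With a real arc, the same slice could be fetched by downloading the
whole file, or with a Range request if the server supports one.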
CAVEATS!
- This url structure is likely to change as the way these get used
evolves
- We will not preserve the arcs for all of time, and will likely
expire old arcs at some point, depending on available resources
- These are just the general pool of everything currently assigned by
the workunit dispatch, which may vary in the future
What can/should be done with these? *Anything!* Gathering stats on
who's crawling and how much, extracting more urls to crawl (and upload
via the email I sent previously), exploring the metadata or response
codes, looking for spam, visualizing random web data, etc.
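For the "who's crawling and how much" idea, one possible starting
point is to count .idx lines per crawler. This is only a sketch: it
guesses from the example urls that file names look like
<name>.<40 hex chars>.arc.gz[.idx], which is not a documented format,
and the helper names are invented here:

```python
import re
from collections import Counter

# Guessed from the example urls, not a documented naming scheme.
ARC_NAME = re.compile(r"(?P<client>.+)\.[0-9a-f]{40}\.arc\.gz(?:\.idx)?$")


def client_of(url):
    """Return the crawler name embedded in an arc or idx url, or None."""
    file_name = url.rsplit("/", 1)[-1]
    match = ARC_NAME.match(file_name)
    return match.group("client") if match else None


def urls_per_client(idx_texts):
    """Count non-empty .idx lines per crawler.

    idx_texts maps each .idx url to the text of that file.
    """
    counts = Counter()
    for idx_url, text in idx_texts.items():
        client = client_of(idx_url)
        if client is None:
            continue  # skip names that do not fit the guessed pattern
        counts[client] += sum(1 for line in text.splitlines() if line.strip())
    return counts


if __name__ == "__main__":
    sample = {
        "http://soap.grub.org/arcs/2008/04/21/00/"
        "jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz.idx":
            "http://example.com/a 0 120\nhttp://example.com/b 120 300\n",
    }
    print(urls_per_client(sample))
```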
The arcs are also being experimentally loaded into a nutch index with
the hope/plan to include that in the main search results in the near
future.
Tomorrow I'll try to get all of the code for this checked into svn as
well: the put.cgi, the dispatch, even the arc loader into nutch. A
lot of it is a mess though, so I apologize; it's a work in progress :)
Jer