[Grub-dev] the second part of the open loop!

Jeremie Miller jeremie at jabber.org
Wed May 21 00:25:04 UTC 2008


Not only can you upload URLs but, probably more importantly, anyone
can grab and peruse the output arcs as well!

Top-level directory, with subdirs broken down by UTC time as /year/month/day/hour/:
	http://soap.grub.org/arcs/
Some example urls:
	http://soap.grub.org/arcs/2008/04/21/00/BenediktWildenhain.f1f6166f83d299146353c34b72130fec9148691e.arc.gz.idx
	http://soap.grub.org/arcs/2008/04/21/00/BenediktWildenhain.f1f6166f83d299146353c34b72130fec9148691e.arc.gz
	http://soap.grub.org/arcs/2008/04/21/00/jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz.idx
	http://soap.grub.org/arcs/2008/04/21/00/jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz
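The directory layout above can be sketched in a few lines of Python. This
is only an illustration of the /year/month/day/hour/ scheme as described
here; the function name hour_directory_url is hypothetical, and the layout
itself may change (see the caveats further down).

```python
from datetime import datetime, timezone

# Top-level directory as given in this message.
BASE_URL = "http://soap.grub.org/arcs/"

def hour_directory_url(when: datetime) -> str:
    """Return the arcs directory URL for the UTC hour containing `when`.

    `when` should be timezone-aware; it is converted to UTC first because
    the subdirectories are broken down by UTC time.
    """
    utc = when.astimezone(timezone.utc)
    return f"{BASE_URL}{utc:%Y/%m/%d/%H}/"

# The hour used by the example urls above.
print(hour_directory_url(datetime(2008, 4, 21, 0, 30, tzinfo=timezone.utc)))
```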

The .idx is a text file of the format:
	URL byteoffset bytelength

Each line refers to the matching *.arc.gz and gives the exact location
of the (gzip-compressed) envelope for that URL in that arc.  The arcs
have been checked and "repacked" to make sure they are consistently
encoded and valid (they could still have issues, no guarantees).
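As a rough sketch of how the .idx can be used, the Python below parses
index lines and decompresses a single record out of the arc bytes. It
assumes the three fields are whitespace-separated, that byteoffset counts
from the start of the compressed .arc.gz, and that each record is its own
gzip member; none of that is confirmed beyond the description above. The
names parse_idx and read_record are made up for this example, and the demo
data is synthetic rather than taken from a real arc.

```python
import gzip

def parse_idx(text):
    """Parse .idx text into a list of (url, byte_offset, byte_length)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split from the right so only the last two fields are read as numbers.
        url, offset, length = line.rsplit(None, 2)
        entries.append((url, int(offset), int(length)))
    return entries

def read_record(arc_bytes, offset, length):
    """Decompress the gzip member at arc_bytes[offset:offset + length]."""
    return gzip.decompress(arc_bytes[offset:offset + length])

# Build a tiny synthetic "arc" of two gzip members plus a matching index.
records = [
    ("http://example.com/a", b"first record\n"),
    ("http://example.com/b", b"second record\n"),
]
arc = b""
idx_lines = []
for url, body in records:
    member = gzip.compress(body)
    idx_lines.append(f"{url} {len(arc)} {len(member)}")
    arc += member

entries = parse_idx("\n".join(idx_lines))
url, offset, length = entries[1]
print(url, read_record(arc, offset, length))
```

With a real arc, the same slice could be taken from a downloaded file
instead of the in-memory bytes built here.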

CAVEATS!
	- This url structure is likely to change as the way these get used  
evolves
	- We will not preserve the arcs for all of time, and will likely
expire old arcs at some point, depending on available resources
	- These are just the general pool of everything currently assigned by  
the workunit dispatch, which may vary in the future

What can/should be done with these?  *Anything!*  Gathering stats on
who's crawling and how much, extracting more URLs to crawl (and
uploading them as described in the email I sent previously), exploring
the metadata or response codes, looking for spam, visualizing random
web data, etc.
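For the "who's crawling and how much" idea, one possible starting point
is to count arcs per name prefix. This sketch assumes the file names
follow the pattern seen in the example urls above, i.e. a name, then a
40-character hex string, then .arc.gz, and that the leading name
identifies the crawler; that is inferred from the examples, not a
documented format. The names crawler_of and arcs_per_crawler are
hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the example urls: <name>.<40 hex chars>.arc.gz[.idx]
ARC_NAME = re.compile(
    r"^(?P<crawler>.+)\.(?P<digest>[0-9a-f]{40})\.arc\.gz(?:\.idx)?$"
)

def crawler_of(url):
    """Return the leading name in an arc file name, or None if it does not match."""
    match = ARC_NAME.match(url.rsplit("/", 1)[-1])
    return match.group("crawler") if match else None

def arcs_per_crawler(urls):
    """Count arcs per leading name, skipping .idx files so each arc counts once."""
    counts = Counter()
    for url in urls:
        if url.endswith(".idx"):
            continue
        crawler = crawler_of(url)
        if crawler is not None:
            counts[crawler] += 1
    return counts

example_urls = [
    "http://soap.grub.org/arcs/2008/04/21/00/BenediktWildenhain.f1f6166f83d299146353c34b72130fec9148691e.arc.gz.idx",
    "http://soap.grub.org/arcs/2008/04/21/00/BenediktWildenhain.f1f6166f83d299146353c34b72130fec9148691e.arc.gz",
    "http://soap.grub.org/arcs/2008/04/21/00/jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz.idx",
    "http://soap.grub.org/arcs/2008/04/21/00/jeremie.216969aa0f6d10fbe6bcd1e7ee5fc55d55473f16.arc.gz",
]
print(arcs_per_crawler(example_urls))
```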

The arcs are also being experimentally loaded into a nutch index with  
the hope/plan to include that in the main search results in the near  
future.

Tomorrow I'll try to get all of the code for this checked into svn as
well: the put.cgi, the dispatch, even the arc loader for nutch.  A lot
of it is a mess though, so I apologize; it's a work in progress :)

Jer


