[Grub-dev] do we even really need a native client

jer jeremie at jabber.org
Fri Jan 11 17:37:43 UTC 2008


> I've got to re-read all the stuff about the arc format and  
> incorporate it into the patch. Expect v3 sometime late tomorrow.

Awesome!  I think I'm going to set up a new SVN repo for grubng. I'll  
try to get that done and get you (and anyone else who wants it, just  
email me) commit access today.

> I'm also going to have to re-read the emails Jer just sent to the  
> wikia mailing list to fully digest, but I really look forward to  
> learning more about the Nutch setup wikia is using to gain the full  
> "perspective" on the back-end aspects of wikia search.

Most of what is relevant to grub and the massive "web db" we're  
building is hBase, which is still very young; we'll probably be  
one of the largest projects using it and pushing it forward.

> Per the generated work-units -- Jer: how are you generating them  
> now? I'm assuming this isn't the current "server" but some modified  
> version you have running?

Brain-dead simple right now: a stream of URLs goes into a very small  
hack of a perl script that creates work unit files dynamically, but  
that's just temporary.
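
For anyone who wants to picture it, here is a rough sketch of the same idea in Python rather than perl. It is not the actual script: the work unit file format, the `workunit-NNNNNN.txt` naming, the `unit_size` parameter and both function names are assumptions made up for illustration (one URL per line, a fixed number of URLs per file).

```python
import itertools
import os
from typing import Iterable, Iterator, List


def batch_urls(urls: Iterable[str], unit_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most unit_size URLs from a stream."""
    it = iter(urls)
    while True:
        batch = list(itertools.islice(it, unit_size))
        if not batch:
            return
        yield batch


def write_work_units(urls: Iterable[str], out_dir: str, unit_size: int = 3) -> List[str]:
    """Write one work unit file per batch and return the file paths.

    The file layout (plain text, one URL per line) is an assumption,
    not the real grub work unit format.
    """
    stripped = (u.strip() for u in urls)
    nonblank = (u for u in stripped if u)  # skip empty lines in the stream
    paths = []
    for index, batch in enumerate(batch_urls(nonblank, unit_size)):
        path = os.path.join(out_dir, f"workunit-{index:06d}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Because the input is consumed lazily, the URL stream can be as long as you like (for example `sys.stdin`) without being held in memory.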

> It would be great to learn a bit about your next steps around that.

Once hBase is up and running, a "worker" script will randomly walk  
the entire hBase namespace (with the URL as the primary key) and  
elect URLs to be crawled (based on a number of factors), which then  
get dropped into new workunits.  This process will use ETags  
and If-Modified-Since headers to make it a lot more efficient, and  
will take into account the star ratings for URLs to qualify them  
as more or less important.
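
A minimal sketch of how such a worker might look, with the table modeled as an in-memory dict keyed by URL instead of a real hBase client. Everything named here is hypothetical: the `UrlRecord` fields, the scoring in `should_crawl`, the threshold, and the "random start, then scan in key order" reading of the random walk. Only the `If-None-Match` and `If-Modified-Since` request headers are standard HTTP.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class UrlRecord:
    # Hypothetical row layout; the real table schema is not described here.
    etag: Optional[str] = None           # ETag seen on the last crawl
    last_modified: Optional[str] = None  # Last-Modified value from the last crawl
    stars: int = 0                       # star rating, assumed to be 0..5
    age_days: float = 0.0                # days since the URL was last crawled


def conditional_headers(rec: UrlRecord) -> Dict[str, str]:
    """Build validators so an unchanged page can answer 304 Not Modified."""
    headers = {}
    if rec.etag:
        headers["If-None-Match"] = rec.etag
    if rec.last_modified:
        headers["If-Modified-Since"] = rec.last_modified
    return headers


def should_crawl(rec: UrlRecord, min_score: float = 7.0) -> bool:
    """Assumed scoring: staleness in days plus a bonus for each star."""
    return rec.age_days + 2.0 * rec.stars >= min_score


def elect_urls(
    table: Dict[str, UrlRecord], unit_size: int, rng: random.Random
) -> List[Tuple[str, Dict[str, str]]]:
    """Start at a random key, scan in key order with wrap-around, and
    collect up to unit_size elected URLs with their request headers."""
    keys = sorted(table)
    if not keys:
        return []
    start = rng.randrange(len(keys))
    unit = []
    for offset in range(len(keys)):
        url = keys[(start + offset) % len(keys)]
        rec = table[url]
        if should_crawl(rec):
            unit.append((url, conditional_headers(rec)))
            if len(unit) == unit_size:
                break
    return unit
```

The crawler that picks up the resulting workunit would send the stored headers with each request and treat a 304 response as "nothing to fetch", which is where the efficiency gain comes from.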

Jer


