[Grub-dev] do we even really need a native client
jer
jeremie at jabber.org
Fri Jan 11 17:37:43 UTC 2008
> I've got to re-read all the stuff about the arc format and
> incorporate it into the patch. Expect v3 sometime late tomorrow.
Awesome! I think I'm going to set up a new SVN repo for grubng. I'll
try to get that done and get you (and anyone else who wants it, just
email me) commit access today.
> I'm also going to have to re-read the emails Jer just sent to the
> wikia mailing list to fully digest, but I really look forward to
> learning more about the Nutch setup wikia is using to gain the full
> "perspective" on the back-end aspects of wikia search.
Most of what is relevant to grub and the massive "web db" we're
building is hBase, which is still very young; we'll probably be
one of the largest projects using it and pushing it forward.
> Per the generated work-units -- Jer: how are you generating them
> now? I'm assuming this isn't the current "server" but some modified
> version you have running?
Brain-dead simple right now: a stream of URLs goes into a very small
hack of a perl script that creates work unit files dynamically, but
that's only temporary.
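For anyone who wants to picture it, here is a minimal sketch of the same idea in Python (not the actual perl script). The batch size, the file naming, and the one-URL-per-line layout are all assumptions made up for illustration; the real work unit format may differ.

```python
import itertools
import os


def batch_urls(urls, size):
    """Yield lists of at most `size` URLs from any iterable, skipping blank lines."""
    cleaned = (u.strip() for u in urls)
    cleaned = (u for u in cleaned if u)
    while True:
        batch = list(itertools.islice(cleaned, size))
        if not batch:
            return
        yield batch


def write_work_units(urls, out_dir, size=100):
    """Write each batch to its own file and return the list of paths.

    The "workunit-NNNNNN.txt" name and one-URL-per-line body are
    hypothetical, not the format grub actually uses.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, batch in enumerate(batch_urls(urls, size)):
        path = os.path.join(out_dir, "workunit-%06d.txt" % n)
        with open(path, "w") as fh:
            fh.write("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Something like `write_work_units(sys.stdin, "units")` would then turn a piped URL stream into files as it arrives.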
> It would be great to learn a bit about your next steps around that.
Once hBase is up and running, a "worker" script will randomly walk
the entire hBase namespace (based on URL as the primary key) and
elect URLs to be crawled (based on a number of factors), where
they'll get dropped into new workunits. This process will use etags
and if-modified-since headers to make it a lot more efficient, as
well as take into account the star ratings for URLs to qualify them
as more or less important.
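A rough sketch of those two pieces, in Python and under stated assumptions: the `(url, stars)` row shape, the 0 to 5 star scale, and the probability formula are hypothetical stand-ins for the "number of factors", and nothing here talks to hBase. The only part taken from HTTP itself is that a stored ETag goes back in `If-None-Match` and a stored Last-Modified date in `If-Modified-Since`, so an unchanged page can be answered with 304 Not Modified instead of a full body.

```python
import random


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def elect(rows, rng, base=0.2):
    """Pick URLs to crawl, favouring higher star ratings.

    rows: iterable of (url, stars) pairs, stars assumed to be 0..5.
    rng:  a random.Random instance, passed in so runs can be repeated.
    The weighting below is a made-up placeholder, not the real policy.
    """
    chosen = []
    for url, stars in rows:
        probability = min(1.0, base * (1 + stars))
        if rng.random() < probability:
            chosen.append(url)
    return chosen
```

Each elected URL would then be fetched with `conditional_headers(...)` built from whatever validators were saved on the previous crawl.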
Jer