[Grub-dev] Wondering about Architecture, Bug-base, Discussions, ...

Bartek Jasicki thindil2 at gmail.com
Sun Oct 19 15:54:18 UTC 2008


2008-10-19, 17:34:45
Rainer Blome <rainer.blome at gmx.de> wrote:

> Bartek Jasicki wrote:
> > 2008-10-19, 11:10:21
> > Swaroop C H <swaroopch at gmail.com> wrote:
> >> I just got curious about Wikia Search, and I couldn't find many
> >> details about how the project operates and who the contributors
> >> are, other than http://re.search.wikia.com/about/get_involved.html .
> 
> Swaroop, you've followed the links on that page, I guess?
> 
> There are also some pages in the Search wiki:
> http://search.wikia.com/wiki/Category:Grub lists Grub-specific pages.
> 

That documentation is a little outdated (I moved all Grub docs to the
Grub page - it's simpler to find them there).

> > The information about Grub on this page is a little outdated (I
> > wonder when someone will change it - maybe to a link to the Grub
> > page? ;) )
> 
> For some reason, the main user-level documentation is now maintained
> in the form of HTML files in svn.
> Bartek, don't you have svn checkin access? :-o
> 

I have been nagging Jeremie for months about changing it ;) For now I
have enough work with Grub - and I'm not sure: do all repos (grub,
re.search, kt) use the same access settings?

> >> 1. Is there an architecture about how Grub works?
> 
> There sure is.  You get to guess it ;-).
> Seriously, I am not aware of a good overview of the architecture.
> http://search.wikia.com/wiki/Forum:Source_Code has a section on 
> architecture, but may not answer your question.
> 
> The following is my half-educated take on an overview of the Grub 
> architecture:
> 
> A set of URLs to be crawled is called a workunit.
> Workunit files are prepared and served by a '''Grub workunit
> server'''.
> 
> A '''Grub client''' downloads a workunit file from a workunit server
> via HTTP.
> For each URL in the workunit file, the client tries to retrieve the 
> content from the server given by the URL and stores the content in a 
> result file, called an ARC file.
> Note that the client does not parse the content body, it parses only
> the response header.
> When all URLs of a workunit have been processed, the client uploads
> the resulting ARC file to an '''ARC server'''.
> 
> The ARC files are then indexed by magic (read: I don't know what 
> component does the actual indexing, I guess it's Nutch).
> 
> @all: Please correct the above and then let's add it to the
> documentation.
> 

Only one piece of information is missing: some time ago, when I talked
with Jeremie, he proposed that clients should also fetch URLs found in
crawled pages. Everything else is correct.
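
To make the overview above concrete, here is a minimal Python sketch of
the client side of that workflow. It is only an illustration, not Grub's
actual code: the one-URL-per-line workunit format, the Record layout,
the simplified result format written by write_result_file (which is not
the real ARC format), the skip-on-error behaviour, and all function
names are assumptions. Downloading the workunit and uploading the result
are left out, and the fetch function is passed in so the loop can be
tried without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (status code, response headers, raw body) for one URL; hypothetical shape.
FetchResult = Tuple[int, Dict[str, str], bytes]


@dataclass
class Record:
    """One fetched URL, as it would go into the result file."""
    url: str
    status: int
    headers: Dict[str, str]
    body: bytes  # stored as-is; the client never parses it


def parse_workunit(text: str) -> List[str]:
    """Assumed workunit format: one URL per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def crawl_workunit(urls: List[str],
                   fetch: Callable[[str], FetchResult]) -> List[Record]:
    """Fetch every URL of the workunit and collect the raw responses."""
    records = []
    for url in urls:
        try:
            status, headers, body = fetch(url)
        except OSError:
            # Assumption: an unreachable URL is skipped, not retried.
            continue
        records.append(Record(url, status, headers, body))
    return records


def write_result_file(records: List[Record]) -> bytes:
    """Serialize records into a simplified stand-in for an ARC file."""
    out = bytearray()
    for rec in records:
        # Only the response header is inspected, never the body.
        content_type = rec.headers.get("Content-Type", "unknown")
        # Header line: URL, status, content type, body length in bytes.
        out += f"{rec.url} {rec.status} {content_type} {len(rec.body)}\n".encode()
        out += rec.body + b"\n"
    return bytes(out)


def fake_fetch(url: str) -> FetchResult:
    """Stand-in for a real HTTP GET so the sketch runs offline."""
    if "down" in url:
        raise OSError("connection refused")
    return 200, {"Content-Type": "text/html"}, b"<html>hi</html>"


if __name__ == "__main__":
    workunit = "http://example.org/\n\nhttp://down.example.org/\n"
    records = crawl_workunit(parse_workunit(workunit), fake_fetch)
    print(write_result_file(records).decode())
```

In a real client, fetch would wrap an HTTP library call, and the bytes
returned by write_result_file would be uploaded to the ARC server.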

> > There is simple documentation about the protocol (architecture)
> > that Grub uses:
> > http://grub.org/?q=/node/121
> 
> That page omits some of the basics and goes into more detail.
> 

It is VERY, VERY basic documentation. If any information is missing -
complain to me ;)

> >> 2. And the grubng code repo seems to have c, java, csharp, perl,
> >> etc. directories, are all of these used?
> > At the moment only the C# client is still under development. We
> > keep the rest of the code for historical reasons.
> 
> My understanding was that the C client is in daily use by some (at
> least by its developer :-).
> 

And probably the Perl client too (Jeremie) ;) I explained in an earlier
mail why I put these in the 'historical' bucket. (Additionally, it
brings some life to the mailing list too :D )

Bartek


-- 
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net

