[Grub-dev] Wondering about Architecture, Bug-base, Discussions, ...

Rainer Blome rainer.blome at gmx.de
Sun Oct 19 15:32:46 UTC 2008


Bartek Jasicki wrote:
> 2008-10-19, 11:10:21
> Swaroop C H <swaroopch at gmail.com> wrote:
>> I just got curious about Wikia Search, and I couldn't find many
>> details about how the project operates and who the contributors are,
>> other than http://re.search.wikia.com/about/get_involved.html .

Swaroop, you've followed the links on that page, I guess?

There are also some pages in the Search wiki:
http://search.wikia.com/wiki/Category:Grub lists Grub-specific pages.

> The information about Grub on this page is a little outdated (I wonder
> when someone will change it - maybe to a link to the Grub page? ;) )

For some reason, the main user-level documentation is now maintained
in the form of HTML files in svn.
Bartek, don't you have svn checkin access? :-o

>> 1. Is there an architecture about how Grub works?

There sure is.  You get to guess it ;-).
Seriously, I am not aware of a good overview of the architecture.
http://search.wikia.com/wiki/Forum:Source_Code has a section on 
architecture, but may not answer your question.

The following is my half-educated take on an overview of the Grub 
architecture:

A set of URLs to be crawled is called a workunit.
Workunit files are prepared and served by a '''Grub workunit server'''.

A '''Grub client''' downloads a workunit file from a workunit server via 
HTTP.
For each URL in the workunit file, the client tries to retrieve the 
content from the server given by the URL and stores the content in a 
result file, called an ARC file.
Note that the client does not parse the content body; it parses only the 
response header.
When all URLs of a workunit have been processed, the client uploads the 
resulting ARC file to an '''ARC server'''.
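
The client loop described above can be sketched roughly as follows. This is only an illustration in Python, not code from any Grub client. The workunit format (one URL per line), the function names, and the record layout are all assumptions made for the example; the list of records merely stands in for an ARC file and does not reproduce the real ARC format. The download of the workunit and the upload to the ARC server are left out.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fetcher type: given a URL, return (raw response header, raw body).
# A real client would perform an HTTP request here.
Fetcher = Callable[[str], Tuple[str, bytes]]


def parse_workunit(text: str) -> List[str]:
    """Assumed workunit format: one URL per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_status(header: str) -> int:
    """Read the status code from the first header line, e.g. 'HTTP/1.1 200 OK'."""
    status_line = header.splitlines()[0]
    return int(status_line.split()[1])


def crawl_workunit(workunit_text: str, fetch: Fetcher) -> List[Dict[str, object]]:
    """Fetch every URL of a workunit and collect one record per URL.

    Only the response header is parsed; the body is stored untouched.
    The returned list is a stand-in for the ARC result file.
    """
    records: List[Dict[str, object]] = []
    for url in parse_workunit(workunit_text):
        try:
            header, body = fetch(url)
        except OSError as exc:
            # Keep a record for unreachable URLs as well, so that every
            # URL of the workunit is accounted for in the result.
            records.append({"url": url, "status": None,
                            "error": str(exc), "body": b""})
            continue
        records.append({"url": url, "status": parse_status(header),
                        "error": None, "body": body})
    return records
```

Passing the fetcher in as a parameter keeps the sketch runnable without network access; a test can supply a fake that returns canned headers and bodies.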

The ARC files are then indexed by magic (read: I don't know what 
component does the actual indexing, I guess it's Nutch).

@all: Please correct the above and then let's add it to the documentation.

> There is simple documentation about the protocol (architecture) that
> Grub uses:
> http://grub.org/?q=/node/121

That page omits some of the basics and goes into more detail.

>> 2. And the grubng code repo seems to have c, java, csharp, perl, etc.
>> directories, are all of these used?
> At the moment, only the C# client is still under development. The rest
> of the code is kept for historical reasons.

My understanding was that the C client is in daily use by some (at least 
by its developer :-).

>> 5. How do the servers actually dish out the search
>> results? 

See http://search.wikia.com/wiki/Tech/Open_Index .

>> Is it a pure Nutch index or is there Wikia-specific code on
>> top of it?

I don't know whether the Nutch installation is vanilla or not.
http://search.wikia.com/wiki/Tech leads to pages explaining the 
indexing.  Whether they are up to date, I have no idea.

In addition to the index itself, the Search UI queries a KT server,
which serves the users' contributions, including ranking information.
Search the search-ui mailing list archives for more on this.
More recently, WISE applications have extended the UI further to query 
all kinds of other resources.

These should be enough links to thoroughly confuse you ;-).

Cheers, Rainer

