[Grub-dev] Wondering about Architecture, Bug-base, Discussions, ...
Rainer Blome
rainer.blome at gmx.de
Sun Oct 19 15:32:46 UTC 2008
Bartek Jasicki wrote:
> 2008-10-19, 11:10:21
> Swaroop C H <swaroopch at gmail.com> wrote:
>> I just got curious about Wikia Search, and I couldn't find many
>> details about how the project operates and who the contributors are,
>> other than http://re.search.wikia.com/about/get_involved.html .
Swaroop, you've followed the links on that page, I guess?
There are also some pages in the Search wiki:
http://search.wikia.com/wiki/Category:Grub lists Grub-specific pages.
> Information about Grub on this page is a little outdated (I wonder when
> someone will change it - maybe to a link to the Grub page? ;) )
For some reason, the main user-level documentation is now maintained
in the form of HTML files in svn.
Bartek, don't you have svn checkin access? :-o
>> 1. Is there an architecture about how Grub works?
There sure is. You get to guess it ;-).
Seriously, I am not aware of a good overview of the architecture.
http://search.wikia.com/wiki/Forum:Source_Code has a section on
architecture, but may not answer your question.
The following is my half-educated take on an overview of the Grub
architecture:
A set of URLs to be crawled is called a workunit.
Workunit files are prepared and served by a '''Grub workunit server'''.
A '''Grub client''' downloads a workunit file from a workunit server via
HTTP.
For each URL in the workunit file, the client tries to retrieve the
content from the server given by the URL and stores the content in a
result file, called an ARC file.
Note that the client does not parse the content body; it parses only the
response header.
When all URLs of a workunit have been processed, the client uploads the
resulting ARC file to an '''ARC server'''.
The ARC files are then indexed by magic (read: I don't know what
component does the actual indexing; I guess it's Nutch).
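As a rough illustration of the client loop described above, here is a
minimal Python sketch. Everything in it is an assumption made for
illustration: the one-URL-per-line workunit format, the names
parse_workunit, crawl_workunit and Fetcher, and the record layout, which
is a simplified stand-in and not the real ARC format. The HTTP download
of the workunit and the upload of the result are left out.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical fetcher type: returns (status, body) or raises OSError.
Fetcher = Callable[[str], Tuple[str, bytes]]


def parse_workunit(text: str) -> List[str]:
    """Assumed format: one URL per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def crawl_workunit(urls: Iterable[str], fetch: Fetcher) -> bytes:
    """Fetch each URL and append one simplified record to a result blob.

    The record layout is illustrative only, NOT the real ARC format.
    The body is stored as opaque bytes and is never parsed.
    """
    out = bytearray()
    for url in urls:
        try:
            status, body = fetch(url)
        except OSError:
            continue  # the client only "tries" to retrieve; skip failures
        # One header line per record: URL, status, body length in bytes.
        out += f"{url} {status} {len(body)}\n".encode("utf-8")
        out += body + b"\n"
    return bytes(out)
```

Passing the fetcher in as a parameter keeps the sketch independent of
any particular HTTP library, so the loop can be exercised with a fake
fetcher instead of real network access.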
@all: Please correct the above and then let's add it to the documentation.
> There is simple documentation about the protocol (architecture) which
> Grub uses:
> http://grub.org/?q=/node/121
That page omits some of the basics and goes into more detail.
>> 2. And the grubng code repo seems to have c, java, csharp, perl, etc.
>> directories, are all of these used?
> At the moment only the C# client is still under development. We keep
> the rest of the code for historical reasons.
My understanding was that the C client is in daily use by some (at least
by its developer :-).
>> 5. How do the servers actually dish out the search
>> results?
See http://search.wikia.com/wiki/Tech/Open_Index .
>> Is it a pure Nutch index or is there Wikia-specific code on
>> top of it?
I don't know whether the Nutch installation is vanilla or not.
http://search.wikia.com/wiki/Tech leads to pages explaining the
indexing. Whether they are up to date, I have no idea.
In addition to the index itself, the Search UI queries a KT server,
which serves the users' contributions, including ranking information.
Search the search-ui mailing list archives for more on this.
More recently, WISE applications have further extended the UI to query
all kinds of other resources.
These should be enough links to thoroughly confuse you ;-).
Cheers, Rainer