[Search-l] Using Grid Computing for Wikia Project

Simon Capstick simon at pscomputer.co.uk
Fri May 11 12:18:42 UTC 2007


Seth Finkelstein wrote:
> On Fri, May 11, 2007 at 09:00:11AM +0800, Grahame Gould wrote:
>> Seth,
>>
>> As I understand it, the whole point of distributed computing is not to
>> tie up your ISP but to use your spare CPU power.  You download a
>> project, your computer works on it and sends back results.  All the
>> projects I've seen wouldn't use more of your internet connection than
>> having your mail program running.
> 
> 	Right, because those are projects for *CPU*. I'm saying the
> major resource bottleneck right now for search project experiments is
> not *CPU* so much as *bandwidth*.
> 
>> I'm not sure what is hoped to be accomplished by this Wikia@Home project.
> 
> 	A search engine is roughly made up of crawling, storage,
> indexing, ranking algorithms, and serving results. Of these, I think
> only the last, the serving results, lends itself (relatively) *easily*
> to grid computing. Storage, indexing, and ranking algorithms can
> basically be handled by a single home machine for experimental
> purposes. But the crawling requires a huge amount of bandwidth, and
> doing that in parallel yet coordinating the results is very hard.
> 

When thinking about the bandwidth limitations of distributed search,
perhaps we should break the problem down and consider each piece
individually.  Below are some simple observations; feel free to pull
these points apart :-)  The first two processes seem compatible with a
SETI@home type of distributed client.

1 - Crawling - Users or the software itself could easily throttle the
crawling speed.  For a fixed total crawl rate, the more participants
there are, the less bandwidth each one needs, so this should scale well.

2 - Indexing - Again, each node could index the crawled data at its own
speed if we have enough participants.

3 - Searching - This is a bit more real-time.  Do we collect and 
centralise the chunks of index from users, or do we distribute the search?

4 - Result aggregation - Combining and ranking the search results from
many nodes could require a lot of bandwidth and add noticeable latency
to the user's search results.

5 - Result presentation - Would we make every node capable of displaying
the results to a search user, or would we have centralised result
aggregation and presentation?
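
To make point 1 a little more concrete, here is a rough Python sketch of
how a client might share out and throttle crawl bandwidth.  Everything in
it is an assumption for illustration: the names per_node_bandwidth and
TokenBucket are hypothetical, the crawl is assumed to be split evenly
across nodes, and a token bucket is just one common way to cap an average
rate, not something this project has settled on.

```python
def per_node_bandwidth(total_bytes_per_sec: float, nodes: int) -> float:
    """Even share of the whole crawl's bandwidth for each participating node."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return total_bytes_per_sec / nodes


class TokenBucket:
    """Token-bucket limiter: permits at most `rate` bytes/sec on average,
    with bursts of up to `capacity` bytes."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # bytes added to the bucket per second
        self.capacity = capacity  # maximum bytes the bucket can hold
        self.tokens = capacity    # start with a full bucket
        self.last = now           # time of the last refill

    def try_consume(self, nbytes: float, now: float) -> bool:
        """Return True and spend tokens if `nbytes` may be fetched at time `now`."""
        # Refill in proportion to the time elapsed since the last call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should wait and retry the fetch later
```

With made-up numbers, a crawl needing 10,000,000 bytes/sec in total and
spread over 1,000 nodes would ask 10,000 bytes/sec of each node, and a
user could set the bucket's rate lower still to protect their own
connection.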

At which point in the processes above do we stop being distributed?  My
gut instinct is that from around point 3 or 4 onwards the whole thing
should largely be centralised in a data centre, although I would love
for all of it to be distributed.
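
For points 3 and 4, one possible middle ground is to keep the search
distributed but centralise only the merge.  The Python sketch below
assumes each node returns its own hits as (score, url) pairs already
sorted by descending score, and that scores are comparable across nodes,
which is a big assumption.  The name merge_top_k is hypothetical, and
duplicate URLs are not handled.

```python
import heapq
from itertools import islice


def merge_top_k(node_results, k):
    """Merge per-node hit lists into a single global top-k list.

    Each element of `node_results` is a list of (score, url) tuples
    sorted by descending score.
    """
    # heapq.merge walks the sorted inputs lazily, so only as many hits
    # as are needed for the top k are actually pulled from each list.
    merged = heapq.merge(*node_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))
```

If every node sends back only its own top k hits, the aggregator receives
at most (number of nodes) x k hits per query, so the bandwidth cost grows
with the number of nodes queried rather than with the size of the index.
The latency is still set by the slowest node that has to answer.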

Simon


