[Search-l] Using Grid Computing for Wikia Project

Simon Capstick simon at pscomputer.co.uk
Wed May 9 12:22:58 UTC 2007


Hi,

Pushparajan V wrote:
> Hi all,
> 
>   I am very much interested in contributing to the Wikia project, which 
> is originating out of this mailing list. Recently joined this list and 
> listening to some mails around. As the project seems to be in Planning 
> and resource allocation stage, i guess this is the right time to bring 
> in some technical discussions.. :).
> 

There's no harm in discussing it :-)

>   I got from archives that there are plans for using Nutch and Lucene. 
> Very good idea. But i am just wondering why there are worries from JW on 
> resources...!!.. We can create a distributed and chained resources with 
> the help of Grid computing. Whoever ready to contribute to the project 
> just need to run a client which expands the resources day-by-day.
...

I love the simplicity of distributed search, which should result in low 
hardware and infrastructure costs.  It would be great if every website 
or blog hosted its own part of a world-wide search engine.  That would 
indeed be power to the people.  However, I think the simplicity of truly 
distributed search is a bit of a mirage.  Technically it will have to be 
quite complex to work.

But, simplistically, how many small websites even have the bandwidth to 
service all those search requests that may or may not be relevant to 
their site?

Let's assume a distributed search node has to answer all search requests 
from around the world, even if it's only to say 'I have no matching 
results'.  Conservatively assuming 300 requests per second (which sounds 
a bit too low), around the clock, that's over 770 million requests a 
month.  Multiply that by 500 bytes of data and you get a very 
conservative 390GB or so per month of incoming traffic, or over 1Mb/s of 
bandwidth.  The outgoing traffic with the search results could be even 
higher.
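
A quick way to sanity-check this kind of estimate is a few lines of 
Python.  This is only a sketch: it assumes the 300 requests per second 
and 500 bytes per request quoted above, plus a 30-day month, and the 
function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def node_inbound_load(requests_per_second, bytes_per_request, days=30):
    """Return (requests per month, bytes per month, megabits per second).

    All inputs are assumptions supplied by the caller; 'days' is the
    assumed length of a month.
    """
    requests_per_month = requests_per_second * SECONDS_PER_DAY * days
    bytes_per_month = requests_per_month * bytes_per_request
    # Sustained inbound rate: bytes/s converted to bits/s, then to megabits/s.
    megabits_per_second = requests_per_second * bytes_per_request * 8 / 1_000_000
    return requests_per_month, bytes_per_month, megabits_per_second


requests, total_bytes, mbps = node_inbound_load(300, 500)
print(f"{requests:,} requests/month")
print(f"{total_bytes / 1e9:.1f} GB/month incoming")
print(f"{mbps:.1f} Mb/s sustained")
```

Changing the two inputs shows how quickly the load on a single small 
node grows if the real request rate is higher than assumed.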

This doesn't even cover results aggregation and presentation to the 
search user, or explain how to stop search node admins fiddling their 
results to boost traffic to their sites.

At this point it seems distributed search is just not suited to being 
spread that thinly.  So maybe a federated approach, with large 
organisations and ISPs hosting dedicated search data centres/clusters, 
might be the way to go, each providing search nodes for its respective 
web sites/clients.

The trouble is that the optimum solution might be to have just one data 
centre.  Then you're back to needing the computing resources of Google.

Simon


