[Search-l] Using Grid Computing for Wikia Project
Simon Capstick
simon at pscomputer.co.uk
Wed May 9 12:22:58 UTC 2007
Hi,
Pushparajan V wrote:
> Hi all,
>
> I am very much interested in contributing to the Wikia project, which
> is originating out of this mailing list. Recently joined this list and
> listening to some mails around. As the project seems to be in Planning
> and resource allocation stage, i guess this is the right time to bring
> in some technical discussions.. :).
>
There's no harm in discussing it :-)
> I got from archives that there are plans for using Nutch and Lucene.
> Very good idea. But i am just wondering why there are worries from JW on
> resources...!!.. We can create a distributed and chained resources with
> the help of Grid computing. Whoever ready to contribute to the project
> just need to run a client which expands the resources day-by-day.
...
I love the simplicity of distributed search, which should result in low
hardware and infrastructure costs. It would be great if every website
or blog hosted its own part of a world-wide search engine. That would
indeed be power to the people. However, I think the simplicity of truly
distributed search is a bit of a mirage. Technically it will have to be
quite complex to work.
But, simplistically, how many small websites even have the bandwidth to
service all those search requests that may or may not be relevant to
their site?
Let's assume a distributed search node has to answer all search requests
from around the world, even if it's to say 'I have no matching results'.
Conservatively assuming 300 requests per second (sounds a bit too
low), around the clock, that's over 777 million requests a month.
Multiply that by 500 bytes of data and you get a very conservative
390GB or so per month of incoming traffic, or over 1Mb/s of sustained
bandwidth. The outgoing
traffic with the search results could be even higher.
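A quick way to sanity-check the estimate above is a few lines of
Python. This is only a rough sketch: the function name, the 30-day
month and the decimal units (1GB = 10^9 bytes) are assumptions made for
illustration. The 300 requests per second and 500 bytes per request are
the figures used above; a rate of 3 per second is included purely for
comparison.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def estimate_monthly_load(requests_per_second, bytes_per_request):
    """Return the incoming load on one search node (hypothetical helper)."""
    requests_per_month = requests_per_second * SECONDS_PER_DAY * DAYS_PER_MONTH
    bytes_per_month = requests_per_month * bytes_per_request
    # Sustained bandwidth in bits per second, needed around the clock.
    bits_per_second = requests_per_second * bytes_per_request * 8
    return {
        "requests_per_month": requests_per_month,
        "gigabytes_per_month": bytes_per_month / 1e9,   # decimal GB
        "megabits_per_second": bits_per_second / 1e6,   # decimal Mb/s
    }


if __name__ == "__main__":
    for rate in (3, 300):
        load = estimate_monthly_load(rate, bytes_per_request=500)
        print(
            f"{rate:>3} req/s: "
            f"{load['requests_per_month']:,} requests/month, "
            f"{load['gigabytes_per_month']:.1f} GB/month, "
            f"{load['megabits_per_second']:.3f} Mb/s"
        )
```

The point of keeping the rate as a parameter is that the monthly volume
and the sustained bandwidth both scale linearly with it, so any
estimate of world-wide query traffic can be plugged in directly.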
This doesn't even cover results aggregation and presentation to the
search user, or explain how to stop search node admins fiddling their
results to boost traffic to their sites.
At this point it seems distributed search is just not suited to be
spread that thinly. So maybe a federated approach, with large
organisations and ISPs hosting dedicated search data centres/clusters,
might be the way to go, each providing search nodes for their
respective web sites/clients.
The trouble is that the optimum solution might be to have just one data
centre. Then you're back to needing the computing resources of Google.
Simon