[Search-l] Using Grid Computing for Wikia Project

Pushparajan V vprajan at gmail.com
Thu May 10 13:34:53 UTC 2007


Happy to see a reply, even if it came after 5 days.

On 5/9/07, Simon Capstick <simon at pscomputer.co.uk > wrote:
>
> Hi,
>
> Pushparajan V wrote:
> > Hi all,
> >
> >   I am very much interested in contributing to the Wikia project, which
> > is originating out of this mailing list. Recently joined this list and
> > listening to some mails around. As the project seems to be in Planning
> > and resource allocation stage, i guess this is the right time to bring
> > in some technical discussions.. :).
> >
>
> There's no harm in discussing it :-)
>
> >   I got from archives that there are plans for using Nutch and Lucene.
> > Very good idea. But i am just wondering why there are worries from JW on
> > resources...!!.. We can create a distributed and chained resources with
> > the help of Grid computing. Whoever ready to contribute to the project
> > just need to run a client which expands the resources day-by-day.
> ...
>
> I love the simplicity of distributed search, which should result in low
> hardware and infrastructure costs.  It would be great if every website
> or blog hosted its own part of a world-wide search engine.  That would
> indeed be power to the people.  However I think the simplicity of truly
> distributed search is a bit of a mirage.  Technically it will have to be
> quite complex to work.


It is really not that complex. Because it involves a lot of new terminology,
many people find it complex. But if you look at BOINC and SETI@home, it has
been made very simple. Many social projects, such as Folding@home and
AIDS@home, get help from grid developers. Wikia search is also a social
project.

Why don't we have a Wikia@Home project?

List of distributed projects:
http://www.distributedcomputing.info/projects.html

> But, simplistically, how many small websites even have the bandwidth to
> service all those search requests that may or may not be relevant to
> their site?


This is exactly why I am talking about a distributed approach to indexed
search. Not everyone wants the whole world's load of information that is
available on the internet; many only want a regional search option. Most
people want to know things like: what is the stock price in the local
market? Where can I get books online locally? For US/UK people this is very
easy, since most sites are US/UK based. Think about other regions. Sites in
developing countries are growing at a faster pace than sites in developed
countries.

My suggestion is that distributed servers, stripped down region by region,
will make for a more useful search than anything else. How to do it is the
next problem we need to sit down and solve.

> Lets assume a distributed search node has to answer all search requests
> from around the world, even if it's to say 'I have no matching results'.
>   Conservatively assuming 300 requests per second (sounds a bit too
> low), around the clock, that's over 8 million requests a month.
> Multiply that with 500 bytes of data and you get a very conservative 4GB
> per month of incoming traffic, or over 1Mb/s bandwidth.  The outgoing
> traffic with the search results could be even higher.
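
As a sanity check on the figures quoted above, here is a short Python sketch.
It assumes a 30-day month and decimal units (1 GB = 10^9 bytes, 1 Mb = 10^6
bits); `monthly_load` is an illustrative helper, not something from this
thread. Under those assumptions, 300 requests per second works out to roughly
778 million requests and about 389 GB of incoming traffic per month, while
the figure of just over 1 Mb/s holds, so the quoted monthly totals appear to
be on the low side.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_load(requests_per_second, bytes_per_request, days=30):
    """Return (requests per month, bytes per month, sustained megabits per second)."""
    requests = requests_per_second * SECONDS_PER_DAY * days
    total_bytes = requests * bytes_per_request
    # Sustained rate: bytes per second converted to bits, then to megabits.
    megabits_per_second = requests_per_second * bytes_per_request * 8 / 1e6
    return requests, total_bytes, megabits_per_second


# Figures from the quoted estimate: 300 requests/s at 500 bytes each.
requests, total_bytes, mbps = monthly_load(300, 500)
print(f"{requests:,} requests per month")
print(f"{total_bytes / 1e9:.1f} GB per month")
print(f"{mbps:.1f} Mb/s sustained")
```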


No, distributed search never means that search requests go around the
world. They must and will go to the local servers rather than to the whole
world. People mostly just need the most-used local sites to be listed. If
they want global sites, it will take them some more time, but if those are
cached on a nearby server, that will not be a problem either.
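
A minimal Python sketch of the routing idea described above, under stated
assumptions: each regional node answers from its own index first, then from
a cache of earlier global results, and only then asks a remote source.
`RegionalNode`, `search` and `global_lookup` are hypothetical names used for
illustration only; they are not part of Nutch, Lucene or any Wikia code.

```python
class RegionalNode:
    """Hypothetical regional search node: local index, then cache, then global."""

    def __init__(self, region, local_index):
        self.region = region
        self.local_index = local_index  # maps a query term to a list of URLs
        self.cache = {}                 # global results already fetched by this node

    def search(self, term, global_lookup):
        # 1. Answer from the regional index when possible; nothing leaves the region.
        if term in self.local_index:
            return self.local_index[term], "local"
        # 2. Reuse a global result that an earlier query already pulled in.
        if term in self.cache:
            return self.cache[term], "cache"
        # 3. Slow path: ask the remote source once and remember the answer.
        results = global_lookup(term)
        self.cache[term] = results
        return results, "global"


# Illustrative data only; the URLs are placeholders.
node = RegionalNode("in", {"books": ["http://books.example.in"]})
remote = {"physics": ["http://physics.example.org"]}

print(node.search("books", remote.get))    # served from the regional index
print(node.search("physics", remote.get))  # first request goes to the remote source
print(node.search("physics", remote.get))  # repeat request is served from the cache
```

In this sketch only the first request for a non-local term leaves the
region; how real nodes would share or expire cached results is left open.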

> This doesn't even cover results aggregation and presentation to the
> search user, or explain how to stop search node admins fiddling their
> results to boost traffic to their sites.


A powerful admin can fiddle with any search engine, even the currently
popular SEs like Google, Live, Ask etc. It is the algorithm that matters,
which comes down to how well we design it. We can make a distributed
solution more secure than a centralized one. The problem is that people
think it is more complex, but it is a permanent and forward-looking solution.


> At this point it seems distributed search is just not suited to be
> spread that thinly.  So maybe a federated approach with large
> organisations and ISPs hosting dedicated search data centres/clusters
> might be the way to go.  With each providing search nodes for their
> respective web sites/clients.


If fiddling were such a big issue, the whole of Wikipedia could not have
been built efficiently. We can have moderators/contributors in each region
to take care of it.

Investing in data centers requires a return on investment for the investors.
I don't know how this finance cycle works, and I don't know how the Wikipedia
servers get their funding.. :). But I think of this Wikia search as an open
source search engine for the people, by the people and to the people.


> The trouble is that the optimum solution might be to have just one data
> centre.  Then you're back to needing the computing resources of Google.


Yeah, correct. We can have one data centre as a backup and as central
control. But we must not depend on investment for all resources.

I don't think the higher authorities and owners of the Wikia project are
really hearing our shouting here..

> Simon
>

Thanks

-- 
Pushparajan V
http://www.vprajan.org
- - - - - - - -
Know me: http://www.hackerkey.com/decrypt.php?hackerkey=v4sw57BCHJUY$hw3/5ln2pr6AFOPSck3ma4u7FLMSw7DTWXm6l6FGIKLRSU$i862NLJ0CAe6$t3b4en4a23Ns3MSr9g5AGO

- - - - - - - -