[Search-l] Using Grid Computing for Wikia Project

Michael Christen mc at yacy.net
Fri May 11 08:27:49 UTC 2007


> I'm saying the
> major resource bottleneck right now for search project experiments is
> not *CPU* so much as *bandwidth*.
>

Over the last few years the YaCy project has run into many bottlenecks,
but the major one we currently see is (you will be surprised):
IO load.
It turned out that indexing is a heavy database application, and we want
people running YaCy to be able to _work_ on their computer while the
indexer is running. Therefore we slow down indexing a bit.
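
A minimal sketch of such a slowdown, in Python. This is not YaCy's
actual code: the function name, the `duty_cycle` parameter and the
injected `clock`/`sleep` callables are assumptions made for
illustration. After each document the indexer pauses long enough that
it is busy for at most the given fraction of the elapsed time, which
leaves disk and CPU free for the user.

```python
import time
from typing import Callable, Iterable, List


def throttled_index(
    docs: Iterable[str],
    index_one: Callable[[str], None],
    duty_cycle: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Index documents one by one and pause after each of them.

    The pause is chosen so that busy / (busy + pause) == duty_cycle.
    Returns the list of pauses (in seconds) that were requested.
    """
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    pauses = []
    for doc in docs:
        start = clock()
        index_one(doc)  # hypothetical indexing step (the IO-heavy part)
        busy = clock() - start
        pause = busy * (1.0 - duty_cycle) / duty_cycle
        if pause > 0:
            sleep(pause)
        pauses.append(pause)
    return pauses
```

With `duty_cycle=0.5` a document that took 0.2 s to index is followed
by a 0.2 s pause; with `duty_cycle=1.0` the indexer runs unthrottled.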



> A search engine is roughly made up of crawling, storage,
> indexing, ranking algorithms, and serving results. Of these, I think
> only the last, the serving results, lends itself (relatively) *easily*
> to grid computing.
>

Index chunks must be distributed (by DHT position) to other grid nodes
before 'serving results' can take place. That is another network task,
but an underestimated one: it is again much more of a db task, and it
costs IO load.
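
To illustrate what assigning index chunks to DHT positions can look
like, here is a small sketch in Python. It uses generic consistent
hashing on a 32-bit ring and is an assumption for illustration, not
YaCy's actual distribution scheme; the function names and the peer
names are made up.

```python
import hashlib
from bisect import bisect_left
from typing import Dict, List


def ring_position(key: str) -> int:
    """Map a word or a peer name to a position on a 32-bit ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def responsible_peer(word: str, peers: List[str]) -> str:
    """Return the peer whose ring position follows the word's position."""
    if not peers:
        raise ValueError("at least one peer is required")
    ring = sorted((ring_position(p), p) for p in peers)
    positions = [pos for pos, _ in ring]
    # Wrap around to the first peer when the word lies past the last one.
    i = bisect_left(positions, ring_position(word)) % len(ring)
    return ring[i][1]


def partition_index(
    index: Dict[str, List[str]], peers: List[str]
) -> Dict[str, Dict[str, List[str]]]:
    """Split a word -> URLs index into one chunk per responsible peer."""
    chunks: Dict[str, Dict[str, List[str]]] = {p: {} for p in peers}
    for word, urls in index.items():
        chunks[responsible_peer(word, peers)][word] = urls
    return chunks
```

Computing the target peer is cheap. The expensive part is what the
sketch leaves out: reading each chunk from the local index database
and writing it into the receiving peer's database.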



> Storage, indexing, and ranking algorithms can
> basically be handled by a single home machine for experimental
> purposes. But the crawling requires a huge amount of bandwidth,
>

There is enough bandwidth for every home user, no problem. You need
only a fraction of what you use for file sharing with other
software.



> and doing that in parallel yet coordinating the results is very hard.
>

This is in fact easy. If you restrict the coordination of crawling to
a specific subset (the leaves of the crawl tree), it is just no problem.
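
One way to read this, sketched in Python under assumptions: a peer
crawls the upper levels of the tree itself and only hands the URLs at
the maximum depth (the leaves) to other peers, so nothing above the
leaves needs coordination. The function name and the `get_links`
callback are hypothetical, and this is not taken from the YaCy source.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def crawl_with_leaf_delegation(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
) -> Tuple[List[str], List[str]]:
    """Breadth-first crawl that fetches inner nodes locally.

    URLs found at exactly `max_depth` are not fetched here; they are
    returned as the list to hand over to other peers.
    """
    seen = {start_url}
    local: List[str] = []
    delegated: List[str] = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            delegated.append(url)  # leaf: another peer fetches it
            continue
        local.append(url)  # inner node: fetched by this peer
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return local, delegated
```

Because leaves are never expanded further, a remote peer only has to
fetch and index the page it was given and does not need to report new
links back.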

Greetings,
Michael
yacy.net



