[Search-l] What Is Wikia and How Real Is It?

John McCormac jmcc at hackwatch.com
Mon Aug 6 13:02:14 UTC 2007


jer wrote:
> John, I can understand your pessimism when looking at what we're doing
> as trying to be a "Google Killer" but we're on different wavelengths.

Well some people seem to think that I am on a different planet entirely. :)

The venture is being portrayed as a Google Killer in the media coverage 
and spin. The problem is that there is no actual basis for such a claim 
other than that it gives the media a nice soundbite and keeps the 
investors happy.

> We are not building yet-another-search-engine, we are putting our
> efforts into making building ANY search engine easier, better tools,
> better methods, more shared systems, etc. This isn't one project, it's
> tens or even hundreds of them, and likely to take years.

So if I read this right, there is no search engine?

It is just an idea for a platform that is scalable and can be used for 
search engine development? But without knowing the processing 
requirements, the storage requirements and the bandwidth requirements, 
it is difficult to design such a platform.

> Clearly you're neck-deep in search development yourself, so you would
> have a great opinion on what kinds of tools and resources would make
> your life easier, do you have any suggestions?

Ideally, the best resource would be more time. By comparison everything 
else pales. The holy trinity of search is bandwidth, hardware and 
software.

The bandwidth required to spider tens of millions of websites on an 
ongoing basis is considerable. Therefore such a venture would need a lot 
of available bandwidth.
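
To put rough numbers on that, here is a back-of-envelope sketch. Every
figure in it (pages per site, average page size, recrawl interval) is an
assumption chosen for illustration, not a measurement, and the function
name is hypothetical. It only estimates the sustained download rate needed
to refetch everything once per interval, and ignores headers, retries,
robots.txt fetches and DNS traffic.

```python
SECONDS_PER_DAY = 86_400


def sustained_mbps(sites, pages_per_site, avg_page_bytes, recrawl_days):
    """Estimate the average download rate, in megabits per second, needed
    to fetch every page of every site once per recrawl interval."""
    total_bytes = sites * pages_per_site * avg_page_bytes
    interval_seconds = recrawl_days * SECONDS_PER_DAY
    bits_per_second = total_bytes * 8 / interval_seconds
    return bits_per_second / 1_000_000


if __name__ == "__main__":
    # Assumed inputs: 10 million sites, 10 pages each, 25 KB per page,
    # everything refetched once every 30 days.
    rate = sustained_mbps(10_000_000, 10, 25_000, 30)
    print(f"Sustained rate needed: {rate:.2f} Mbit/s")
```

The estimate scales linearly with each input, so ten times as many pages
per site or a recrawl every three days instead of thirty each multiply the
required rate by ten.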

The hardware is also a very significant requirement. It would need a lot 
of servers to do a proper crawl of the web. It would also require a 
backend to process the resulting data into something usable. And a 
search interface would be required.

The software aspect is perhaps somewhat easier as the task can be 
clearly defined. It has to be scalable, fast and provide good results. 
However that is a massive simplification. There are some good Open 
Source products out there that do the job well. Nutch is one of the most 
popular products in this respect. It also has the elements of 
scalability required for large indices. And the tools to work on the 
resultant data are well developed and supported. Most of the work will 
be on the resulting data.

The search index is the hard part. It takes a long time to develop a 
good, clean index. The Infinite Monkeys approach to building an index 
(following links and hoping that they will lead to new pages) is not the 
most efficient method of building an index quickly when any of the prior 
requirements are absent or deficient.
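
The link-following approach can be sketched as a breadth-first walk over
a link graph. This is a toy illustration only: the graph is a small
in-memory dictionary rather than real fetched pages, and the function and
variable names are hypothetical. It shows two limits of the method: a page
that nothing links to is never discovered, and the page budget (standing
in for limited bandwidth and hardware) cuts the walk short.

```python
from collections import deque


def discover(link_graph, seeds, max_pages):
    """Follow links breadth-first from the seed pages and return the
    pages visited, in order, stopping once max_pages have been visited."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    frontier = deque(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        page = frontier.popleft()
        visited.append(page)
        # Queue every outgoing link that has not been seen before.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


if __name__ == "__main__":
    # Toy graph: page "e" links out, but no page links to it.
    graph = {
        "a": ["b", "c"],
        "b": ["c", "d"],
        "c": [],
        "d": ["a"],
        "e": ["a"],
    }
    print(discover(graph, ["a"], max_pages=10))
    print(discover(graph, ["a"], max_pages=2))
```

Starting from "a", the walk never reaches "e" however large the budget
is, because discovery depends entirely on inbound links.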

A good index makes the difference between a great search engine and a 
spam-infested pile of junk. I'm not convinced that the Wikia people 
quite appreciate the level of work that goes into that aspect of 
developing a search engine. Crawling a clearly defined index such as 
that of Wikipedia or some other silo site is easy. However, crawling the 
web is like trying to take a slice of a swirling nebula.

It isn't really a question of what we want. It is more a question of 
what the Wikia project can provide to make the task of developing a 
search engine easier. Developing a viable search index is the hardest 
task of all - the other elements (the hardware, the bandwidth and the 
software) can be acquired to some extent.

So what exactly can Wikia offer? Bandwidth? Hardware? Expertise? Can you 
give us some descriptions and specifications of the resources and 
expertise that are available to search engine developers? For most of us, 
we have to deal with the realities imposed by hardware and bandwidth 
limitations. We don't have the luxury of just theorising - everything we 
do is geared towards survival in a highly competitive market. Perhaps we 
SE people really are on a different wavelength to the Wikia people.

Perhaps the question foremost in the minds of many of the SE people on 
this list is this: why should we provide the search expertise? Or, to 
put it less diplomatically, why should we make you rich?

Regards...jmcc
-- 
******************************************************
John McCormac  *  e-mail: jmcc at whoisireland.com
MC2            *  voice:  +353-51-873640
22 Viewmount   *  web:  http://www.whoisireland.com/
Waterford      *  blog: http://blog.whoisireland.com
Ireland        *  Irish Domain Stats & Market Research
******************************************************


