[Search-l] What Is Wikia and How Real Is It?

peter burden peter.burden at gmail.com
Tue Aug 7 00:17:51 UTC 2007


John McCormac wrote:
> jer wrote:
>   
>> John, I can understand your pessimism when looking at what we're doing 
>> as trying to be a "Google Killer" but we're on different wavelengths.
>>     
>
> Well some people seem to think that I am on a different planet entirely. :)
>
> The venture is being portrayed as a Google Killer in the media coverage 
> and spin. The problem is that there is no actual basis for such a claim 
> other than it gives the media a nice soundbite and keeps the investors 
> happy.
>
>   
>> We are not building yet-another-search-engine, we are putting our 
>> efforts into making building ANY search engine easier, better tools, 
>> better methods, more shared systems, etc. This isn't one project, it's 
>> tens or even hundreds of them, and likely to take years.
>>     
>
> So if I read this right, there is no search engine?
>
> It is just an idea for a platform that is scalable and can be used for 
> search engine development? But without knowing the processing 
> requirements, the storage requirements and the bandwidth requirements, 
> it is difficult to design such a platform.
>
>   
>> Clearly you're neck-deep in search development yourself, so you would 
>> have a great opinion on what kinds of tools and resources would make 
>> your life easier, do you have any suggestions?
>>     
>
> Ideally, the best resource would be more time. By comparison everything 
> else pales. The holy trinity of search is bandwidth, hardware and 
> software.
>
> The bandwidth required to spider tens of millions of websites on an 
> ongoing basis is considerable. Therefore such a venture would need a lot 
> of available bandwidth.
>
>   
Well the sums aren't that difficult. Assuming 10,000,000,000 pages total,
an average size of 10 kBytes and an average "half-life" of 6 months, you
need about 50 Mbit/s of sustained bandwidth (roughly 100 TBytes re-fetched
every 6 months). The size and "half-life" figures were derived from
research I did some 3-4 years ago. The increasing number of Web 2.0 sites
infested with Ajax etc. makes it harder, but it still looks doable with
a particularly well-endowed personal set-up.

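
For anyone who wants to check the sum, here is a minimal sketch in Python.
The page count, page size and 6-month period are the figures above; treating
the "half-life" as "re-fetch every page once per 6 months" and taking
10 kBytes as 10,000 bytes are my assumptions for the sketch, and the
function name is made up for illustration.

```python
# Back-of-envelope crawl bandwidth, using the figures quoted in the post.
PAGES = 10_000_000_000   # total pages
PAGE_BYTES = 10_000      # average page size (10 kBytes taken as 10,000 bytes)
REVISIT_DAYS = 365 / 2   # assumption: each page is re-fetched once per 6 months


def crawl_bandwidth_mbit_per_s(pages, page_bytes, revisit_days):
    """Sustained download rate needed to re-fetch every page once per period."""
    total_bits = pages * page_bytes * 8
    seconds = revisit_days * 24 * 60 * 60
    return total_bits / seconds / 1_000_000


mbps = crawl_bandwidth_mbit_per_s(PAGES, PAGE_BYTES, REVISIT_DAYS)
print(f"{mbps:.1f} Mbit/s sustained")
```

This counts page bodies only. Protocol overhead, robots.txt fetches, retries
and pages that change faster than the average would all push the figure up.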
> The hardware is also a very significant requirement. It would need a lot 
> of servers to do a proper crawl of the web. It would also require a 
> backend to process the resulting data into something usable. And a 
> search interface would be required.
>
> The software aspect is perhaps somewhat easier as the task can be 
> clearly defined. It has to be scalable, fast and provide good results. 
> However that is a massive simplification. There are some good Open 
> Source products out there that do the job well. Nutch is one of the most 
> popular products in this respect. It also has the elements of 
> scalability required for large indices. And the tools to work on the 
> resultant data are well developed and supported. Most of the work will 
> be on the resulting data.
>
> The search index is the hard part. It takes a long time to develop a 
> good, clean index. The Infinite Monkeys approach to building an index 
> (following links and hoping that they will lead to new pages) is not the 
> most efficient method of building an index quickly when any of the prior 
> requirements are absent or deficient.
>   
Is the "search index" here just a list of sites/URLs or all the parsed
content and metadata? Is the concern those numerous but small sites that
aren't linked from anywhere else? What about persuading the domain
registrars to let us do periodic DNS zone transfers? Then you get a list
of all of the sites, of course you've still got to sort the wheat from
the chaff. The current debates on this list don't seem to suggest any
real answer to this problem other than getting "the community" to look
at each one. [Assuming the aforementioned 10,000,000,000 pages to be
checked every 6 months and assuming it takes a human 30 seconds to check
and that that human will be able to devote 2 hours a day to the task -
that's a community of about 250,000 people.]
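
The same kind of sum for the human-review estimate, as a rough Python sketch.
The inputs are the ones in the bracketed note; the 6-month period is taken as
half of a 365-day year, and the function name is hypothetical.

```python
# Rough size of the volunteer community needed to hand-check every page.
PAGES = 10_000_000_000   # pages to check per cycle
SECONDS_PER_CHECK = 30   # time for one human to check one page
HOURS_PER_DAY = 2        # time each volunteer gives per day
CYCLE_DAYS = 365 / 2     # every page is checked once per 6 months


def reviewers_needed(pages, seconds_per_check, hours_per_day, cycle_days):
    """People required so that all pages are checked once per cycle."""
    work_seconds = pages * seconds_per_check
    seconds_per_person = hours_per_day * 3600 * cycle_days
    return work_seconds / seconds_per_person


reviewers = reviewers_needed(PAGES, SECONDS_PER_CHECK, HOURS_PER_DAY, CYCLE_DAYS)
print(f"about {reviewers:,.0f} people")
```

With these inputs the result lands a little under the 250,000 quoted, which
is the same order of magnitude and assumes no days off and no pages checked
twice.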

What suggestions and ideas are there for telling the wheat from the chaff?
> A good index makes the difference between a great search engine and a 
> spam infested pile of junk. I'm not convinced that the Wikia people 
> quite appreciate the level of work that goes into that aspect of 
> developing a search engine. Crawling a clearly defined index such as 
> that of Wikipedia or some other silo site is easy. However crawling the 
> web is like trying to take a slice of a swirling nebula.
>   

> It isn't really a question of what we want. It is more a question of 
> what the Wikia project can provide to make the task of developing a 
> search engine easier. Developing a viable search index is the hardest 
> task of all - the other elements (the hardware, the bandwidth and the 
> software) can be acquired to some extent.
>
> So what exactly can Wikia offer? Bandwidth? Hardware? Expertise? Can you 
> give us some descriptions and specifications of the resources and 
> expertise that are available to search engine developers? For most of us, 
> we have to deal with the realities imposed by hardware and bandwidth 
> limitations. We don't have the luxury of just theorising - everything we 
> do is geared towards survival in a highly competitive market. Perhaps we 
> SE people really are on a different wavelength to the Wikia people.
>
> Perhaps the question foremost in the minds of many of the SE people on 
> this list is this: why should we provide the search expertise? Or, to 
> put it less diplomatically, why should we make you rich?
>
> Regards...jmcc
>   




