[Search-l] Grub Update

John McCormac jmcc at hackwatch.com
Thu Aug 2 21:18:05 UTC 2007


jer wrote:
> I've been through the same process and very much want to point out  
> something: It's *both* high automation *and* human oversight/ tweaking.  
> Exactly what you did, and I did, and everyone who's done  any amount of 
> serious crawling has to do, is add a crap-ton of human  intelligence to 
> the massive automation process, with constant  feedback as things jut 
> out here and there.

Yes Jer, but you don't know what I've done and vice versa. So I don't 
know if we have been through quite the same process. :)

The thing about search on this or any other significant scale is that it 
requires a completely different mindset to that required for building a 
web directory or wiki where each entry can be individually validated. I 
don't think that adding a 'crap-ton' of human intelligence to the 
process is an accurate description of what happens.

The indications that mark a site for deletion tend to be clear and it is
the speed at which this happens that is important. Sometimes, this has
to be applied to every website on an IP or even on the same DNS. It is a
very anti-democratic process. Some are easy wins: linkswamps that can
be identified by a DNS or IP, PPC pages that can be identified from a
particular string, and duplicate content pages that all have the same
MD5 hash. The hard part is when it goes beyond the easy wins to the
stuff that requires a human decision.
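
As a rough illustration of the easy wins only, here is a minimal Python
sketch. It is not anyone's production filter. The blocklists, the marker
string, the page fields (ip, nameserver, html) and the function name
easy_win_reason are all hypothetical; real values would come from what
the crawl operator has actually observed.

```python
import hashlib

# Hypothetical blocklists and marker strings (assumptions for illustration).
BAD_IPS = {"192.0.2.10"}
BAD_NAMESERVERS = {"ns1.linkswamp.example"}
PPC_MARKERS = ("sponsored listings for",)


def easy_win_reason(page, seen_hashes):
    """Return why a page is an easy deletion, or None if it is not one.

    `page` is a dict with the assumed keys "ip", "nameserver" and "html".
    `seen_hashes` is a set of MD5 digests of page bodies already kept.
    """
    # Everything hosted on a known-bad IP goes, regardless of content.
    if page["ip"] in BAD_IPS:
        return "ip"
    # Likewise for everything served by a known-bad nameserver.
    if page["nameserver"] in BAD_NAMESERVERS:
        return "dns"
    body = page["html"]
    # PPC pages tend to carry a telltale string.
    lowered = body.lower()
    if any(marker in lowered for marker in PPC_MARKERS):
        return "ppc"
    # Exact duplicates share an MD5 hash of the page body.
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return "duplicate"
    seen_hashes.add(digest)
    return None
```

Anything that returns None here is not necessarily clean. It has simply
fallen through the cheap checks and into the pile that needs a human
decision.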

> This should be shared and open, we all shouldn't have to be doing  this 
> independently.  It's part of Jimmy's vision to have an open wiki  serve 
> as the social gathering and common ground for the human side of  this 
> crawling equation.  I very much believe in it as a great way to  move 
> the whole search industry beyond everyone having to re-do all  this work.

Some of us (those lucky enough to survive in the search engine wars) 
have been doing this kind of work independently for years.  We do talk 
to each other but there is a slight attitude of "better him than me" 
when some other search engine venture goes dot.bomb. Some of the 
techniques and methodology of search engine development are closely held 
- none more closely than a good search index. The tools for building 
search engines are widely available (Nutch etc). It is the human element
of the equation that is in short supply.

Many on the second and third tiers of the search business (those below
GYM: Google, Yahoo and Microsoft) have been talking on internet fora
and lists for years. Having spent years developing a good search index,
many of these people would not particularly want to give up such an
edge. Though the wiki idea is nice, the mindset is somewhat different
to that of Wikipedia and the whole "Cathedral and the Bazaar" model.
Most search engine developers are too busy trying to
survive without having to subscribe to some happy-clappy ethos that 
could very well put them out of business. These are the guys who you 
will have to convince that there is some value to being involved in the 
Wikia search project.

> Perhaps we should all be working together and sharing resources, so  
> that any value or uniqueness that a new "mini" engine might add isn't  
> lost in the noise of duplicating all the other effort to get there.

That's all very laudable but this is a business. The small search 
engines are not going to hand over their survival edge to Jimmy's 
vision, which is essentially that of a competitor who will take their 
work and monetise it. That is the road block that the project has to get
beyond.

> I'm more of a platform guy, and want to build a great foundation to  let 
> anyone and everyone answer the "searching for what" question,  building 
> common methods for feedback to ensure quality is happening  for everyone 
> equally, not a specific type of search application.

But without that essential spark of the search engine developers, there 
is a danger that the project could just be another platform - much like 
Amazon's search and servers product. Being a search engine developer is 
not the same as being a web developer. There is a lot more thinking and
learning involved. Most thinking is about the "searching for what" 
question. It defines the nature of the search engine being developed. It 
makes the search engine a macro search engine or a niche engine. It 
makes the difference between success and failure.

Having a platform for open search is nice. It might attract some search 
engine developers. Having a real search idea to go with that platform is 
better. Is Wikia search just an open platform without an idea for a 
search application?

Regards...jmcc
-- 
******************************************************
John McCormac  *  e-mail: jmcc at whoisireland.com
MC2            *  voice:  +353-51-873640
22 Viewmount   *  web:  http://www.whoisireland.com/
Waterford      *  blog: http://blog.whoisireland.com
Ireland        *  Irish Domain Stats & Market Research
******************************************************


