[Search-l] Grub Update
John McCormac
jmcc at hackwatch.com
Thu Aug 2 21:18:05 UTC 2007
jer wrote:
> I've been through the same process and very much want to point out
> something: It's *both* high automation *and* human oversight/ tweaking.
> Exactly what you did, and I did, and everyone who's done any amount of
> serious crawling has to do, is add a crap-ton of human intelligence to
> the massive automation process, with constant feedback as things jut
> out here and there.
Yes Jer, but you don't know what I've done and vice versa. So I don't
know if we have been through quite the same process. :)
The thing about search on this or any other significant scale is that it
requires a completely different mindset to that required for building a
web directory or wiki where each entry can be individually validated. I
don't think that adding a 'crap-ton' of human intelligence to the
process is an accurate description of what happens.
The indications that mark a site for deletion tend to be clear and it is
the speed at which this happens that is important. Sometimes, this has
to be applied to every website on an IP or even on the same DNS. It is a
very anti-democratic process. Some are easy wins - linkswamps that can
be identified by a DNS or IP, PPC that can be identified from a
particular string, duplicate content pages that all have the same MD5
hash etc. The hard part is when it goes beyond the easy wins to the
stuff that requires a human decision.
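As a rough illustration of the easy wins described above, a minimal Python sketch might look like the following. Everything specific in it is an assumption made for illustration: the blocked IP address, the PPC marker string, the layout of the page dictionaries and the function names are all hypothetical and not taken from any particular search engine.

```python
import hashlib

# Hypothetical blocklists; real ones would come from crawl analysis.
BLOCKED_IPS = {"192.0.2.10"}             # assumed linkswamp host IP
PPC_MARKERS = ("sponsored-listings",)    # assumed PPC template string


def easy_win_reason(page, seen_hashes):
    """Return a reason string if the page is an easy deletion, else None."""
    # Linkswamps: every site on a known-bad IP is dropped at once.
    if page["ip"] in BLOCKED_IPS:
        return "linkswamp-ip"
    body = page["body"]
    # PPC pages: identified by a particular string in the content.
    if any(marker in body for marker in PPC_MARKERS):
        return "ppc-string"
    # Duplicate content: identical bodies share the same MD5 hash.
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return "duplicate-md5"
    seen_hashes.add(digest)
    return None


def filter_pages(pages):
    """Split pages into those kept and those dropped, with reasons."""
    seen = set()
    kept, dropped = [], []
    for page in pages:
        reason = easy_win_reason(page, seen)
        if reason is None:
            kept.append(page)
        else:
            dropped.append((page["url"], reason))
    return kept, dropped
```

Anything that passes these checks is not necessarily good; it simply falls into the harder category that still needs a human decision.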
> This should be shared and open, we all shouldn't have to be doing this
> independently. It's part of Jimmy's vision to have an open wiki serve
> as the social gathering and common ground for the human side of this
> crawling equation. I very much believe in it as a great way to move
> the whole search industry beyond everyone having to re-do all this work.
Some of us (those lucky enough to survive in the search engine wars)
have been doing this kind of work independently for years. We do talk
to each other but there is a slight attitude of "better him than me"
when some other search engine venture goes dot.bomb. Some of the
techniques and methodology of search engine development are closely held
- none more closely than a good search index. The tools for building
search engines are widely available (Nutch etc). It is the human element
of the equation that is in short supply.
Many on the second and third tiers (those below GYM
(Google/Yahoo/Microsoft)) of the search business have been talking on
internet fora and lists for years. Having spent years developing a good
search index, many of these people would not particularly want to give
up such an edge. Though the wiki idea is nice, the mindset is somewhat
different to that of Wikipedia and the whole "Cathedral and the
Bazaar" model. Most search engine developers are too busy trying to
survive without having to subscribe to some happy-clappy ethos that
could very well put them out of business. These are the guys who you
will have to convince that there is some value to being involved in the
Wikia search project.
> Perhaps we should all be working together and sharing resources, so
> that any value or uniqueness that a new "mini" engine might add isn't
> lost in the noise of duplicating all the other effort to get there.
That's all very laudable but this is a business. The small search
engines are not going to hand over their survival edge to Jimmy's
vision, which is essentially that of a competitor who will take their
work and monetise it. That is the road block that the project has to get
beyond.
> I'm more of a platform guy, and want to build a great foundation to let
> anyone and everyone answer the "searching for what" question, building
> common methods for feedback to ensure quality is happening for everyone
> equally, not a specific type of search application.
But without that essential spark of the search engine developers, there
is a danger that the project could just be another platform - much like
Amazon's search and servers product. Being a search engine developer is
not the same as being a web developer. There is a lot more thinking and
learning involved. Most thinking is about the "searching for what"
question. It defines the nature of the search engine being developed. It
makes the search engine a macro search engine or a niche engine. It
makes the difference between success and failure.
Having a platform for open search is nice. It might attract some search
engine developers. Having a real search idea to go with that platform is
better. Is Wikia search just an open platform without an idea for a
search application?
Regards...jmcc
--
******************************************************
John McCormac * e-mail: jmcc at whoisireland.com
MC2 * voice: +353-51-873640
22 Viewmount * web: http://www.whoisireland.com/
Waterford * blog: http://blog.whoisireland.com
Ireland * Irish Domain Stats & Market Research
******************************************************