[Search-l] Grub Update
jer
jeremie at jabber.org
Thu Aug 2 18:21:26 UTC 2007
> The problem with building a good index is that this kind of work is
> never really seen or heard about. The enthusiasts tend to think that
> they know how search engines work and, to a certain extent, they do.
> But they do not appreciate what goes into creating and maintaining a
> high quality search index. This process has to be highly automated to
> be successful as handling millions of websites is not something that
> can be done efficiently by hand.
I've been through the same process and very much want to point out
something: it's *both* high automation *and* human
oversight/tweaking. Exactly what you did, and I did, and everyone
who's done any amount of serious crawling has to do, is add a
crap-ton of human intelligence to the massive automation process,
with constant feedback as things jut out here and there.
This should be shared and open; we shouldn't all have to be doing
this independently. It's part of Jimmy's vision to have an open wiki
serve as the social gathering and common ground for the human side of
this crawling equation. I very much believe in it as a great way to
move the whole search industry beyond everyone having to re-do all
this work.
> The reason that most of these mini search engines fail after eighteen
> months or so is because they run into the brick wall of the
> acquisition problem. (Similar to that of the web directories that rely
> on user submissions.) They have to compete with search engines like
> Google that are far better equipped and URL detection is not the most
> efficient way of detecting new sites. Many new sites are not linked.
> It often takes some time for the linkbacks to appear in directories.
> And since Google has the greatest footprint, the site owners will
> often submit them to Google. This gives Google a major head start on
> the dwindling number of active web directories.
Perhaps we should all be working together and sharing resources, so
that any value or uniqueness that a new "mini" engine might add isn't
lost in the noise of duplicating all the other effort to get there.
> The cost of a high quality crawl is probably a magnitude or so lower
> than those estimates that have been published. Most of the ones I've
> read fail to take into consideration the numbers of duplicate, PPC,
> holding pages and assorted junk in an extension. This is the stuff
> that is removed in the pre-index process. They extrapolate the number
> of domains to the number of websites and work from there. The reality
> is that the webspace of most extensions is like a large, bumpy plain
> with a handful of skyscrapers and a lot of small tents. The
> interesting thing is that the ccTLDs tend to be different to the TLDs
> like .com etc.
> The Irish .ie extension had an active development figure of
> approximately 57%. I haven't worked out a figure for .uk yet but I
> would expect it to be somewhat higher than that of .com or .eu.
Thanks for sharing what you're learning. I wish everyone was this
open about their own discoveries, even if only informally or in
discussions like this.
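To make the quoted point about extrapolating from domain counts
concrete, here is a rough sketch of the arithmetic. It assumes that
the "active development figure" means the share of registered domains
that host an actively developed site. The 57% value is the one quoted
above for .ie; the domain count and the function name are hypothetical
and only there for illustration.

```python
def estimate_active_sites(registered_domains: int, active_fraction: float) -> int:
    """Estimate how many registered domains host an actively developed site.

    active_fraction is assumed to be the share of domains left after
    duplicates, PPC pages, holding pages and other junk are removed.
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return round(registered_domains * active_fraction)


# Hypothetical extension size; not a real registry figure.
registered = 100_000
active = estimate_active_sites(registered, 0.57)  # 57% quoted above for .ie
junk = registered - active  # domains a naive per-domain estimate would still count

print(f"sites worth crawling: {active}, filtered out before indexing: {junk}")
```

An estimate that prices the crawl per registered domain would count
the filtered-out share as well, which is one way such estimates can
end up too high.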
> Most of the work in a high quality crawl actually goes into building a
> high quality index as its starting point. It is then a process of
> continual refinement. This is why I tend to wonder about distributed
> search when there is no corresponding thought being put into the
> critical question of "searching for what?".
I'm more of a platform guy, and want to build a great foundation to
let anyone and everyone answer the "searching for what" question,
building common methods for feedback to ensure quality is happening
for everyone equally, not a specific type of search application.
Jer