[Search-l] Grub Update
John McCormac
jmcc at hackwatch.com
Wed Aug 1 02:34:26 UTC 2007
Jimmy Wales wrote:
> One of the first jobs for the OS version of the client is to make
> absolutely 100% sure that it behaves itself exquisitely well, both for
> the clients and for the sites being crawled.
Unfortunately it is not a question of the behaviour of Grub or any other
crawler. The owners of large directories and sites tend to be far more
aggressive now in protecting their resources. That means that many of
them are tired of scrapers and bots and will block anything outside the
Google/Yahoo/Microsoft crawlers. Others have blocked entire countries at
the IP level or by domain extension (TLD).
> I think this misunderstands how Grub works. Grub distributes the
> crawling and checking to see if sites have changed; it does not
> distribute the decisionmaking about which sites to crawl. In this sense
> it is much more like SETI@home than like Gnutella networks or the like.
> It is "distributed" not "peer to peer".
Again this runs into the "shoot on sight" attitude of some webmasters.
The crawler will be seen as coming from dynamic/dialup IP ranges, many
of which are already iffy due to scrapers. With the main search engines,
the IPs have proper reverse DNS so that webmasters can be certain that
they are who they claim to be.
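
The reverse DNS check mentioned above is usually done as a forward-confirmed lookup: resolve the visiting IP to a host name, check the name against the crawler's known domains, then resolve the name back and confirm it returns the same IP. The following is a minimal sketch of that idea, not any search engine's official procedure. The suffix list and the function names are assumptions chosen for illustration, and the lookups are IPv4 only.

```python
import socket

# Hypothetical allow-list of crawler host suffixes; illustrative only,
# check each search engine's own documentation for the real values.
TRUSTED_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")


def reverse_lookup(ip):
    """Return the PTR host name for ip, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(host):
    """Return the set of IPv4 addresses that host resolves to."""
    try:
        return set(socket.gethostbyname_ex(host)[2])
    except OSError:
        return set()


def is_verified_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                        reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for a visiting IP address."""
    host = reverse(ip)
    if host is None:
        return False  # no PTR record at all, typical of many dynamic ranges
    host = host.rstrip(".").lower()
    if not host.endswith(tuple(suffixes)):
        return False  # PTR exists but is not under a trusted domain
    # The PTR record alone can be forged by whoever controls the IP block,
    # so the name must also resolve back to the same address.
    return ip in forward(host)
```

A distributed client running on a home connection would fail the first or second step, which is the practical problem described above: the webmaster has no cheap way to tell it apart from a scraper.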
> And YES you are 100% right - crawling is only a piece of the search
> solution. In theory a distributed crawler can spider the web more
> quickly and thoroughly than a centralized solution. And another part of
> the theory here is that by reducing the *cost* of a high quality crawl,
> it becomes possible to make the *results* of the crawl available under a
> free license. (Which, of course, Wikia will do no matter what the cost,
> because that's the whole point of what we are doing here.)
In June, I spidered the index pages from all active .eu websites from a
tracking dataset of .eu domains (approx 1.436M websites out of 1.78M
actively resolving domains from a list of 2.13M .eu domains). The aim
was to create some estimate of how many active .eu websites there were.
The results were quite startling - only about 16.13% of the domains with
websites (roughly 19.90% of the websites) were actively developed. The
data was then broken down over active websites, parked sites, holding
pages, frame src redirects etc. A similar first run on .mobi had only
10% of the websites actively developed and that was before any dupe and
holding page algorithms were applied to the data.
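
To give a rough idea of what a breakdown like this involves, here is a minimal sketch of classifying index pages into the categories named above. It is not the method used for the .eu or .mobi runs. The marker phrases, the 200-character threshold and all function names are hypothetical, and a real run would need far more careful rules.

```python
import hashlib
import re

# Hypothetical marker phrases; a real run would need per-registrar tuning.
PARKED_MARKERS = ("domain is for sale", "buy this domain", "sponsored listings")
HOLDING_MARKERS = ("coming soon", "under construction", "default web page")

FRAME_SRC = re.compile(r"<i?frame[^>]*\bsrc\s*=", re.IGNORECASE)
TAGS = re.compile(r"<[^>]+>")


def visible_text(html):
    """Crudely strip tags, collapse whitespace and lower-case the result."""
    return " ".join(TAGS.sub(" ", html).split()).lower()


def classify(html):
    """Assign one index page to a coarse category."""
    text = visible_text(html)
    if any(marker in text for marker in PARKED_MARKERS):
        return "parked"
    if any(marker in text for marker in HOLDING_MARKERS):
        return "holding"
    # A frame with a src and almost no text of its own is treated as a
    # redirect. The 200-character threshold is an arbitrary assumption.
    if FRAME_SRC.search(html) and len(text) < 200:
        return "frame_redirect"
    return "candidate_active"


def fingerprint(html):
    """Hash of the normalized text, used for exact-duplicate detection."""
    return hashlib.sha1(visible_text(html).encode("utf-8")).hexdigest()


def breakdown(pages):
    """pages: dict of domain -> index-page HTML. Returns counts per category."""
    counts = {}
    seen = set()
    for html in pages.values():
        label = classify(html)
        fp = fingerprint(html)
        if label == "candidate_active" and fp in seen:
            label = "duplicate"  # same text already seen on another domain
        seen.add(fp)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Only the pages left in `candidate_active` after this kind of pass would count towards an "actively developed" figure, and even those would need further checks.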
The problem with building a good index is that this kind of work is
never really seen or heard about. The enthusiasts tend to think that
they know how search engines work and, to a certain extent, they do. But
they do not appreciate what goes into creating and maintaining a high
quality search index. This process has to be highly automated to be
successful as handling millions of websites is not something that can be
done efficiently by hand.
The reason that most of these mini search engines fail after eighteen
months or so is that they run into the brick wall of the acquisition
problem. (Similar to that of the web directories that rely on user
submissions.) They have to compete with search engines like Google that
are far better equipped, and URL detection is not the most efficient way
of detecting new sites. Many new sites are not linked. It often takes
some time for the linkbacks to appear in directories. And since Google
has the greatest footprint, the site owners will often submit them to
Google. This gives Google a major head start on the dwindling number of
active web directories.
The cost of a high quality crawl is probably an order of magnitude or so
lower than those estimates that have been published. Most of the ones
I've read fail to take into consideration the number of duplicates, PPC
pages, holding pages and assorted junk in an extension. This is the stuff that
is removed in the pre-index process. They extrapolate the number of
domains to the number of websites and work from there. The reality is
that the webspace of most extensions is like a large, bumpy plain with
a handful of skyscrapers and a lot of small tents. The interesting
thing is that the ccTLDs tend to differ from gTLDs such as .com.
The Irish .ie extension had an active development figure of
approximately 57%. I haven't worked out a figure for .uk yet but I would
expect it to be somewhat higher than that of .com or .eu.
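
The gap between a domain count and the number of sites worth indexing can be put into numbers using the June .eu figures quoted earlier in this message. The arithmetic below is only a sketch: the constants are the approximate figures from the text, the function and key names are made up for illustration, and it says nothing about the published cost estimates themselves.

```python
# Approximate figures quoted in this message for the June .eu run.
LISTED_DOMAINS = 2_130_000       # domains in the list
RESOLVING_DOMAINS = 1_780_000    # domains that actively resolved
WEBSITES = 1_436_000             # resolving domains that had a website
DEVELOPED_SHARE_OF_WEBSITES = 0.1990  # share judged actively developed


def funnel(listed, resolving, websites, developed_share):
    """Summarise how a raw domain count shrinks to developed websites."""
    developed = round(websites * developed_share)
    return {
        "resolving_per_listed": resolving / listed,
        "websites_per_resolving": websites / resolving,
        "developed_websites": developed,
        # How far off an estimate is if every listed domain is
        # treated as a site that needs a full crawl.
        "naive_overestimate": listed / developed,
    }


summary = funnel(LISTED_DOMAINS, RESOLVING_DOMAINS, WEBSITES,
                 DEVELOPED_SHARE_OF_WEBSITES)
for name, value in summary.items():
    print(name, round(value, 3))
```

On these figures, treating every listed .eu domain as a site to be crawled overstates the developed web by a factor of a little over seven, before any allowance is made for the size of the sites themselves.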
Most of the work in a high quality crawl actually goes into building a
high quality index as its starting point. It is then a process of
continual refinement. This is why I tend to wonder about distributed
search when there is no corresponding thought being put into the
critical question of "searching for what?".
Regards...jmcc
--
******************************************************
John McCormac * e-mail: jmcc at whoisireland.com
MC2 * voice: +353-51-873640
22 Viewmount * web: http://www.whoisireland.com/
Waterford * blog: http://blog.whoisireland.com
Ireland * Irish Domain Stats & Market Research
******************************************************