[Search-l] Grub Update

peter burden peter.burden at gmail.com
Wed Aug 1 23:42:06 UTC 2007


John McCormac wrote:
> Jimmy Wales wrote:
>   
>> One of the first jobs for the OS version of the client is to make 
>> absolutely 100% sure that it behaves itself exquisitely well, both for 
>> the clients and for the sites being crawled.
>>     
>
>   
In which case 'grub' has a long way to go. As far as I can tell from the
Microsoft Visual C++ code there is no support for robot exclusion.
"robots.txt" is mentioned in a "todo" list and there's a function that
recognises "robots.txt" in a URL, but the function doesn't appear to be
called anywhere. There's no mention of exclusion via the <meta> tag
attributes. The use of the "if-modified-since" HTTP request header is
hinted at in a "todo" list but the code doesn't seem to take advantage
of it. I've no idea how it controls per-server traffic; possibly it
relies on the "random" selection of sites to "spread the load".
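
As a rough illustration of the two missing pieces named above (robot
exclusion and per-server traffic control), here is a minimal Python sketch
using only the standard library. It is not taken from the grub source; the
class name, the user agent string and the five second delay are all
assumptions made up for the example.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical name, not grub's real user agent


class PoliteGate:
    """Decides whether, and how soon, a URL may be fetched."""

    def __init__(self, min_delay=5.0):
        self.min_delay = min_delay   # assumed seconds between hits on one host
        self.rules = {}              # host -> parsed robots.txt
        self.last_hit = {}           # host -> time of the previous request

    def load_robots(self, host, robots_text):
        # robots_text is the body of the host's /robots.txt, fetched elsewhere
        parser = RobotFileParser()
        parser.parse(robots_text.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        # No robots.txt loaded for this host yet: refuse rather than guess
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def wait_time(self, url, now=None):
        # Seconds still to wait before the host may be contacted again
        host = urlsplit(url).netloc
        now = time.monotonic() if now is None else now
        last = self.last_hit.get(host)
        return 0.0 if last is None else max(0.0, self.min_delay - (now - last))

    def record_hit(self, url, now=None):
        host = urlsplit(url).netloc
        self.last_hit[host] = time.monotonic() if now is None else now
```

A crawler loop would call allowed() and wait_time() before every request
and record_hit() after it. Exclusion via <meta> tags and conditional
requests would need separate handling and are not covered here.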

Crawling "at random" seems to me a bad idea for a variety of reasons. If the
randomness implies random URLs on random sites then, in order to be well
behaved, the crawler needs to fetch the "robots.txt" file for each site
prior to fetching the actual URL, creating a significant extra network and
server overhead. There will also be overheads from DNS lookups. [Those who
have written crawlers that I have looked at seem to have found that DNS can
represent a significant bottleneck.]
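
To make the overhead concrete, the following Python sketch counts the
per-host setup steps (one DNS lookup plus one "robots.txt" fetch) for a list
of URLs. It assumes, purely for illustration, that a crawler handling URLs
in isolation caches nothing between them; the function names are invented
for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each host to the URLs queued for it, keeping first-seen order."""
    batches = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip anything without a host part
            batches[host].append(url)
    return dict(batches)


def setup_cost(urls):
    """Count per-host setup steps (DNS lookup plus robots.txt fetch).

    Returns a pair: the count if every URL is handled in isolation with no
    caching, and the count if URLs are batched so each host is set up once.
    """
    batches = group_by_host(urls)
    isolated = sum(len(batch) for batch in batches.values())
    return isolated, len(batches)
```

For four URLs spread over two hosts, setup_cost() gives four setups in
isolation against two when batched; the gap widens as more URLs per host
are crawled together.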

Randomness also prevents the use of cookies as a strategy to crawl
dynamic sites.

>
>> And YES you are 100% right - crawling is only a piece of the search 
>> solution.  In theory a distributed crawler can spider the web more 
>> quickly and thoroughly than a centralized solution.  And another part of 
>> the theory here is that by reducing the *cost* of a high quality crawl, 
>> it becomes possible to make the *results* of the crawl available under a 
>> free license.  (Which, of course, Wikia will do no matter what the cost, 
>> because that's the whole point of what we are doing here.)
>>     
I don't think this theory is necessarily right. If the crawl targets have
to be distributed from a central controlling server and the results sent
back then the traffic level on the central server is going to be of the
same order of magnitude as if the central machine crawled directly. The
total network traffic will be greater as the crawled data has to make two
network journeys (one from remote site to crawler, one from crawler to
central machine).
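
The arithmetic behind this can be sketched in a few lines of Python. The
model is deliberately crude and its parameters are assumptions: URL lists
and protocol overhead are ignored, and return_ratio is a hypothetical knob
for how large the returned data is relative to what was fetched.

```python
def crawl_traffic(pages, page_bytes, return_ratio=1.0):
    """Compare bytes moved by a central crawl and a distributed one.

    return_ratio is the assumed size of what a remote crawler sends back,
    relative to what it fetched (1.0 means everything, unchanged).
    URL lists and protocol overhead are ignored.
    """
    fetched = pages * page_bytes
    returned = fetched * return_ratio
    return {
        # Central crawl: every page crosses the network once, to the centre
        "central": {"central_server": fetched, "network_total": fetched},
        # Distributed crawl: site -> crawler, then crawler -> centre
        "distributed": {
            "central_server": returned,
            "network_total": fetched + returned,
        },
    }
```

With return_ratio at 1.0 the central server sees the same volume either
way while the network as a whole carries twice as much. A ratio above 1.0
models a verbose encoding of the returned data, a ratio below 1.0 models
compression or returning only extracted text.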

If you don't feed all the results back to a central machine then there's
lots of extra network traffic as various factories, brokers and collectors
talk to each other to try and find documents (or references to documents)
that satisfy search criteria. There's also likely to be duplication of
server accesses.

If the traffic from crawler to/from central machine is encoded in some form
as inefficient as XML then the situation is even worse.
>
> In June, I spidered the index pages from all active .eu websites from a
> tracking dataset of .eu domains (approx 1.436M websites out of 1.78M
> actively resolving domains from a list of 2.13M .eu domains). The aim
> was to create some estimate of how many active .eu websites there were.
> The results were quite startling - only about 16.13% of the domains with
> websites (roughly 19.90% of the websites) were actively developed. The
> data was then broken down over active websites, parked sites, holding
> pages, frame src redirects etc. A similar first run on .mobi had only
> 10% of the websites actively developed and that was before any dupe and
> holding page algorithms were applied to the data.
>
And that's only some of the problems. ;-)
>
> The problem with building a good index is that this kind of work is
> never really seen or heard about. The enthusiasts tend to think that
> they know how search engines work and, to a certain extent, they do. But
> they do not appreciate what goes into creating and maintaining a high
> quality search index. This process has to be highly automated to be
> successful as handling millions of websites is not something that can be
> done efficiently by hand.
>
> The reason that most of these mini search engines fail after eighteen
> months or so is because they run into the brick wall of the acquisition
> problem. (Similar to that of the web directories that rely on user
> submissions.) They have to compete with search engines like Google that
> are far better equipped and URL detection is not the most efficient way
> of detecting new sites. Many new sites are not linked. It often takes
> some time for the linkbacks to appear in directories. And since Google
> has the greatest footprint, the site owners will often submit them to
> Google. This gives Google a major head start on the dwindling number of
> active web directories.
>
>   
Well I'm glad somebody else has spotted this. If you don't believe it, note
down a few URLs off vans, lorries, buses, yellow pages, local shops etc.,
and then do an advanced Google search for pages that link to the domain
home page. I managed to get a zone transfer of ".org.uk" some time ago
and did this check - as I recall I was seeing ~20-30% of active web sites
having no incoming links according to Google.
> Most of the work in a high quality crawl actually goes into building a
> high quality index as its starting point. It is then a process of
> continual refinement. This is why I tend to wonder about distributed
> search when there is no corresponding thought being put into the
> critical question of "searching for what?".
>   
In my experience an equally significant effort is required for setting up
and tweaking filters to reject unwanted and irrelevant documents, and for
avoiding any one of several >>interesting<< spider traps.
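
One simple kind of filter is a URL-shape check applied before a link is
queued. The Python sketch below flags URLs that are very long, very deep,
or that repeat the same path segment several times, which are common
symptoms of a trap. The thresholds are arbitrary assumptions chosen for
illustration and would need tuning against real crawl data.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed limits, chosen only for illustration
MAX_URL_LENGTH = 256
MAX_PATH_DEPTH = 12
MAX_SEGMENT_REPEATS = 3


def looks_like_trap(url):
    """Return True if the URL's shape suggests a spider trap."""
    if len(url) > MAX_URL_LENGTH:
        return True
    # Non-empty path components, e.g. /a/b/c.html -> ["a", "b", "c.html"]
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return True
    # The same directory name recurring is typical of looping relative links
    counts = Counter(segments)
    return any(n >= MAX_SEGMENT_REPEATS for n in counts.values())
```

A check like this only catches traps that show up in the URL itself;
calendar pages, session identifiers and the like need their own rules.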


> Regards...jmcc
>   




