[Search-l] (crawler) Re: call to action....

peter burden peter.burden at gmail.com
Tue Jun 12 22:59:46 UTC 2007


jer wrote:
>> Incidentally if we do build on Sami's software I can offer a crawler
>> that will do 50 pages/sec using two very modest domestic PCs (one
>> crawling/parsing and one saving metadata in a MySQL database). It's
>> written in C and is multi-threaded.
>
> I'm pretty sure that would be useful all by itself if you're interested.
>
> I'm curious, since there's a few other C(or C++)-based crawlers 
> (larbin, htdig, wget) how yours might compare? Do you do anything 
> special with dns, robots, duplicate detection, rate management, spider 
> traps, hostname reduction, etc? Just wondering what aspect inspired 
> you to create another one, maybe it was just code style/control
Larbin development seems to have stopped in 2002, and htdig in 2004.
htdig is actually a full-function site search engine.

OK, it was originally developed to gather data for a research project
many years ago and has acquired a life of its own, primarily to collect
statistical data.
DNS is done using the high speed ADNS library, multiple IP addresses and 
DNS expiry dates are handled.
Robot exclusion is normal (robots.txt + meta tags); it also (like Google)
recognises "allow" and "crawl-delay", with regular expression matching
as an option real soon. Matching is case-insensitive if a Microsoft
server is detected.
Duplicate detection is simple, based on a page "checksum".
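
As a rough illustration of checksum-based duplicate detection, here is a
minimal sketch. The post does not say which checksum the crawler uses;
FNV-1a (64-bit) is my own stand-in, and the name page_checksum is
hypothetical. Two pages with the same checksum are treated as candidate
duplicates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compute a 64-bit FNV-1a hash over the page body.
 * Assumption: FNV-1a is used here only as an example checksum;
 * the crawler's actual algorithm is not stated in the post.
 */
static uint64_t page_checksum(const unsigned char *data, size_t len)
{
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a 64-bit offset basis */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= data[i];                      /* mix in the next byte */
        h *= 1099511628211ULL;             /* FNV-1a 64-bit prime */
    }
    return h;
}
```

A checksum like this only catches byte-identical pages; pages that
differ by a timestamp or a session id would need normalising first.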
Rate management is controlled by configuration (default is a gap of at 
least 8 seconds between successive
accesses) - doesn't dynamically track. Also honours "crawl-delay" - if 
the value is reasonable. Rate control
is applied on a per-server (as distinct from per-host) basis. Normally 
moves on to new site after the expiry
of a configurable "hold" time - i.e. after a bit go off and pester 
somebody else.
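
The rate rule described above could be sketched roughly as follows. The
struct, the function names and the 60-second cut-off for a "reasonable"
crawl-delay are all assumptions of mine; only the 8-second default comes
from the description. The sketch also assumes that an unreasonable or
smaller crawl-delay is simply ignored in favour of the default.

```c
#include <time.h>

/* Hypothetical per-server politeness state; field names are illustrative. */
struct server_rate {
    time_t last_access;  /* time of the most recent request to this server */
    long   crawl_delay;  /* Crawl-delay from robots.txt in seconds, -1 if absent */
};

#define DEFAULT_GAP_SECS     8   /* default gap given in the post */
#define MAX_REASONABLE_DELAY 60  /* assumed limit for a "reasonable" value */

/* Return the minimum gap, in seconds, to enforce for this server. */
static long effective_gap(const struct server_rate *s)
{
    long gap = DEFAULT_GAP_SECS;

    /* Honour Crawl-delay only if it lengthens the gap and is reasonable. */
    if (s->crawl_delay > gap && s->crawl_delay <= MAX_REASONABLE_DELAY)
        gap = s->crawl_delay;
    return gap;
}

/* Return the earliest time at which the server may be contacted again. */
static time_t next_allowed(const struct server_rate *s)
{
    return s->last_access + effective_gap(s);
}
```

Keeping one such record per server (rather than per host name) means
that several virtual hosts on the same machine share a single gap.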
Spider traps are detected (probably not all). The calendar trap (I call
it an open chain) is automatically detected and following is terminated
when no new data is found, long before it gets to June 2366 etc.
Repeated directory names are detected too, including examples where a
group of 2 or 3 different names is repeated. Apache directory listings
are also recognised.
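
To illustrate the repeated-directory check, here is a minimal sketch
that flags a path in which a group of 1 to 3 names repeats several times
in a row (for example /a/b/a/b/a/b/). The function name, the segment
limit and the min_repeats parameter are hypothetical, and this is not
necessarily how the crawler itself does it.

```c
#include <string.h>

#define MAX_SEGS 64  /* assumed upper bound on path segments examined */

/* Compare two path segments given as (pointer, length) pairs. */
static int seg_eq(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/*
 * Return 1 if the path contains a group of 1..3 names repeated at
 * least min_repeats times consecutively, else 0.
 * Call with min_repeats >= 2; the query string is not handled.
 */
static int has_repeated_dirs(const char *path, int min_repeats)
{
    const char *seg[MAX_SEGS];
    size_t len[MAX_SEGS];
    int n = 0;
    int period, i, run;

    /* Split the path on '/' and skip empty segments. */
    while (*path && n < MAX_SEGS) {
        while (*path == '/')
            path++;
        if (!*path)
            break;
        seg[n] = path;
        len[n] = strcspn(path, "/");
        path += len[n];
        n++;
    }

    for (period = 1; period <= 3; period++) {
        run = 0;  /* consecutive segments equal to the one `period` back */
        for (i = period; i < n; i++) {
            if (seg_eq(seg[i], len[i], seg[i - period], len[i - period]))
                run++;
            else
                run = 0;
            /* period * (min_repeats - 1) matches means min_repeats copies */
            if (run >= period * (min_repeats - 1))
                return 1;
        }
    }
    return 0;
}
```

With min_repeats set to 3, /a/b/a/b/a/b/index.html is flagged while
/a/b/a/c/ is not.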
Not quite sure what you mean by host name reduction - possibly what I 
call alias detection. This is done
automatically with detection of similar as well as identical "root" pages.
Hosts can also be grouped by "institution", i.e. all the web servers at 
a university can be recognised and
grouped together based on knowledge of the allocated IP block.
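
Grouping by allocated IP block amounts to a prefix match on the server's
address. A minimal IPv4 sketch follows; the struct, the function names
and the example block are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record describing one institution's allocated IPv4 block. */
struct ip_block {
    uint32_t    network;      /* network address, host byte order */
    int         prefix_len;   /* CIDR prefix length, 0..32 */
    const char *institution;  /* label for the group */
};

/* Return 1 if addr (host byte order) lies inside the block, else 0. */
static int in_block(uint32_t addr, const struct ip_block *b)
{
    /* A shift by 32 is undefined in C, so treat /0 as a special case. */
    uint32_t mask = b->prefix_len == 0
                  ? 0
                  : 0xFFFFFFFFu << (32 - b->prefix_len);

    return (addr & mask) == (b->network & mask);
}

/* Return the institution owning addr, or NULL if no block matches. */
static const char *find_institution(uint32_t addr,
                                    const struct ip_block *blocks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (in_block(addr, &blocks[i]))
            return blocks[i].institution;
    return NULL;
}
```

For example, with a table entry for 192.0.2.0/24 (a documentation
range), a server at 192.0.2.77 falls in that group and one at 192.0.3.1
does not.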
URL filtering is applied on a per-host basis with simple domain matching 
(must try regex!) for acceptance
and rejection. Suffixes are also used to avoid fetching stuff that can't
be parsed (images etc.).
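
The suffix test can be as simple as the sketch below. The suffix list
and the function name are hypothetical; the real list would come from
configuration. strcasecmp is the POSIX case-insensitive compare.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Assumed example list of suffixes that are not worth fetching. */
static const char *const skip_suffixes[] = {
    ".gif", ".jpg", ".jpeg", ".png", ".zip", NULL
};

/* Return 1 if the URL path ends in one of the skipped suffixes. */
static int has_skipped_suffix(const char *url_path)
{
    size_t n = strlen(url_path);
    size_t i;

    for (i = 0; skip_suffixes[i] != NULL; i++) {
        size_t m = strlen(skip_suffixes[i]);

        /* Compare only the tail of the path, ignoring case. */
        if (n >= m && strcasecmp(url_path + n - m, skip_suffixes[i]) == 0)
            return 1;
    }
    return 0;
}
```

This assumes the query string has already been removed from the path.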
Dynamic sites are handled with automatic detection and tracking of 
session ids. This is important for Web 2.0
sites. Possibly still work to be done. Doesn't explore the "deep web".
Frame sets are dismantled. Not sure what to do about Ajax.
Cookies are accepted and returned if required. Avoids messes with URL 
rewriting done by some dynamic sites.
Saved data is in a simplified parsed canonical format suitable for
building indexes etc. Includes character set recognition and translation
to UTF-8. ASCII approximations are also generated (Quebec for Québec
etc.) and HTML such as <B>Q</B>uebec is also recognised and processed.
Links are found in JavaScript and HTML. No support for other formats at
the moment.
Metadata about URLs, pages transferred, sites explored etc. is
maintained in a MySQL database. New URLs to fetch are obtained from the
database (in batches); the SQL used avoids selection of URLs already
fetched. All pages are automatically allocated a unique id - this is for
the page, not the URL. SQL is immensely useful for obtaining statistics
about the web - average page size, age, number of links, distributions
of the same, etc.
The database includes a link table detailing links between pages; in-
and out-link counts for pages are maintained "on-the-fly" during a crawl.
Code is ANSI C with POSIX libraries + ADNS library and MySQL client
library. The ADNS and MySQL calls are confined to single compilation
units. About 28 KLOC.

>
> Jer
>
>