[Grub-dev] Mystery of Wikipedia ban revealed

Bartek Jasicki thindil2 at gmail.com
Sat Aug 2 13:07:18 UTC 2008


Hi

We have always had problems crawling Wikipedia pages (the servers
return an HTTP 403 error). At the beginning they sent this error
through a proxy. After we added the Accept-Encoding header to the
client's request, the Wikipedia servers started sending a different
page, but still with an HTTP 403 code. Now I have found where we made
the mistake: our clients do not send an Accept header. The affected
clients are C (all revisions), Perl (all revisions - this client is
really obsolete, please avoid it) and C# (version 0.7.3 and below) -
so nearly all of them ;)

Reason:
There is a popular PHP library for blocking spam bots called Bad
Behavior - http://www.bad-behavior.ioerror.us/
One of the criteria it uses to detect spam bots is the correctness of
the headers sent by the client (bot, browser) to the server. The
scripts check not only which headers are sent, but also whether they
conform to the HTTP standard. Example: if a client sends the header:

Accept-Encoding=gzip,gzip,deflate 

Bad Behavior blocks it as a spam bot.
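
To catch this kind of mistake on the client side before a request goes
out, something like the following could be used. This is only a rough
sketch in Python: the function name duplicate_codings is made up for
illustration, and it does not reproduce Bad Behavior's actual rules, it
only flags the repeated-coding case from the example above.

```python
def duplicate_codings(header_value):
    """Return content codings listed more than once in an
    Accept-Encoding header value (hypothetical helper)."""
    seen = set()
    duplicates = []
    for item in header_value.split(","):
        # Drop any ";q=..." parameter and normalise case and whitespace.
        coding = item.split(";", 1)[0].strip().lower()
        if not coding:
            continue
        if coding in seen and coding not in duplicates:
            duplicates.append(coding)
        seen.add(coding)
    return duplicates
```

An empty list means no coding was repeated; for the header value in the
example above the list contains "gzip".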

Many other sites that send us an HTTP 403 error code probably have
Bad Behavior installed too, so the minimal set of headers which a
client MUST send is:

Accept: */*
Accept-Encoding: gzip
Connection: close
User-Agent: (set by workunit)

Without these, many sites can detect Grub clients as spam bots.
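
For anyone patching a client, here is a small sketch of a request
carrying the header set listed above. It is written in Python with the
standard urllib.request module rather than in any of the Grub client
languages, and the names minimal_headers and build_request are
assumptions for illustration only. The User-Agent value would come from
the workunit.

```python
import urllib.request


def minimal_headers(user_agent):
    """Header set from the list above; user_agent comes from the workunit."""
    return {
        "Accept": "*/*",
        "Accept-Encoding": "gzip",
        "Connection": "close",
        "User-Agent": user_agent,
    }


def build_request(url, user_agent):
    """Build (but do not send) a request carrying the minimal header set."""
    return urllib.request.Request(url, headers=minimal_headers(user_agent))
```

Note that with Accept-Encoding: gzip the server may answer with a
gzip-compressed body, and urllib does not decompress it automatically,
so the caller has to handle that (for example with the gzip module).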

As proof - Grub C# client 0.7.4 crawls Wikipedia pages without
problems (provided the pages exist, of course).

Bartek

