[Grub-dev] Mystery of Wikipedia ban revealed

Balinny balinny at gmail.com
Sat Aug 2 14:04:17 UTC 2008


Bartek Jasicki wrote:
> Hi
>
> We have always had problems crawling Wikipedia pages (the servers
> return an HTTP 403 error). At the beginning they sent this error via
> the proxy. After we added an Accept-Encoding header to the client's
> requests, the Wikipedia servers started sending a different page, but
> still with an HTTP 403 code. Now I have found where we made the
> mistake... our clients do not send an Accept header.

I always assumed that Wikipedia has simply blocked queries with the
substring "Grub" in the User-Agent, and I stand by that. See the
evidence below. You can even see from the responses that the block
comes from the squids.
What's needed in order to crawl Wikipedia is to ask the system
administrators to lift the block (or to change the User-Agent). I don't
see why the C# client would avoid the block. Perhaps it's getting a
cached response?
However, if some packages do block requests that lack them, it may be a
good idea to add those headers, which IMHO means the workunits should
be changed.



GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: Grub 1.0
Connection: close

HTTP/1.0 403 Forbidden
Server: squid/2.6.STABLE21
X-Squid-Error: ERR_ACCESS_DENIED 0
...


GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: GrubNG 20080128
Connection: close

HTTP/1.0 403 Forbidden
Server: squid/2.6.STABLE21
X-Squid-Error: ERR_ACCESS_DENIED 0
...


GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: Balinny
Connection: close

HTTP/1.0 200 OK
Date: Sat, 02 Aug 2008 07:37:46 GMT
Server: Apache
....



GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: GrubNG 20080128
Accept: */*
Accept-Encoding: gzip
Connection: close


HTTP/1.0 403 Forbidden
Server: squid/2.6.STABLE21
X-Squid-Error: ERR_ACCESS_DENIED 0
Connection: close
...
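
For anyone who wants to repeat the comparison, here is a minimal sketch
in Python of how the requests above could be built and the replies
checked. The helper names (build_request, parse_response_head, probe)
are made up for illustration and are not part of any Grub client. It
assumes plain HTTP on port 80, as in the transcripts, and the servers
may well answer differently today than they did in 2008.

```python
import socket


def build_request(path, host, user_agent, extra_headers=None):
    """Build a raw HTTP/1.0 GET request shaped like the transcripts above."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
    ]
    # Optional headers such as Accept or Accept-Encoding (hypothetical dict).
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    lines.append("Connection: close")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def parse_response_head(raw):
    """Return (status_code, headers) parsed from raw response bytes."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            # Header names are case-insensitive, so store them lowercased.
            headers[name.strip().lower()] = value.strip()
    return status_code, headers


def probe(host, path, user_agent, extra_headers=None, timeout=10.0):
    """Send one request over a plain TCP socket and return the parsed head."""
    request = build_request(path, host, user_agent, extra_headers)
    with socket.create_connection((host, 80), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return parse_response_head(b"".join(chunks))


if __name__ == "__main__":
    page = "/wiki/Wikipedia:Selected_anniversaries"
    for agent in ("Grub 1.0", "GrubNG 20080128", "Balinny"):
        code, headers = probe("en.wikipedia.org", page, agent)
        print(agent, code, headers.get("x-squid-error"))
```

If the block is keyed on the User-Agent, the status code should change
with the agent string alone, and adding Accept or Accept-Encoding via
extra_headers should make no difference.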


