[Grub-dev] Mystery of Wikipedia ban revealed
Balinny
balinny at gmail.com
Sat Aug 2 14:04:17 UTC 2008
Bartek Jasicki wrote:
> Hi
>
> We have always had problems crawling Wikipedia pages (the servers return
> an HTTP 403 error). At the beginning they sent this error via the proxy. After
> adding an Accept-Encoding header to the client's requests, the Wikipedia
> servers started sending a different page, but still with an HTTP 403 code.
> Now I have found where we made the mistake... our clients do not send an
> Accept header.
I always assumed that Wikipedia has simply blocked requests with the
substring "Grub" in the User-Agent.
And I stand by that. See the evidence below. You can even see from the
responses that the block is applied by the Squid proxies.
What's needed in order to crawl Wikipedia is to ask the system
administrators to lift the block (or to change
the User-Agent). I don't see why the C# client would avoid it. Perhaps
it's getting a cached response?
However, if some packages do block requests that lack those headers, it
may be a good idea to add them. Which IMHO means the workunits
should be changed.

GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: Grub 1.0
Connection: close

HTTP/1.0 403 Forbidden
Server: squid/2.6.STABLE21
X-Squid-Error: ERR_ACCESS_DENIED 0
...

GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: GrubNG 20080128
Connection: close

HTTP/1.0 403 Forbidden
Server: squid/2.6.STABLE21
X-Squid-Error: ERR_ACCESS_DENIED 0
...

GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: Balinny
Connection: close

HTTP/1.0 200 OK
Date: Sat, 02 Aug 2008 07:37:46 GMT
Server: Apache
....

GET /wiki/Wikipedia:Selected_anniversaries HTTP/1.0
Host: en.wikipedia.org
User-Agent: GrubNG 20080128
Accept: */*
Accept-Encoding: gzip
Connection: close

HTTP/1.0 403 Forbidden
Server: squid/2.6.STABLE21
X-Squid-Error: ERR_ACCESS_DENIED 0
Connection: close
...
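
For anyone who wants to repeat the comparison, here is a minimal sketch in
Python, not the client's actual code. It builds the same kind of HTTP/1.0
request as in the transcripts above and treats a 403 response that carries an
X-Squid-Error header as a proxy-level block. The function names
(build_request, parse_head, blocked_by_squid, probe) are hypothetical, and
plain HTTP on port 80 is an assumption; the servers may well answer
differently today.

```python
import socket

HOST = "en.wikipedia.org"
PATH = "/wiki/Wikipedia:Selected_anniversaries"


def build_request(user_agent, extra_headers=None):
    """Build a raw HTTP/1.0 GET request like the ones in the transcripts."""
    lines = [
        f"GET {PATH} HTTP/1.0",
        f"Host: {HOST}",
        f"User-Agent: {user_agent}",
    ]
    # Optional headers such as Accept or Accept-Encoding go before Connection.
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    lines.append("Connection: close")
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_head(raw):
    """Return (status_code, headers) parsed from the head of a raw response."""
    head = raw.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    status = int(status_line.split()[1])
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status, headers


def blocked_by_squid(status, headers):
    """True if the response looks like the Squid denial seen above."""
    return status == 403 and "x-squid-error" in headers


def probe(user_agent, extra_headers=None, port=80, timeout=10):
    """Send one request over the network and return (status, headers).

    Assumption: the host still accepts plain HTTP on port 80.
    """
    request = build_request(user_agent, extra_headers).encode("ascii")
    with socket.create_connection((HOST, port), timeout=timeout) as sock:
        sock.sendall(request)
        data = b""
        # Read only until the end of the response headers.
        while b"\r\n\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_head(data.decode("iso-8859-1"))
```

Calling probe("GrubNG 20080128", {"Accept": "*/*", "Accept-Encoding": "gzip"})
and probe("Balinny") and passing each result to blocked_by_squid would repeat
the last two exchanges above. If only the User-Agent decides the outcome, the
extra headers should make no difference.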