[Grub-dev] Mystery of Wikipedia ban revealed
Bartek Jasicki
thindil2 at gmail.com
Sat Aug 2 16:02:49 UTC 2008
First, I am only guessing that Wikipedia uses Bad Behavior; I don't
have any proof of this. But the results are similar to what the Bad
Behavior library produces.
2008-08-02, 16:04:17
Balinny <balinny at gmail.com> wrote:
> I always assumed that Wikipedia simply has blocked queries with the
> substring Grub in the User-Agent.
> And I stand by it. See evidence below. You can even see from the
> queries that it is blocked by the squids.
> What's needed in order to crawl wikipedia is to ask the system
> administrators to lift the block (or to change
> the user-agent). I don't see the reason the C# client avoids it.
> Perhaps it's getting a cached response?
I looked at the results from Wikipedia in different languages and
always got an HTTP 200 code (not 304), so it is not a cached response.
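
For anyone who wants to repeat this kind of check, here is a minimal
sketch in Python, not the code of any Grub client. The function names
and the User-Agent strings are made up for illustration; the idea is
only to send the same request with different User-Agent headers and
compare the status codes that come back.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server answers with."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on non-2xx answers (304, 403, ...); we only want the code.
        return err.code


def describe(status):
    """Give a rough reading of a status code for this experiment."""
    if status == 200:
        return "full response"
    if status == 304:
        return "not modified (cached copy)"
    if status == 403:
        return "forbidden (probably blocked)"
    return "other"


if __name__ == "__main__":
    url = "http://en.wikipedia.org/wiki/Main_Page"
    # Hypothetical User-Agent strings, not the real ones sent by the clients.
    for agent in ("grub-client-example/0.0", "ExampleBrowser/1.0"):
        code = fetch_status(url, agent)
        print(agent, code, describe(code))
```

If the two agents get different codes for the same URL, the filtering
is done on the User-Agent header and not on the address.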
> However, if some packages block it, it may be a good idea to add
> those headers. Which IMHO means the workunits
> should be changed.
>
>
Bad Behavior blocks it; I'm not sure about Akismet or other anti-spam
systems. And the discussion about the workunit format must be
resurrected ;)
2008-08-03, 01:15:50
Angela <beesley at gmail.com> wrote:
> That is still the case. The exact user-agent they block is
> "grub-client". It was added many years ago and should be easy to
> appeal if proof can be shown that the problem has been fixed.
>
> http://en.wikipedia.org/robots.txt is where it's blocked.
>
> What's perhaps more interesting is that it's blocked by Wikia:
> http://www.wikia.com/robots.txt
> Like many wiki sites, we just copied Wikipedia's robots.txt a few
> years ago.
But if robots.txt blocks Grub, then the server doesn't add Wikipedia
URLs to workunits (the server checks robots.txt files, not the
clients). So this is not the fault of robots.txt (more likely Squid,
but then why does it let the C# client through if it has Grub in its
user agent too?).
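
A minimal sketch of such a server-side robots.txt check, using Python's
standard urllib.robotparser. This is not the actual Grub server code,
and the robots.txt text below is a made-up example based only on the
"grub-client" entry mentioned above, not a copy of Wikipedia's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content with a rule for "grub-client" only.
SAMPLE_ROBOTS_TXT = """\
User-agent: grub-client
Disallow: /
"""


def make_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = make_parser(SAMPLE_ROBOTS_TXT)
    candidates = ["http://en.wikipedia.org/wiki/Main_Page"]
    # A server filtering this way would hand out no Wikipedia URLs at all.
    print(allowed_urls(parser, "grub-client", candidates))
    print(allowed_urls(parser, "SomeOtherBot", candidates))
```

With a filter like this in front of the workunits, a blocked URL never
reaches a client, so a client that still gets refused is being refused
by something else.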
Bartek