[Grub-dev] Why grub is blacklisted so widely.
jer
jeremie at jabber.org
Fri Feb 1 08:31:58 UTC 2008
All of our URLs and their metadata (last crawled, ETag, last modified,
etc.) will be stored, sorted, in HBase. When a process walks that
list to extract URLs that need to be updated or crawled and places them
into a workunit, it fetches the robots.txt and applies all of its rules
to the URLs for that host. Ideally the delay between that check and the
workunit getting picked up is short, on the order of hours.
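
A rough sketch of that check, in Python, just to make the idea concrete.
This is not the actual Grub code: build_workunit, is_fresh, the "grub"
user agent string, and the 6-hour limit are all assumptions made up for
illustration (the plan above only says "hours"). The robots.txt body is
passed in as text so the example needs no network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumed freshness limit; the plan only says the delay should be "hours".
MAX_ROBOTS_AGE_SECONDS = 6 * 3600


def build_workunit(host, urls, robots_txt, user_agent="grub", now=None):
    """Keep only the URLs that robots.txt allows and record the check time."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    checked_at = time.time() if now is None else now
    return {"host": host, "urls": allowed, "robots_checked_at": checked_at}


def is_fresh(workunit, now=None):
    """True if the robots.txt check is recent enough to hand the workunit out."""
    now = time.time() if now is None else now
    return now - workunit["robots_checked_at"] <= MAX_ROBOTS_AGE_SECONDS


robots = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/a.html",
]
unit = build_workunit("example.com", candidates, robots, now=0)
```

A workunit that fails is_fresh would go back through build_workunit with
a newly fetched robots.txt instead of being handed to a client.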
Starting out, any host that has a lot of URLs in the webdb will have
only a small subset of them placed into the current workunits. This
way the "distributed" nature is implicitly throttled to make sure
there isn't any of the bad behavior you mention. The rest of the
URLs on a large host will either be crawled at the ISC with
careful consideration to speed/sensitivity, or in the future through a
much smarter API than the current workunit, one that only trusted users
will be able to access and that will require the same careful
sensitivity.
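
The per-host cap could look something like the following Python sketch.
Again this is only an illustration under assumptions: split_by_host_cap
and the cap of 2 are invented names and numbers, not anything decided
for Grub. URLs over the cap are set aside for the second tier (the ISC
crawl or the future trusted API) rather than dropped.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_by_host_cap(urls, max_per_host=2):
    """Split URLs into (workunit, deferred) with at most max_per_host per host.

    max_per_host is an assumed tuning knob, not a value from the plan.
    """
    per_host = defaultdict(int)
    workunit, deferred = [], []
    for url in urls:
        host = urlsplit(url).hostname
        if per_host[host] < max_per_host:
            per_host[host] += 1
            workunit.append(url)
        else:
            # Over the cap: leave for the slower, more careful tier.
            deferred.append(url)
    return workunit, deferred


pending = [
    "http://big.example/1",
    "http://big.example/2",
    "http://big.example/3",
    "http://small.example/1",
]
workunit, deferred = split_by_host_cap(pending, max_per_host=2)
```

Because the cap is applied when workunits are built, no set of
distributed clients can hit one host with more than that many URLs from
the current batch, no matter how many clients pick up work.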
So the best answer I have is that the architecture is centralized
around a strong shared webdb with metadata, and there are at least
two tiers of crawling activity to help protect against these common issues.
Do you (or anyone) have any other suggestions on this topic? I want
to make sure we get this right as what we're building is supposed to
be for everyone's benefit :)
Jer
On Jan 30, 2008, at 4:48 PM, Jason Pump wrote:
> A lot of webmasters become very upset at the following behaviors by
> web crawlers -
>
> - Accessing the same pages over and over again in an unreasonable
>   time frame, or the same deep-level page simultaneously from multiple
>   servers.
> - Accessing many dead links or other error pages on a web site,
>   causing site statistics (e.g. 500 errors / hour) to set off site
>   alerts.
> - Accessing a site too fast.
> - Accessing a page that is denied by robots.txt.
> - Fetching pages without first fetching robots.txt.
> - Fetching many more pages for their crawl than the users of that
>   crawl ever do.
> - Not refetching robots.txt frequently enough to pick up recent changes.
>
> Doing most of these things is what got grub a bad name the first
> time around. During development phases most of these problems can
> be avoided with good architecture and planning. Some thought and
> discussion should perhaps be given, at this point, to how to
> avoid being branded as bad netizens going forward.
>
> Jason
>
>
>
> _______________________________________________
> Grub-dev mailing list
> Grub-dev at wikia.com
> http://lists.wikia.com/mailman/listinfo/grub-dev