[Grub-dev] Why grub is blacklisted so widely.

jer jeremie at jabber.org
Fri Feb 1 08:31:58 UTC 2008


All of our urls and their metadata (last crawled, etag, last modified,
etc) will be stored (sorted) in HBase.  When a process walks that
list extracting urls that need to be updated/crawled and placed into
a workunit, it fetches the robots.txt and applies all the rules
against the urls for that host.  Ideally the delay between that
check and the workunit getting picked up is short: a matter of hours.
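
To make the robots.txt step concrete, here is a rough Python sketch of the check, not the actual Grub code. The function names, the "grub" user-agent string and the 6 hour limit are assumptions made up for the example; the only real API used is urllib.robotparser from the standard library.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, urls, user_agent="grub"):
    """Return the urls that robots_txt permits user_agent to fetch.

    robots_txt is the already-fetched file body, so no network access
    happens here. The "grub" user-agent default is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def robots_check_is_fresh(checked_at, picked_up_at, max_age_hours=6):
    """True if the workunit was picked up soon enough after the check.

    Both times are seconds since the epoch. The 6 hour default is a
    hypothetical value; the only stated goal is "hours".
    """
    return (picked_up_at - checked_at) <= max_age_hours * 3600
```

A workunit builder would call allowed_urls() once per host while walking the list, and could drop or re-check a workunit when robots_check_is_fresh() returns False.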

Starting out, any host that has a lot of urls in the webdb will only
have a small subset of them placed into the current workunits.  This
way the "distributed" nature is implicitly throttled to make sure
there isn't any of the bad behavior you mention.  The rest of the
urls on a large hostname will either be crawled at the ISC with
careful consideration to speed/sensitivity, or in the future through a
much smarter API than the current workunit, one that only trusted users
will be able to access; the same careful sensitivity will be
required in order to use that API.
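
The per-host subset idea could look something like the sketch below. It is only an illustration under assumptions: the function name and the cap of 2 urls per host are made up, and the real limit and selection policy are not decided in this thread.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def split_for_workunits(urls, per_host_cap=2):
    """Split urls into (workunit, deferred) with a per-host cap.

    At most per_host_cap urls per hostname go into the workunit list;
    the rest are deferred to the second, more careful crawling tier.
    per_host_cap=2 is a hypothetical value for illustration.
    """
    taken = defaultdict(int)  # hostname -> urls already placed
    workunit, deferred = [], []
    for url in urls:
        host = urlsplit(url).hostname
        if taken[host] < per_host_cap:
            taken[host] += 1
            workunit.append(url)
        else:
            deferred.append(url)
    return workunit, deferred
```

With a cap like this, a host with thousands of urls in the webdb contributes only a handful to the distributed clients at any one time, and everything else waits for the centralized or trusted tier.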

So the best answer I have is that the architecture is centralized  
around a strong shared webdb with metadata, and there are at least  
two tiers of crawling activity to help protect from these common issues.

Do you (or anyone) have any other suggestions on this topic?  I want  
to make sure we get this right as what we're building is supposed to  
be for everyone's benefit :)

Jer

On Jan 30, 2008, at 4:48 PM, Jason Pump wrote:

> A lot of webmasters become very upset at the following behaviors by
> web crawlers:
>
> - Accessing the same pages over and over again in an unreasonable
>   time frame, or the same deep-level page simultaneously from multiple
>   servers.
> - Accessing many dead links or other error pages on a web site,
>   causing site statistics (e.g. 500 errors / hour) to set off site alerts.
> - Accessing a site too fast.
> - Accessing a page that is denied by robots.txt.
> - Fetching pages without first fetching robots.txt.
> - Fetching many more pages for their crawl than the users that use
>   that crawl ever do.
> - Not refetching robots.txt frequently enough to pick up recent changes.
>
> Doing most of these things is what got grub a bad name the first
> time around. During development phases most of these problems can
> be avoided with good architecture and planning. Some thought and
> discussion should perhaps be given, at this point, to how to
> avoid being branded as bad netizens moving forward.
>
> Jason


