[Grub-dev] Why grub is blacklisted so widely.

Bartek Jasicki thindil2 at gmail.com
Fri Feb 1 12:08:20 UTC 2008


> All of our urls and the meta-data (last crawled, etag, last modified,  
> etc) will be stored (sorted) in hBase.  When a process walks that  
> list extracting urls that need to be updated/crawled and placed into  
> a workunit, it fetches the robots.txt and applies all the rules  
> against the urls for that host, with the optimal delay between that  
> check and the workunit getting picked up being short, hours.

My English is a little rusty, so I'm not sure I understand this correctly ;)
Does the client only fetch pages, without looking at robots.txt? And does
the server check robots.txt before adding URLs to a workunit?
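
To make the question concrete, here is a rough sketch of the server-side
step as I understand it from the quoted description. It is only my guess,
not the actual Grub code: the function name, the "grub-client" user agent
string and the robots_by_host dictionary (standing in for the server's
robots.txt fetch) are all made up for illustration. It uses Python's
standard urllib.robotparser.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def filter_urls_for_workunit(urls, robots_by_host, user_agent="grub-client"):
    """Return only the URLs that each host's robots.txt allows.

    robots_by_host maps a host name to the text of its robots.txt.
    It is a hypothetical stand-in for the server fetching the file.
    """
    parsers = {}  # one parsed robots.txt per host, built on first use
    allowed = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in parsers:
            parser = RobotFileParser()
            # A host without an entry is treated as having an empty
            # robots.txt, which allows everything.
            parser.parse(robots_by_host.get(host, "").splitlines())
            parsers[host] = parser
        if parsers[host].can_fetch(user_agent, url):
            allowed.append(url)
    return allowed


robots_by_host = {
    "example.com": "User-agent: *\nDisallow: /private/\n",
}
urls = [
    "http://example.com/index.html",
    "http://example.com/private/a.html",
    "http://other.example/x",
]
workunit_urls = filter_urls_for_workunit(urls, robots_by_host)
print(workunit_urls)
```

If that is right, the client would only ever receive URLs that have
already passed the robots.txt check, and would not need to read
robots.txt itself.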

Bartek


