[Grub-dev] Why grub is blacklisted so widely.
Bartek Jasicki
thindil2 at gmail.com
Fri Feb 1 12:08:20 UTC 2008
> All of our urls and the meta-data (last crawled, etag, last modified,
> etc) will be stored (sorted) in hBase. When a process walks that
> list extracting urls that need to be updated/crawled and placed into
> a workunit, it fetches the robots.txt and applies all the rules
> against the urls for that host, with the optimal delay between that
> check and the workunit getting picked up being short, hours.
>
My English is a little rusty, so I'm not sure I understand correctly ;)
So the client only fetches pages and doesn't look at robots.txt? And the
server checks robots.txt before adding URLs to a workunit?
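
If that reading is right, the server-side step might look roughly like
the sketch below. This is only an illustration of the idea, not Grub's
actual code: the function name build_workunit, the user agent string
"grub-client", and the sample robots.txt rules are all assumptions made
up for the example. It uses Python's standard urllib.robotparser and
parses the rules from a string so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser


def build_workunit(urls, robots_txt, user_agent="grub-client"):
    """Return only the URLs that robots_txt allows user_agent to fetch.

    Hypothetical helper: the server applies the host's robots.txt rules
    before the URLs are handed to a client in a workunit.
    """
    parser = RobotFileParser()
    # Parse rules from text; a real server would fetch them per host.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Sample rules and URLs (assumed, for illustration only).
robots_txt = """User-agent: *
Disallow: /private/
"""
urls = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
]

workunit = build_workunit(urls, robots_txt)
print(workunit)
```

In this arrangement the client never needs to read robots.txt itself; it
only fetches what the server has already filtered.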
Bartek