[Grub-dev] Wondering about Architecture, Bug-base, Discussions, ...

Bartek Jasicki thindil2 at gmail.com
Mon Oct 20 13:29:50 UTC 2008


2008-10-20, 00:20:02
Balinny <balinny at gmail.com> wrote:

> >> Any changes
> >> would be enhancements or refactoring, not bugs :)
> >>     
> >
> > Murphy's law: every program always has one more bug than you
> > know of ;) One bug, if I see it correctly: problems with HTTP headers
> > when the client works through a proxy (here is something similar):
> > http://www.seofaststart.com/blog/google-proxy-hacking
> >
> > You don't add the header
> > Pragma: no-cache (for HTTP 1.0) and
> >
> > Cache-Control: no-cache
> > Pragma: no-cache
> > for HTTP 1.1.
> >
> > Clients absolutely must not use a proxy cache for results.
> >   
> Why?
> If the proxy is standards compliant, it shouldn't matter, and if it's 
> not... it could be doing anything.
> 
> http://www.seofaststart.com/blog/google-proxy-hacking is interesting,
> but asking for a non-cached version wouldn't fix that. Even worse, due
> to its distributed design, the workarounds can't be applied to Grub.
> 
> 

Not so fast ;) A crawler is not like a browser - it must always get the
newest version of a page. That is how web crawlers work (for example,
Googlebot can visit the same page a few times a day). So, for this
reason alone, it is better if the client doesn't use the proxy cache.
In Grub, a proxy can be used only for anonymity, not for fetching pages
from its cache.
I gave this link to show one of the problems with proxy caches ;)
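
To make the header suggestion concrete, here is a minimal sketch of a
request that asks intermediaries for a fresh copy. It is not taken from
any Grub client; the function name build_fresh_request and the
User-Agent string are made up for illustration, and only the Python
standard library is used.

```python
import urllib.request

# Request headers asking caches along the way not to answer from a
# stored copy without revalidating with the origin server.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-cache",  # understood by HTTP/1.1 caches
    "Pragma": "no-cache",         # fallback for HTTP/1.0 caches
}


def build_fresh_request(url, user_agent="example-crawler/0.1"):
    """Return a Request object carrying the no-cache headers.

    The name and the default user agent are hypothetical.
    """
    headers = dict(NO_CACHE_HEADERS)
    headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = build_fresh_request("http://example.com/")
    # urllib stores header names in capitalized form, e.g. "Cache-control".
    for name, value in req.header_items():
        print(name + ": " + value)
```

A standards-compliant proxy should honour these headers; a broken one
may still ignore them, so this reduces the problem rather than
guaranteeing freshness.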

> > Please check documentation for clients:
> > http://grub.org/?q=/node/140
> >
> > It is probably the complete documentation of which options Grub
> > clients must have ;)
> >
> > Bartek
> >   
> You state there that clients must extract the urls and upload them as 
> sitemaps.
> I thought the server already auto-feedbacks, and the sitemap thing
> was just for people wanting to add urls.
> 
> 

Yes, at the moment the dispatch server can extract URLs from the pages
in the database. But remember, there is one small problem - this
operation (extracting URLs from a page) takes some of the server's time
and resources. So, if we have a distributed web crawler, why not do it
on the client instead of the server? It can make the crawling process
faster and cheaper.
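
As a rough illustration of doing the extraction on the client, the
sketch below pulls absolute http/https links out of a fetched page with
the Python standard library. The names LinkExtractor and extract_urls
are hypothetical, and how the resulting list is packaged and uploaded
to the server is left out because it depends on the client
documentation.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect unique absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                is_web = absolute.startswith(("http://", "https://"))
                if is_web and absolute not in self.links:
                    self.links.append(absolute)


def extract_urls(html, base_url):
    """Return the links found in html, in order of first appearance."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="news.html#top">News</a>'
    for url in extract_urls(page, "http://example.com/dir/index.html"):
        print(url)
```

Since each client already holds the page it just fetched, the parsing
cost is spread over all clients instead of landing on the dispatch
server.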

Bartek
-- 
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net

