[Grub-dev] tiered client, other thoughts...

jer jeremie at jabber.org
Thu Jan 10 07:03:28 UTC 2008


> So as I play around with my currently less broken patched babygrub,  
> a few ideas are floating around that I wanted to share with the  
> list and get feedback.

Keep it coming :)

> 1) Have tiered client modes: The current implementation holds the  
> clients at arm's length by specifying they should not follow  
> redirects, or follow links. There are strong advantages to this  
> approach as it keeps the overall design simple.

Yes, the current workunit format is designed to be the absolute  
minimal, most simplistic base.  I figure, if we can make this work  
first, then we can explore more advanced concepts while the basics  
keep chugging along :)

> However, it might make sense to incorporate the notion that not all  
> clients are equal. For example, for someone with weaker hardware on  
> a slower connection, fetching 250 pages might be a heavier burden,

The 250 is just a default, it's easiest to pre-generate these for now  
but it can change if there's a better number.

> so instead there could be a "validation" tier of clients that  
> simply do HEAD requests and either validate current existence, or  
> report an error otherwise.

So, the reason the workunits include the headers is so that  
If-Modified-Since and ETags can be sent along (once we have them from  
the first fetch), which makes the request essentially a HEAD.
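
As a rough illustration of the conditional-request idea, here is a minimal Python sketch. The function names and the way the saved validators are passed in are hypothetical and not part of the actual workunit format; only the HTTP header names (If-None-Match, If-Modified-Since) and the 304 status are standard.

```python
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build request headers from validators saved on an earlier fetch.

    Hypothetical helper: both arguments are assumed to come from the
    response headers recorded the first time the URL was crawled.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_unchanged(status_code: int) -> bool:
    """304 Not Modified has no body, so it costs about as much as a HEAD."""
    return status_code == 304
```

With no saved validators the helper returns an empty dict, so the first crawl of a URL is an ordinary full GET.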

> There could be "super" clients that are in some way "authenticated"  
> or vetted by the grub server that take more of the processing  
> burden, such as parsing the page for outbound links...etc.

Totally thinking the same way here, in fact, both a good history of  
contributions and maybe even trust factors from the social network  
stuff can be combined to let users access more advanced workunits in  
the future.

Along these same lines, once there are more star-ratings of urls, the  
important ones can go to trusted users, and new/virgin or flagged  
users can still crawl but only new or lower ranked urls.   Tiered  
workunits based on the importance of the urls, matched to the best  
users.
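
To make the matching idea concrete, a minimal sketch in Python, assuming a star rating per url and a simple trusted/untrusted flag per user. The function name, the threshold of 4 stars, and the data layout are all made up for illustration; nothing like this exists in the server yet.

```python
from typing import Dict, List


def urls_for_user(url_stars: Dict[str, int],
                  user_trusted: bool,
                  min_stars: int = 4) -> List[str]:
    """Pick the urls a user may crawl (hypothetical policy).

    Trusted users get urls rated at or above min_stars; new or flagged
    users get everything below it, including unrated urls (0 stars).
    """
    if user_trusted:
        picked = [u for u, s in url_stars.items() if s >= min_stars]
    else:
        picked = [u for u, s in url_stars.items() if s < min_stars]
    return sorted(picked)  # sorted only to make the output deterministic
```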

> 2) Ability to report failure -- every time a request is made to the  
> dispatcher, it generates a new work-list. What if the client thread  
> is stopped for some reason, and the admin wants to explicitly re- 
> fetch the last list (I'm thinking of me in my debug mode right  
> now). Does this functionality exist? If not, it should.

That's not a big deal, it's supposed to be sloppy, work units can  
fail or get lost.  If the urls aren't getting crawled when they are  
supposed to they'll just get placed very quickly into another  
workunit (or maybe be in multiple already if they're important).

> 3) Significance of result order -- I have somewhat mixed feelings  
> on this. Having the hash of the URIs in order is cool, but I'm  
> wondering if it would be just as effective if they were out of  
> order, because it shouldn't make a difference if I crawl hosts a,b,  
> and c in that order or c,a,b.

At least with this workunit format, it's pretty important to have  
that simple ordered-result hash validation... it's the easiest way to  
make sure urls aren't getting injected, while keeping the format  
super simple and not requiring any hashing or crypto logic on the  
client.  Again, as above, once we get the basics working, we can work  
on more advanced models :)
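
For anyone curious what the server-side check could look like, here is a minimal Python sketch. The choice of SHA-1, the newline separator, and the function names are assumptions for illustration only, not a description of the real dispatcher code. The point is that the hash depends on the order of the urls, so reordered or injected urls fail the check.

```python
import hashlib
from typing import Iterable


def ordered_url_hash(urls: Iterable[str]) -> str:
    """Hash the urls in sequence; a different order gives a different digest."""
    h = hashlib.sha1()
    for url in urls:
        h.update(url.encode("utf-8"))
        h.update(b"\n")  # separator so "ab","c" differs from "a","bc"
    return h.hexdigest()


def results_match_workunit(expected_hash: str, result_urls: Iterable[str]) -> bool:
    """Server-side check: the returned urls must match the workunit exactly."""
    return ordered_url_hash(result_urls) == expected_hash
```

The client never computes anything here; it only has to return results in the same order it received the urls.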

> Also the individual URI is a nice unit to divide work in a multi- 
> threaded client, which would then place the burden of re-ordering  
> the results on the client side just so the hash can match...  I  
> don't know, maybe I just need to be convinced some more on this.

So, the client can still process them threaded, as long as it re- 
assembles them in the resulting arc... it'll be more work to pull it  
apart and reassemble in the right order so it might not be worth it.   
The best alternative, which I highly suggest, is just getting multiple  
workunits in parallel: fetching 10, 20, or even 50 at once should be  
fine :)
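
If a client does want to thread within a single workunit, one low-effort way to keep the results in the original order is sketched below in Python. The `fetch` callable and the `fetch_in_order` name are placeholders; a real client would plug in its own HTTP fetch and arc writer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fetch_in_order(urls: Sequence[str],
                   fetch: Callable[[str], bytes],
                   workers: int = 8) -> List[bytes]:
    """Fetch urls concurrently but return results in workunit order.

    Executor.map yields results in the order of its inputs, no matter
    which fetch finishes first, so no manual re-sorting is needed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

The same pattern works one level up: pass a list of workunits and a function that processes a whole workunit, and each worker handles one workunit at a time.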

Jer


