[Grub-dev] grub server/client

bruce bedouglas at earthlink.net
Sat Jan 10 15:41:13 UTC 2009


hi again bartek!

thanks for the replies on this one....

so, as i understand:

-the full code for the server and client apps for grub is open
 source, and can be downloaded from the grub.org site
-the server/client architecture is such that the server basically 
 maintains a list of urls to fetch, and it replies to requests 
 from the clients, distributing the urls on a 1st come, 1st served 
 basis.
-the client app is a "dumb" app that fetches the url(s) that it
 has received from the server
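
A minimal sketch of the first come, first served distribution described above. The `UrlServer` class, its method names, and the example.com urls are hypothetical and are not taken from the grub code; the sketch only illustrates the idea of a server that hands out pending urls in arrival order.

```python
from collections import deque


class UrlServer:
    """Hypothetical stand-in for the server: hands out urls in arrival order."""

    def __init__(self, urls):
        self.pending = deque(urls)  # urls not yet handed to any client
        self.assigned = {}          # url -> id of the client that received it

    def request_work(self, client_id, count=1):
        """Give the asking client up to `count` urls, first come, first served."""
        batch = []
        while self.pending and len(batch) < count:
            url = self.pending.popleft()
            self.assigned[url] = client_id
            batch.append(url)
        return batch


server = UrlServer([
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])
first = server.request_work("client-1", count=2)   # asks first, served first
second = server.request_work("client-2", count=2)  # gets whatever is left
```

Recording which client received which url is an assumption added here; it is the kind of bookkeeping a server would need before it could track client status.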

a few questions:
-does the server do any kind of quality assurance/checking on
 the data returned from the client url fetch? is there any kind
 of built in redundancy for fetching the same url from multiple
 clients to assure that the data/page content is valid?
-does the app permit multiple clients to be run on a given
 client server at the same time (simultaneous clients running
 on the same server)?
-does the server track the status/health of the overall
 client servers/client apps for the network?
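
To make the redundancy question concrete, here is a sketch of one possible check: hand the same url to several clients and accept the content only if enough of them return identical bytes. This is an assumption about how such a check could work, not a description of what grub does; `validate` and `quorum` are hypothetical names.

```python
import hashlib
from collections import Counter


def content_digest(body: bytes) -> str:
    """Fingerprint a fetched page so results can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def validate(results, quorum=2):
    """results maps client id -> fetched bytes for ONE url.

    Returns the body that at least `quorum` clients agree on, else None.
    """
    if not results:
        return None
    digests = {cid: content_digest(body) for cid, body in results.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None
    for cid, digest in digests.items():
        if digest == winner:
            return results[cid]


agreed = validate({
    "client-1": b"<html>ok</html>",
    "client-2": b"<html>ok</html>",
    "client-3": b"garbage",
})
disputed = validate({"client-1": b"x", "client-2": b"y"})
```

Exact byte comparison only works for static pages; pages that embed timestamps or session ids would differ between fetches and need a looser comparison.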

in evaluating BOINC, it appears that BOINC doesn't easily permit
multiple boinc client apps to be run simultaneously, which means i'd
have to craft a client that in effect would spawn off child
threads/processes on the local machine to perform the actual work.
this would cause issues, as the fetch of some pages might complete
early, and the app would essentially have to wait for the stragglers
to complete... if i could have a client process that would continually
go back to the server to fetch data, based on the available system
resources... then i could maximize the use of the client servers for
this function....
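
A rough sketch of the "keep going back for more work" idea: each worker thread pulls the next url as soon as it finishes the previous one, so a slow page only holds up its own worker. The local queue stands in for requests to the server, and `run_workers` and `fake_fetch` are hypothetical names used only for illustration.

```python
import queue
import threading


def run_workers(urls, fetch, num_workers=4):
    """Fetch every url with a pool of workers that pull work continuously."""
    work = queue.Queue()
    for url in urls:
        work.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # In a real client this would be a request to the server.
                url = work.get_nowait()
            except queue.Empty:
                return  # no work left, this worker exits
            body = fetch(url)
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    """Stand-in for a real page fetch so the sketch runs without a network."""
    return "content of " + url


pages = run_workers(["u1", "u2", "u3", "u4", "u5"], fake_fetch, num_workers=2)
```

The number of workers is fixed here; sizing it from the available system resources is left out.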

my hope is that grub might handle this (or be able to be adapted to handle this) easily if i can't accomplish it with BOINC.

thoughts/comments/etc..

-bruce


-----Original Message-----
From: grub-dev-bounces at wikia.com [mailto:grub-dev-bounces at wikia.com]On
Behalf Of Bartek Jasicki
Sent: Friday, January 09, 2009 11:51 AM
To: grub-dev at wikia.com
Subject: Re: [Grub-dev] grub server/client


On 2009-01-09, at 10:09:20
"bruce" <bedouglas at earthlink.net> wrote:

> hi Bartek,
> 

Hi Bruce

> thanks for the reply.
> 
> ok. sounds like this might be useful. here's my situation. i have a
> group of sites that i want to parse, and i've developed small parsing
> scripts that parse the sites, and drill down, get behind the forms,
> use passwords, etc.. to get my information. each page of the parsing
> process, generates a separate page (which i need to parse to get the
> links to the next level of parsing...) my scripts currently handle
> this process.
> 
> so i can essentially call my script, passing it the information
> needed to parse the next level, until i get to the final level that i
> care about.
> 
> so if i understand what you posted below, i could use grub, starting
> with a list of initial urls that i want to parse. i then create my
> 'workunits' which are then used by the client app to fetch the url's
> page/text.
> 
> in this case, a client would then fetch the text, and return it in the
> ".arc" file to the server. is this correct?
> 

When the Grub client fetches a page, it gets only the page you point
it to (it doesn't follow any other links). So, if your url list
contains the address http://grub.org, it crawls only that page, not
(for example) http://grub.org/?q=pl/user/2 or
http://grub.org/?q=pl/project/issues
This is the main difference between Grub and other web crawlers (like
Googlebot). IIRC, at this moment we have a database of urls to crawl,
from which the workunits are created. Only the C# client can fetch new
urls from a crawled page (this option is at an early stage of
development).

So, if you already have all the urls you need to crawl, you can use our
client and servers (after small modifications, of course). If you want
to discover new links during crawling, you will have to do much more
work (especially on the client).
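
As an illustration of what discovering new links involves, here is a minimal sketch that pulls the `href` targets out of a fetched page using only the Python standard library. It is not code from any Grub client; `LinkCollector` and `discover_links` are hypothetical names, and the sample HTML is made up around the example addresses above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute urls from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<a href="/?q=pl/user/2">user</a> '
    '<a href="http://grub.org/?q=pl/project/issues">issues</a>'
)
found = discover_links("http://grub.org", sample)
```

The discovered urls would still have to be sent back to the server, deduplicated, and turned into new workunits, which is where most of the extra work lies.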

> for my needs, i'd like to be able to modify the client process for my
> needs. in my model, the client process would request the "url" or
> multiple "urls" from the server in the form of the 'workunit', and
> then the client process would call/invoke my python scripts on the
> client machine.
> 

You can modify our code as you want - Grub is Free/Open Source
Software.
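
A sketch of the modification discussed above, assuming a workunit is simply a list of urls and that the parsing script takes the url as its last command-line argument. Both are assumptions made for illustration; `run_parser`, `process_workunit`, and the echo command are hypothetical and stand in for the real client and scripts.

```python
import subprocess
import sys


def run_parser(command, url, timeout=60):
    """Run an external parsing command with the url as its last argument."""
    completed = subprocess.run(
        command + [url],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


def process_workunit(urls, command):
    """Invoke the parser once per url and collect (exit code, output)."""
    return {url: run_parser(command, url) for url in urls}


# Stand-in for a real parsing script: a one-line program that echoes its argument.
echo_parser = [sys.executable, "-c", "import sys; print('parsed', sys.argv[1])"]
outcome = process_workunit(["http://example.com/a"], echo_parser)
```

Running each script as a separate process keeps a crash or hang in one parser from taking down the client, at the cost of process start-up time per url.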

<spam>
Here I go ranting again ;) Balinny, Jeremie, please select licenses for
your code - at this moment the legal situation of the C client and all
the Perl scripts is IMO unclear: in theory your code is Open Source, but
under international law it looks like closed source code (under the
Berne Convention, only you hold the full copyright to the code).
</spam>

> i envision running all of this using the amazon/google cloud service,
> so security isn't an issue...
> 
> this is along the same process that i'm considering using the BOINC
> process.
> 
> does grub have a python client on the client machine?
> 

Yes, there is a console Python client:

http://grub.org/?q=/node/204

It can be a good base to start from ;)

> thanks
> 
> -bruce
> 
> 

Regards

Bartek

-- 
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net
Jabber: thindil at jabberpl.org
