[Grub-dev] grub server/client

Bartek Jasicki thindil2 at gmail.com
Fri Jan 9 19:50:57 UTC 2009


On 2009-01-09, at 10:09:20,
"bruce" <bedouglas at earthlink.net> wrote:

> hi Bartek,
> 

Hi Bruce

> thanks for the reply.
> 
> ok. sounds like this might be useful. here's my situation. i have a
> group of sites that i want to parse, and i've developed small parsing
> scripts that parse the sites, and drill down, get behind the forms,
> use passwords, etc.. to get my information. each page of the parsing
> process, generates a separate page (which i need to parse to get the
> links to the next level of parsing...) my scripts currently handle
> this process.
> 
> so i can essentially call my script, passing it the information
> needed to parse the next level, until i get to the final level that i
> care about.
> 
> so if i understand what you posted below, i could use grub, starting
> with a list of initial urls that i want to parse. i then create my
> 'workunits' which are then used by the client app to fetch the url's
> page/text.
> 
> in this case, a client would then fetch the text, and return it in the
> ".arc" file to the server. is this correct?
> 

When the Grub client fetches a page, it gets only the page you point it
at (it doesn't follow any links). So, if your URL list contains
http://grub.org, the client will not also crawl (for example)
http://grub.org/?q=pl/user/2 or http://grub.org/?q=pl/project/issues -
it crawls only the listed address. This is the main difference between
Grub and other web crawlers (like Googlebot). IIRC, at the moment we
have a database of URLs to crawl, from which workunits are created.
Only the C# client can extract new URLs from a crawled page (this
option is at an early stage of development).

So, if you already have all the URLs you need to crawl, you can use our
client and servers (after small modifications, of course). If you want
to discover new links during crawling, you will have to do much more
work (especially on the client).
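
To illustrate the "fixed URL list, no link following" model described
above, here is a minimal Python sketch. It is only an illustration: the
names make_workunits, fetch_with_urllib and crawl_workunit are
hypothetical, and treating a workunit as a plain list of URLs is an
assumption, not Grub's real workunit format or client code.

```python
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen


def make_workunits(urls: Iterable[str], size: int) -> List[List[str]]:
    """Split a flat URL list into fixed-size batches (assumed 'workunits')."""
    if size < 1:
        raise ValueError("size must be at least 1")
    url_list = list(urls)
    return [url_list[i:i + size] for i in range(0, len(url_list), size)]


def fetch_with_urllib(url: str) -> bytes:
    """Download one page body; links inside it are never inspected."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def crawl_workunit(
    workunit: List[str],
    fetch: Callable[[str], bytes] = fetch_with_urllib,
) -> Dict[str, bytes]:
    """Fetch exactly the URLs listed in the workunit and nothing else."""
    return {url: fetch(url) for url in workunit}
```

Because crawl_workunit never parses the downloaded bodies, a page that
links elsewhere does not cause any extra requests. Discovering new links
would need an additional parsing step feeding URLs back to the server.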

> for my needs, i'd like to be able to modify the client process for my
> needs. in my model, the client process would request the "url" or
> multiple "urls" from the server in the form of the 'workunit', and
> then the client process would call/invoke my python scripts on the
> client machine.
> 

You can modify our code as you like - Grub is Free/Open Source
Software.
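
As a rough illustration of the client model you describe (take the URLs
from a workunit, then hand each one to a local script), a minimal
Python sketch could look like the following. The names run_parser and
process_workunit are hypothetical, and passing the URL as the last
command-line argument of your script is an assumption about how your
scripts are invoked; none of this is taken from the existing Grub
clients.

```python
import subprocess
from typing import Dict, List, Sequence


def run_parser(command: Sequence[str], url: str, timeout: float = 60.0) -> str:
    """Run an external parser with the URL as its last argument; return stdout."""
    completed = subprocess.run(
        [*command, url],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output to str
        timeout=timeout,      # give up on a parser that hangs
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


def process_workunit(workunit: List[str], command: Sequence[str]) -> Dict[str, str]:
    """Invoke the parser once per URL and map each URL to the parser's output."""
    return {url: run_parser(command, url) for url in workunit}
```

For example, process_workunit(urls, ["python", "my_parser.py"]) would
run a (hypothetical) my_parser.py once for every URL in the list.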

<spam>
Here I start ranting again ;) Balinny, Jeremie, please select licenses
for your code - at the moment the legal situation of the C client and
all the Perl scripts is IMO unclear: theoretically your code is Open
Source, but under international law it looks like closed source code
(under the Berne Convention, only you hold full copyright to the code).
</spam>

> i envision running all of this using the amazon/google cloud service,
> so security isn't an issue...
> 
> this is along the same process that i'm considering using the BOINC
> process.
> 
> does grub have a python client on the client machine?
> 

Yes, there is a console Python client:

http://grub.org/?q=/node/204

It can be a good base to start from ;)

> thanks
> 
> -bruce
> 
> 

Regards

Bartek

-- 
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net
Jabber: thindil at jabberpl.org

