[Grub-dev] grub server/clientt
bruce
bedouglas at earthlink.net
Sat Jan 10 15:41:13 UTC 2009
hi again bartek!
thanks for the replies on this one....
so, as i understand it:
-the full code for the grub server and client apps is open
source, and can be downloaded from the grub.org site
-the server/client architecture is such that the server basically
maintains a list of urls to fetch, and it replies to requests
from the clients, distributing the urls on a first-come, first-served
basis.
-the client app is a "dumb" app that fetches the url(s) that it
has been given by the server
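To make the first-come, first-served distribution described above concrete, here is a minimal sketch. It is not Grub's actual server code: the `UrlServer` class, its `request_work` method, and the example.org urls are all hypothetical names invented for illustration.

```python
from collections import deque


class UrlServer:
    """Hypothetical stand-in for the server: hands out urls in arrival order."""

    def __init__(self, urls):
        self._pending = deque(urls)  # urls not yet handed to any client
        self.assigned = {}           # url -> id of the client that received it

    def request_work(self, client_id, count=1):
        """Give the asking client up to `count` urls, first come, first served."""
        batch = []
        while self._pending and len(batch) < count:
            url = self._pending.popleft()
            self.assigned[url] = client_id
            batch.append(url)
        return batch


server = UrlServer([
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/c",
])
first = server.request_work("client-1", count=2)   # client-1 asks first
second = server.request_work("client-2", count=2)  # client-2 gets what is left
```

In this sketch the client that asks first gets the first urls in the list, and a later client simply receives whatever remains (possibly nothing).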
a few questions:
-does the server do any kind of quality assurance/checking on
the data returned from the client url fetch? is there any kind
of built-in redundancy for fetching the same url from multiple
clients, to make sure that the data/page content is valid?
-does the app permit multiple clients to be run on a given
client server at the same time (simultaneous clients running
on the same server)?
-does the server track the status/health of the overall
client servers/client apps for the network?
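The redundancy idea in the first question could look roughly like the sketch below: the same url is given to several clients and the server accepts the content only if enough of them agree. This is an assumption about how such a check might be built, not a description of what Grub does; `digest`, `validate`, and the `quorum` parameter are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint a fetched page so that results can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


def validate(results, quorum=2):
    """results maps client id -> fetched bytes for ONE url.

    Returns the content that at least `quorum` clients agree on, else None.
    """
    if not results:
        return None
    counts = Counter(digest(body) for body in results.values())
    winner, votes = counts.most_common(1)[0]
    if votes < quorum:
        return None
    for body in results.values():
        if digest(body) == winner:
            return body


agreed = validate({"c1": b"<html>ok</html>", "c2": b"<html>ok</html>", "c3": b"junk"})
disputed = validate({"c1": b"one version", "c2": b"another version"})
```

An exact hash comparison like this would reject pages whose content legitimately changes between fetches (timestamps, rotating ads), so a real check would probably need a looser comparison.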
in evaluating BOINC, it appears that BOINC doesn't easily permit multiple boinc client apps to run simultaneously, which means i'd have to craft a client that, in effect, would spawn child threads/processes on the local machine to perform the actual work. this would cause issues, as the fetch of some pages might complete early, and the app would essentially have to wait for the stragglers to complete... if i could have a client process that would continually go back to the server to fetch more work, based on the available system resources... then i could make full use of the client servers for this function....
my hope is that grub might handle this (or be able to be adapted to handle this) easily if i can't accomplish it with BOINC.
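A rough sketch of the pull-based client described above, where each worker asks for the next url as soon as it is free instead of waiting for a whole batch to finish. Everything here is an assumption for illustration: `run_client` and `fake_fetch` are hypothetical, a local `queue.Queue` stands in for requests to the server, and the worker count is simply taken from the cpu count.

```python
import os
import queue
import threading


def run_client(work_queue, fetch, workers=None):
    """Run `workers` threads; each pulls the next url as soon as it is free."""
    workers = workers or (os.cpu_count() or 1)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = work_queue.get_nowait()  # go back for more work right away
            except queue.Empty:
                return                         # nothing left to fetch
            body = fetch(url)
            with lock:                         # results is shared between threads
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def fake_fetch(url):
    """Placeholder for the real page fetch, so the sketch runs offline."""
    return "page for " + url


pending = queue.Queue()
for n in range(5):
    pending.put("http://example.org/%d" % n)
pages = run_client(pending, fake_fetch, workers=2)
```

Because a slow page only ties up the one worker that is fetching it, the other workers keep pulling urls and no one waits on the stragglers.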
thoughts/comments/etc..
-bruce
-----Original Message-----
From: grub-dev-bounces at wikia.com [mailto:grub-dev-bounces at wikia.com]On
Behalf Of Bartek Jasicki
Sent: Friday, January 09, 2009 11:51 AM
To: grub-dev at wikia.com
Subject: Re: [Grub-dev] grub server/clientt
On 2009-01-09, at 10:09:20,
"bruce" <bedouglas at earthlink.net> wrote:
> hi Bartek,
>
Hi Bruce
> thanks for the reply.
>
> ok. sounds like this might be useful. here's my situation. i have a
> group of sites that i want to parse, and i've developed small parsing
> scripts that parse the sites, and drill down, get behind the forms,
> use passwords, etc.. to get my information. each page of the parsing
> process, generates a separate page (which i need to parse to get the
> links to the next level of parsing...) my scripts currently handle
> this process.
>
> so i can essentially call my script, passing it the information
> needed to parse the next level, until i get to the final level that i
> care about.
>
> so if i understand what you posted below, i could use grub, starting
> with a list of initial urls that i want to parse. i then create my
> 'workunits' which are then used by the client app to fetch the url's
> page/text.
>
> in this case, a client would then fetch the text, and return it in the
> ".arc" file to the server. is this correct?
>
When the Grub client fetches a page, it gets only the page you point
it to (it doesn't follow any other links). So, if your URL list
contains the address http://grub.org, it crawls only that page, not
(for example) http://grub.org/?q=pl/user/2 or
http://grub.org/?q=pl/project/issues
This is the main difference between Grub and other web crawlers (like
Googlebot). IIRC, at this moment we have a database of URLs to crawl,
from which workunits are created. Only the C# client can fetch new
URLs from a crawled page (this option is in an early stage of
development).
So, if you already have all the URLs you need to crawl, you can use
our client and servers (after small modifications, of course). If you
want to discover new links during crawling, you will have much more
work to do (especially with the client).
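To illustrate what discovering new links during crawling would involve on the client side, here is a minimal sketch using only the Python standard library. It is not code from any Grub client; `LinkCollector` and `extract_links` are hypothetical names, and the sample HTML is made up from the addresses mentioned above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = (
    '<a href="/?q=pl/user/2">user</a> '
    '<a href="http://grub.org/?q=pl/project/issues">issues</a>'
)
found = extract_links("http://grub.org", sample)
```

The extracted links would then have to be sent back to the server and turned into new workunits, which is the part that needs the extra work.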
> for my needs, i'd like to be able to modify the client process for my
> needs. in my model, the client process would request the "url" or
> multiple "urls" from the server in the form of the 'workunit', and
> then the client process would call/invoke my python scripts on the
> client machine.
>
You can modify our code as you want - Grub is Free/Open Source
Software.
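One possible shape for such a modified client, as a sketch only: it takes the urls from a workunit and runs a separate Python script on each one. The workunit is assumed here to be a plain list of urls (the real format may differ), `process_workunit` is a hypothetical name, and the parser script is inlined with `-c` just so the example is self-contained.

```python
import subprocess
import sys

# Stand-in for a real parsing script; it only echoes the url it was given.
PARSER_CODE = "import sys; print('parsed ' + sys.argv[1])"


def process_workunit(urls):
    """Run the parser script once per url and collect what it prints."""
    outputs = {}
    for url in urls:
        done = subprocess.run(
            [sys.executable, "-c", PARSER_CODE, url],
            capture_output=True,  # keep the script's stdout/stderr
            text=True,            # decode output to str
            check=True,           # raise if the script exits with an error
        )
        outputs[url] = done.stdout.strip()
    return outputs


report = process_workunit(["http://example.org/a"])
```

In a real setup the command list would point at the actual script file instead of `-c`, and the collected output would be returned to the server.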
<spam>
Here I go ranting again ;) Balinny, Jeremie, please select licenses
for your code - at this moment the legal situation of the C client and
all the Perl scripts is IMO unclear: theoretically, your code is Open
Source, but under international law it looks like closed source code
(under the Berne Convention, only you hold full copyright to the code).
</spam>
> i envision running all of this using the amazon/google cloud service,
> so security isn't an issue...
>
> this is along the same lines as what i'm considering doing with
> BOINC.
>
> does grub have a python client on the client machine?
>
Yes, there is a console Python client:
http://grub.org/?q=/node/204
It can be a good base to start from ;)
> thanks
>
> -bruce
>
>
Regards
Bartek
--
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net
Jabber: thindil at jabberpl.org
_______________________________________________
Grub-dev mailing list
Grub-dev at wikia.com
http://lists.wikia.com/mailman/listinfo/grub-dev