[Grub-dev] grub server/client
Bartek Jasicki
thindil2 at gmail.com
Sun Jan 11 11:11:16 UTC 2009
On 2009-01-10, at 14:31:54
"bruce" <bedouglas at earthlink.net> wrote:
> hi bartek!
>
Hi Bruce
> a possible solution would be to have a separate client (mgr) app, that
> managed the spawning of local client apps, that would interact with
> the server to fetch the pages...
>
Something similar to the BOINC client? ;) It can work with the C or Perl
clients, which run one crawler per process. But the C# and Python
clients run several crawlers simultaneously in one process (both are
multithreaded applications), so this "manager" is simply built into
the client ;) It all depends on the client design.
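
To illustrate what "manager built into the client" can mean, here is a
minimal sketch in Python. It is not code from any Grub client: the
names crawl and run_crawlers are hypothetical, and crawl is a stand-in
that does no real network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(url):
    # Hypothetical stand-in for a real crawler: a real client would
    # download and parse the page here.
    return url, len(url)


def run_crawlers(urls, max_crawlers=4):
    # The thread pool plays the role of the "manager": it caps how many
    # crawlers run at once inside this single process.
    with ThreadPoolExecutor(max_workers=max_crawlers) as pool:
        return dict(pool.map(crawl, urls))


if __name__ == "__main__":
    work_unit = ["http://example.com/a", "http://example.com/b"]
    print(run_crawlers(work_unit))
```

Raising or lowering max_crawlers is then the only knob needed to trade
throughput against CPU and memory use.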
> possible architecture:
>
> master server (grub server)
> -allow it to work as it does
> -doles out urls to client requests as required
> -tracks which client machine has which url
> -accepts updates from the client apps running on
> client servers
> -tracks which client apps are uploading data, from
> which client servers
> -etc...
>
The current Grub servers already have all these options ;)
> client server/app (grub client)
> -allow client to accept url/multiple urls from master grub server
> -spawn off child fetch_client to fetch/parse the url from the
> grub_client
> -client spawns as many fetch_client apps as the system resources
> allow
> -client spawns a client_server_health process as well, to
> track/report health of the client server/client processes
> -etc..
>
As I wrote above, a few clients already have these options ;)
> client_server_health app/process
> -new app to ping various system processes/services to
> determine health/status of the client server
> -app monitors the process tbl for the current status of the
> spawned fetch_client processes
> -figure out a method/process to determine the health/status
> of a fetch_client process.
> (possibly have the client write to a file
> -fetch_name, parse_level, time, status
> would allow for determining if an app is running,
> dead, stopped, etc..)
> -etc...
>
IMO (which can be wrong, of course) this can go into the category above
(grub client). Separate processes have more disadvantages: they are
slower, communication between them is more difficult, and they are
harder to maintain.
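
As a rough sketch of how the health tracking could live inside the
client instead of a separate process, the status fields proposed above
(fetch_name, parse_level, time, status) can be kept in a shared table.
The class CrawlerStatus and the "running" status value are assumptions
made for illustration, not part of any existing Grub client.

```python
import threading
import time


class CrawlerStatus:
    """Thread-safe status table kept inside the client process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = {}

    def update(self, fetch_name, parse_level, status):
        # Record the same fields that were proposed for the status file.
        with self._lock:
            self._table[fetch_name] = {
                "parse_level": parse_level,
                "time": time.time(),
                "status": status,
            }

    def stalled(self, max_age, now=None):
        # A crawler counts as stalled if it is still "running" but has
        # not reported for more than max_age seconds.
        now = time.time() if now is None else now
        with self._lock:
            return sorted(
                name
                for name, entry in self._table.items()
                if entry["status"] == "running"
                and now - entry["time"] > max_age
            )
```

Each crawler thread calls update() as it progresses, and the client can
poll stalled() to decide which crawlers to restart, with no files or
inter-process communication involved.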
>
> fetch_client app/process
> -new app used to actually fetch the url passed in the args
> -spawned from the grub client
> -performs the fetching of the actual url, and the parsing of
> data for the given page
> -creates result file of the fetched data
> -different app, based on college, and level of the college
> -app returns the resulting file to the master server
> -etc...
>
Yes, this is what we call the Grub client ;)
>
> thoughts/comments...
>
> -bruce
>
I don't want to be rude, so if you feel offended I apologize, but I
suggest that you download a few Grub clients and check how they work.
All clients show information about work progress, the C# and Python
clients can run several crawlers simultaneously, and in the C# client
you can manually set network usage (CPU and memory usage can be
controlled through the number of running crawlers). So in general Grub
probably already has what you want; it is missing only a few details
(like crawling, in the same cycle, the links found in fetched pages,
but this can be dangerous: a few times I saw GoogleBot fall into an
infinite loop, using 300 MB of bandwidth on a page that was 1 kB).
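
If links from fetched pages were followed in the same cycle, one
possible guard against such loops is sketched below. This is only an
assumption about how it could be done, not something Grub implements:
bounded_crawl, max_depth, and max_bytes are hypothetical names, and
fetch is a caller-supplied function returning the page body and its
links.

```python
from collections import deque


def bounded_crawl(start_url, fetch, max_depth=2, max_bytes=1_000_000):
    # fetch(url) is a hypothetical callable returning (body_bytes, links).
    seen = {start_url}
    queue = deque([(start_url, 0)])
    used = 0
    fetched = []
    while queue:
        url, depth = queue.popleft()
        body, links = fetch(url)
        used += len(body)
        fetched.append(url)
        if used >= max_bytes:
            break  # bandwidth budget spent, stop the cycle
        if depth == max_depth:
            continue  # do not follow links any deeper
        for link in links:
            if link not in seen:  # never fetch the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched
```

The visited set stops a page that links to itself, the depth limit
stops endlessly generated URLs, and the byte budget caps the damage if
both of the other checks are fooled.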
Bartek
--
Grub Next Generation: http://grub.org
Mailing List: grub-dev at wikia.com
IRC: #wikia-search at irc.freenode.net
Jabber: thindil at jabberpl.org