[Grub-dev] LZMA support in upload server
Jeremie Miller
jeremie at jabber.org
Wed Feb 25 22:06:21 UTC 2009
> Hmm, then why do I still get workunits in the old format? ;)
Do we want to add that to just the /dev flag'd requests?
>> I wasn't taking robots.txt into account. Although if it's not crawling,
>> there's not much need to check robots.txt.
>>
>
> The server must check robots.txt before adding URLs to a workunit, so
> for me this is part of the server.
http://svn.swlabs.org/grubng/trunk/perl/robo.pl
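The server-side check discussed above (drop URLs that robots.txt forbids before they go into a workunit) could be sketched roughly as follows. This is not what robo.pl does; it is a minimal Python illustration using the standard library's robots.txt parser. The function name `allowed_urls`, the `robots_by_host` mapping (host to already-fetched robots.txt text), and the agent string `grub-sketch` are all hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_by_host, agent="grub-sketch"):
    """Keep only URLs that the host's robots.txt lets `agent` fetch.

    `robots_by_host` maps a host name to the text of its robots.txt
    (assumed to be fetched elsewhere). A host with no entry is treated
    as having an empty robots.txt, i.e. everything is allowed.
    """
    parsers = {}  # one parsed robots.txt per host
    kept = []
    for url in urls:
        host = urlsplit(url).netloc
        if host not in parsers:
            parser = RobotFileParser()
            parser.parse(robots_by_host.get(host, "").splitlines())
            parsers[host] = parser
        if parsers[host].can_fetch(agent, url):
            kept.append(url)
    return kept


if __name__ == "__main__":
    robots = {"example.com": "User-agent: *\nDisallow: /private/\n"}
    candidates = [
        "http://example.com/index.html",
        "http://example.com/private/x",
        "http://other.org/a",
    ]
    for url in allowed_urls(candidates, robots):
        print(url)
```

A workunit generator could run its candidate URLs through a filter like this before writing them out, so clients never receive disallowed URLs.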
>>> About URLs: AFAIK (unless something was changed again without a public
>>> announcement ;) Jeremie, please correct me if I'm wrong) Grub has its own
>>> database of URLs, from which the workunit generator gets URLs to fetch.
>>>
>> So, there's a database with URLs. How do I get to the database? There's
>> workunit.pl, but it just reads URLs from stdin...
>
> AFAIK this part is missing from the repository. But remember, I'm not 100%
> sure how the dispatch server works.
They're derived from an initial dump of URLs taken last year from here:
http://index.isc.org/download/index
They're also supposed to regularly include submissions from here (but
that's manual right now):
http://dispatch.grub.org/maps/
>>> And the only way to get new URLs into the system is sitemaps generated by
>>> clients or sent by users (when creating workunits, Grub does not
>>> connect to Nutch).
>>>
>> How sad. Is it really not taking into account the page contents?
>
> Probably not. Jeremie?
It's only getting URLs from the sitemap submissions, and since your client
automatically creates new ones from page contents, it sort of works.
Eventually, yes, a mapreduce job should be getting them straight from the
arcs, but right now we have more URLs than we can crawl, so getting even
more isn't that useful, yet :)
Jer
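
The client-side step mentioned above (building new sitemap entries from page contents) might look roughly like the sketch below. It is only an illustration under assumptions, not the actual Grub client code: `LinkCollector` and `extract_links` are hypothetical names, and the sketch just gathers absolute http(s) links from anchor tags without writing any sitemap file.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # kept in first-seen order, without duplicates

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if (absolute.startswith(("http://", "https://"))
                        and absolute not in self.links):
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated list of links found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = '<a href="/a">A</a> <a href="b.html#top">B</a>'
    for link in extract_links("http://example.com/dir/page.html", page):
        print(link)
```

The resulting list is what a client could then submit as a sitemap, which is how newly discovered URLs would reach the dispatch server in the flow described above.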