[Grub-dev] LZMA support in upload server

Jeremie Miller jeremie at jabber.org
Wed Feb 25 22:06:21 UTC 2009


> Hmm, then why do I still get workunits in the old format? ;)

Do we want to add that to just the /dev flag'd requests?

>> I wasn't taking robots.txt into account. Although if it's not crawling,
>> there's not much need to check robots.txt.
>>
>
> The server must check robots.txt before adding URLs to a workunit, so
> for me this is part of the server.

http://svn.swlabs.org/grubng/trunk/perl/robo.pl
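
For illustration only, here is a minimal sketch of that kind of server-side
check, written in Python rather than the Perl of robo.pl and not derived from
it. The function name, the "grub-client" user agent string and the sample
robots.txt are all assumptions made up for the example; a real dispatcher
would fetch and cache one robots.txt per host.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(urls, robots_txt, user_agent="grub-client"):
    """Return only the URLs that robots_txt permits for user_agent.

    Hypothetical helper: assumes every URL belongs to the host that
    served robots_txt.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Made-up robots.txt and candidate URLs for the example.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

candidates = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
]

allowed = filter_allowed(candidates, ROBOTS)
print(allowed)
```

Only the URLs left in `allowed` would go into a workunit, so clients never
see a disallowed URL in the first place.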

>>> About URLs: AFAIK (or again something was changed without public
>>> announcement ;) Jeremie, please correct me if I'm wrong) Grub has its
>>> own database of URLs from which the workunit generator gets URLs to
>>> fetch.
>>>
>> So, there's a database with URLs. How to get to the database? There's
>> workunit.pl but it just reads URLs from stdin...
>
> AFAIK this part is missing from the repository. But remember, I'm not
> 100% sure how the dispatch server works.

They're derived from an initial dump of URLs taken last year from here:
	http://index.isc.org/download/index
And they're supposed to regularly include submissions from here (but
it's manual right now):
	http://dispatch.grub.org/maps/

>>> And the only way to get new URLs into the system is sitemaps
>>> generated by clients or sent by users (when creating workunits, Grub
>>> does not connect to Nutch).
>>>
>> How sad. Is it really not taking into account the page contents?
>
> Probably not. Jeremie?

It's only getting URLs from the sitemap submissions, which your client
creates automatically from page contents, so it sort of works...
Eventually, yes, a mapreduce job should be getting them straight from
the arcs, but right now we have more URLs than we're crawling, so
getting even more isn't that useful, yet :)

Jer


