[Grub-dev] Newbie Needs Some Pointers

jer jeremie at jabber.org
Thu Jan 17 08:16:06 UTC 2008


The wiki page I just sent links to the definition of the "Internet
Archive" version of the ARC format we are using:
http://www.archive.org/web/researcher/ArcFileFormat.php

It's very minimal and easy to produce. Here's an example:

filedesc://dummy.arc.gz 0.0.0.0 20071005122244 text/plain 69
1 0 grub.org
URL IP-address Archive-date Content-type Archive-length

http://http://www.biboi-khalas.sulinet.hu200 127.0.0.1 19691231175959 message/http 245
500 Internal Server Error

<html>
....

After that, the file is just a repetition of the same pattern: a URL
line ending in the length, followed by the full HTTP result as the
body, one record after another.
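
To make the layout above concrete, here is a minimal Python sketch of
how a client might assemble such a file. It is an illustration under
assumptions, not the project's actual code: the function names
(arc_file_header, arc_record, build_arc) are made up for this example,
the single blank line after each record is inferred from the sample
above, and the final compression to .arc.gz is not shown.

```python
FIELD_NAMES = "URL IP-address Archive-date Content-type Archive-length"


def arc_file_header(filename, origin, date, ip="0.0.0.0"):
    """Build the leading filedesc record that opens the ARC file."""
    # Body of the first record: a version line plus the field-name line.
    body = f"1 0 {origin}\n{FIELD_NAMES}\n".encode("utf-8")
    # The last field of the URL line is the body length in bytes.
    url_line = f"filedesc://{filename} {ip} {date} text/plain {len(body)}\n"
    return url_line.encode("utf-8") + body + b"\n"


def arc_record(url, ip, date, http_result):
    """Build one record: a URL line, then the HTTP result bytes verbatim."""
    url_line = f"{url} {ip} {date} message/http {len(http_result)}\n"
    # Assumption: one blank line separates records, as in the sample.
    return url_line.encode("utf-8") + http_result + b"\n"


def build_arc(filename, origin, date, results):
    """Concatenate the file header and one record per fetched URL, in order."""
    parts = [arc_file_header(filename, origin, date)]
    for url, ip, fetched, http_result in results:
        parts.append(arc_record(url, ip, fetched, http_result))
    return b"".join(parts)


if __name__ == "__main__":
    # Hypothetical fetch result, used only to show the output layout.
    demo = build_arc(
        "dummy.arc.gz", "grub.org", "20071005122244",
        [("http://example.com/", "127.0.0.1", "20080117081606",
          b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>")],
    )
    print(demo.decode("utf-8"))
```

With the origin "grub.org", the computed length of the first record's
body is 69 bytes, which matches the filedesc line in the example.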

Jer

On Jan 17, 2008, at 2:07 AM, Mir Tanvir Hossain wrote:

> Hello, I am kind of confused about the arc file. So the client is
> gonna crawl each of the given urls, and append the resulting http
> header as well as any html to a single file, right?
>
> Then the program will compress the file using ARC format and upload it
> to the server. Right?
>
> Tanvir
>
>
> On Wed, 2008-01-16 at 23:30 -0800, Yousef Ourabi wrote:
>> 1) No. Client makes a simple HTTP get request with no parameters to
>> http://dispatch.grub.swlabs.org/do/workunit -- the server decides the
>> number of urls the client should fetch.
>>
>> 2) Server will give a list, which is currently 250 urls -- but this
>> may change.
>>
>> 3) Yes
>>
>> 4) Yes
>>
>> 5) Not exactly. The client writes both the http headers and the
>> response body (the html). They must be in the same order that the
>> server gave them in the original work list.
>>
>> 6,7) Client doesn't change any html, it writes them to the ARC file
>> exactly. But basic idea is correct.
>>
>>
>> Hope this helps. Feel free to continue asking questions here or on
>> the IRC channel #searchwikia
>>
>> Good luck.
>> Yousef
>>
>>
>> On 1/16/08, Mir Tanvir Hossain <mir.tanvir.hossain at gmail.com> wrote:
>>         Hello Yousef,
>>
>>         Lemme briefly write here what I have understood so far.
>>
>>         1. Client will request a list of 250 urls from server.
>>         2. Server will give 250 urls and a PUT url to upload back the
>>         ARC.
>>         3. With the urls in hand, client will start crawling those 250
>>         urls.
>>         4. Client will not follow any redirects.
>>         5. Client will dump all the html and check it with a known
>>         hash for any change.
>>         6. The client will make an ARC file with all the changed html
>>         pages.
>>         7. It will upload the ARC back to the server with changed html
>>         pages.
>>
>>         Am I correct? Please tell me if I am wrong and correct me.
>>
>>         Thanks again for your time.
>>
>>         Tanvir
>>
>>
>>
>>         On Wed, 2008-01-16 at 22:43 -0800, Yousef Ourabi wrote:
>>> Tanvir,
>>>
>>> The "documentation" is all in the mailing list. There is nothing more
>>> formal. Here is a brief description:
>>>
>>> client makes http get request to url
>>> server returns list of 250 urls to fetch, with user-agent the last
>>> line is an HTTP put where the client should upload the resulting arc
>>> file
>>>
>>> Clients do not follow redirects ie http 301,302,307...
>>> Clients do not parse outbound links
>>> Clients report http headers verbatim, including errors
>>>
>>>
>>> To learn about the arc format read this:
>>>
>>> http://lists.wikia.com/pipermail/grub-dev/2008-January/000079.html
>>>
>>> Read all other emails here:
>>>
>>> http://lists.wikia.com/pipermail/grub-dev/2008-January/thread.html
>>>
>>> Ask many questions!
>>>
>>> Thanks,
>>> Yousef
>>>
>>> On 1/16/08, Mir Hossain <mir.tanvir.hossain at gmail.com> wrote:
>>>         Hello Yousef, Thanks for your prompt reply. I will try the
>>>         perl version right now. I know C#. Maybe I will try to
>>>         implement the code in C#. Before that, I need to know how the
>>>         protocol works. Is there any documentation about the
>>>         protocol? Please let me know.
>>>
>>>         Thanks
>>>         Tanvir
>>>
>>>
>>>         On Jan 16, 2008 10:06 PM, Yousef Ourabi
>>>         <yourabi at zero-analog.com> wrote:
>>>                 Tanvir,
>>>                 The new SVN repository is http://svn.swlabs.org/grubng
>>>
>>>                 We are currently re-writing the code to work with the
>>>                 new RESTful API Jer (Jeremie) is implementing -- so
>>>                 both the client and the server code is a moving
>>>                 target.
>>>
>>>                 The *most* developed client is currently the perl
>>>                 client http://svn.swlabs.org/grubng/trunk/perl -- but
>>>                 many others are working on other language
>>>                 implementations of the same protocol -- Balinny is
>>>                 working on a C implementation...etc
>>>
>>>                 If you are interested in learning a new language it
>>>                 might not be a bad idea to start a new language
>>>                 implementation of the protocol?
>>>
>>>                 -Yousef
>>>
>>>
>>>                 On 1/16/08, Mir Tanvir Hossain
>>>                 <mir.tanvir.hossain at gmail.com> wrote:
>>>
>>>                         Hello everybody, I joined the mailing list
>>>                         a couple of weeks ago and have been reading
>>>                         the mails regularly, but I am not
>>>                         understanding that much. I am a Computer
>>>                         Science student and would like to
>>>                         contribute some code to the project.
>>>                         However, I am not sure where to begin.
>>>                         Could anybody please give some pointers on
>>>                         where I can start?
>>>
>>>                         Sincerely
>>>
>>>                         Tanvir
>>>
>>>
>>>
>>>                         _______________________________________________
>>>                         Grub-dev mailing list
>>>                         Grub-dev at wikia.com
>>>                         http://lists.wikia.com/mailman/listinfo/grub-dev


