[Grub-dev] Newbie Needs Some Pointers
jer
jeremie at jabber.org
Thu Jan 17 08:16:06 UTC 2008
The wiki page I just sent links to the definition of the "Internet
Archive" version of the ARC format we are using:
http://www.archive.org/web/researcher/ArcFileFormat.php
It's very minimal and easy to produce. Here's an example:
filedesc://dummy.arc.gz 0.0.0.0 20071005122244 text/plain 69
1 0 grub.org
URL IP-address Archive-date Content-type Archive-length
http://http://www.biboi-khalas.sulinet.hu200 127.0.0.1 19691231175959 message/http 245
500 Internal Server Error
<html>
....
After that file header, it's just records repeated one after another:
a URL line ending in the body length, followed by the full HTTP result
as the body.
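
To make the layout concrete, here is a minimal Python sketch of how a
client might build these records. The function names (arc_record,
arc_file_header) are made up for illustration and are not taken from any
existing grub client. It assumes that the length field counts the bytes of
the body only, and that a single newline separates one record from the
next, which is how I read the Internet Archive page linked above.

```python
def arc_record(url, ip, date, content_type, body):
    """Build one ARC record: a space-separated header line, then the body.

    `body` must be bytes. The last header field is the body length in
    bytes (assumption: the length does not include the separator newline).
    """
    header = f"{url} {ip} {date} {content_type} {len(body)}\n".encode("utf-8")
    # Assumed record separator: one newline after the body.
    return header + body + b"\n"


def arc_file_header(name, date, origin="grub.org"):
    """Build the filedesc:// record that opens the file.

    Its body is the version line plus the line that names the fields
    used by every following URL record.
    """
    body = (
        f"1 0 {origin}\n"
        "URL IP-address Archive-date Content-type Archive-length\n"
    ).encode("utf-8")
    return arc_record(f"filedesc://{name}", "0.0.0.0", date, "text/plain", body)


if __name__ == "__main__":
    # Hypothetical fetch result: the raw HTTP response, headers included.
    response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
    arc = arc_file_header("dummy.arc.gz", "20071005122244")
    arc += arc_record("http://example.com/", "127.0.0.1",
                      "20080117081606", "message/http", response)
    print(arc.decode("utf-8"))
```

With the file name and date from the example above, arc_file_header
produces a first line ending in 69, the same length as in the example,
which suggests the length is counted over the two lines that follow.
Compressing the result (the example file name ends in .arc.gz) is left
out here.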
Jer
On Jan 17, 2008, at 2:07 AM, Mir Tanvir Hossain wrote:
> Hello, I am kind of confused about the ARC file. So the client is gonna
> crawl each of the given URLs and append the resulting HTTP headers, as
> well as any HTML, to a single file, right?
>
> Then the program will compress the file using the ARC format and upload
> it to the server. Right?
>
> Tanvir
>
>
> On Wed, 2008-01-16 at 23:30 -0800, Yousef Ourabi wrote:
>> 1) No. Client makes a simple HTTP get request with no parameters to
>> http://dispatch.grub.swlabs.org/do/workunit -- the server decides the
>> number of urls the client should fetch.
>>
>> 2) Server will give a list, which is now 250 -- but this may change.
>>
>> 3) Yes
>>
>> 4) Yes
>>
>> 5) Not exactly. The client writes both the HTTP headers and the
>> response body (the HTML). They must be in the same order that the
>> server gave them in the original work list.
>>
>> 6,7) Client doesn't change any HTML; it writes the pages to the ARC
>> file exactly as received. But the basic idea is correct.
>>
>>
>> Hope this helps. Feel free to continue asking questions here or on the
>> IRC channel #searchwikia
>>
>> Good luck.
>> Yousef
>>
>>
>> On 1/16/08, Mir Tanvir Hossain <mir.tanvir.hossain at gmail.com> wrote:
>> Hello Yousef,
>>
>> Lemme briefly write here what I have understood so far.
>>
>> 1. Client will request a list of 250 URLs from the server.
>> 2. Server will give 250 URLs and a PUT URL to upload the ARC back to.
>> 3. With the URLs in hand, client will start crawling those 250 URLs.
>> 4. Client will not follow any redirects.
>> 5. Client will dump all the HTML and check it against a known hash
>> for any change.
>> 6. The client will make an ARC file with all the changed HTML pages.
>> 7. It will upload the ARC with the changed HTML pages back to the
>> server.
>>
>> Am I correct? Please tell me if I am wrong and correct me.
>>
>> Thanks again for your time.
>>
>> Tanvir
>>
>>
>>
>> On Wed, 2008-01-16 at 22:43 -0800, Yousef Ourabi wrote:
>>> Tanvir,
>>>
>>> The "documentation" is all in the mailing list. There is nothing more
>>> formal. Here is a brief description:
>>>
>>> client makes http get request to url
>>> server returns list of 250 urls to fetch, with user-agent. The last
>>> line is an HTTP put where the client should upload the resulting arc
>>> file
>>>
>>> Clients do not follow redirects, i.e. HTTP 301, 302, 307...
>>> Clients do not parse outbound links
>>> Clients report http headers verbatim, including errors
>>>
>>>
>>> To learn about the arc format read this:
>>>
>>> http://lists.wikia.com/pipermail/grub-dev/2008-January/000079.html
>>>
>>> Read all other emails here:
>>>
>>> http://lists.wikia.com/pipermail/grub-dev/2008-January/thread.html
>>>
>>> Ask many questions!
>>>
>>> Thanks,
>>> Yousef
>>>
>>> On 1/16/08, Mir Hossain <mir.tanvir.hossain at gmail.com> wrote:
>>> Hello Yousef, Thanks for your prompt reply. I will try the
>>> perl version right now. I know C#. Maybe I will try to
>>> implement the code in C#. Before that, I need to know how the
>>> protocol works. Is there any documentation about the protocol?
>>> Please let me know.
>>>
>>> Thanks
>>> Tanvir
>>>
>>>
>>> On Jan 16, 2008 10:06 PM, Yousef Ourabi
>>> <yourabi at zero-analog.com> wrote:
>>> Tanvir,
>>> The new SVN repository is http://svn.swlabs.org/grubng
>>>
>>> We are currently re-writing the code to work with the
>>> new RESTful API Jer (Jeremie) is implementing -- so
>>> both the client and the server code are a moving
>>> target.
>>>
>>> The *most* developed client is currently the perl
>>> client http://svn.swlabs.org/grubng/trunk/perl -- but
>>> many others are working on other language
>>> implementations of the same protocol -- Balinny is
>>> working on a C implementation...etc
>>>
>>> If you are interested in learning a new language it
>>> might not be a bad idea to start a new language
>>> implementation of the protocol?
>>>
>>> -Yousef
>>>
>>>
>>> On 1/16/08, Mir Tanvir Hossain
>>> <mir.tanvir.hossain at gmail.com> wrote:
>>>
>>> Hello everybody, I joined the mailing list a couple of weeks ago.
>>> I have been reading the mails regularly, but I am not understanding
>>> that much. I am a Computer Science student and would like to
>>> contribute some code for the project. However, I am not sure where
>>> to begin. Could anybody please give some pointers on where I can
>>> start?
>>>
>>> Sincerely
>>>
>>> Tanvir
>>>
>>>
>>>
>>> _______________________________________________
>>> Grub-dev mailing list
>>> Grub-dev at wikia.com
>>> http://lists.wikia.com/mailman/listinfo/grub-dev