I need to re-read the ARC format discussion and incorporate it into the patch. Expect v3 sometime late tomorrow.<br><br>I'm also going to have to re-read the emails Jer just sent to the wikia mailing list to fully digest them, but I really look forward to learning more about the Nutch setup wikia is using, to get the full "perspective" on the back-end aspects of wikia search.
<br><br>Regarding the generated work-units -- Jer: how are you generating them now? I'm assuming this isn't the current "server" but some modified version you have running? It would be great to hear a bit about your next steps there.
<br><br>More tomorrow.<br><br>Thanks.<br>Yousef<br><br><br><div><span class="gmail_quote">On 1/10/08, <b class="gmail_sendername">jer</b> <<a href="mailto:jeremie@jabber.org">jeremie@jabber.org</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
>> So, I think you're right and it's missing a \n, but maybe it's<br>>> missing TWO of them?<br>>><br>>> doc == <nl><URL-record><nl><network_doc><br>>><br>
>> URL-record-v1 == <url><sp><br>>> <ip-address><sp><br>>> <archive-date><sp><br>>> <content-type><sp><br>>> <length><nl><br>>>
<br>>> So, there should be a \n before each URL record, and two of them<br>>> after it, one defined as the terminator in URL-record-v1, and one<br>>> defined as the separator between URL-record and network_doc. Is that
<br>>> correct?<br>>><br>>> print $arc "\nhttp://$host$path $ip 19691231175959 $ctype ",length<br>>> ($body),"\n\n$body";<br>>><br>>> Is that correct? Can anyone else verify?
<br>>><br>> So it seems.<br><br>Can anyone else verify this is correct? \n URL-stuff \n \n CONTENT ?<br><br>>> The workunits can (someday) start to define HTTP/1.1 with a<br>>> Connection: close, and an Accept-Encoding: gzip. A client supporting
<br>>> the current workunit format shouldn't care or know any different,<br>>> right?<br>>><br>> The client's bandwidth might care ;-)<br><br>Yep, easy enough to add these headers in the workunits as well :)
<br><br>>> Doh! My bad, I can fix it when I generate some more workunits :)<br>> Aren't they generated on-the-fly?<br><br>Heh, nope, there's no DB in this back-end so it's much faster and<br>easier to pre-generate batches of these from flat lists right now.
<br><br>Jer<br></blockquote></div><br>
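For reference, the record layout discussed in the quoted thread (a leading newline, the space-separated URL record terminated by a newline, a separator newline, then the content) could be sketched in Python roughly as follows. This is only a sketch under the reading of the format given in the thread, not the actual workunit or archive generator. The function name `build_arc_doc` and the sample values are made up for illustration, and the length field is assumed to be the byte length of the body.

```python
def build_arc_doc(url: str, ip: str, archive_date: str,
                  content_type: str, body: bytes) -> bytes:
    """Assemble one doc as <nl><URL-record><nl><network_doc>.

    Hypothetical helper: follows the layout as read in the thread,
    with <length> assumed to be the byte length of the body.
    """
    # URL-record-v1: five space-separated fields, terminated by a newline.
    fields = [url, ip, archive_date, content_type, str(len(body))]
    url_record = " ".join(fields).encode("ascii") + b"\n"
    # Leading newline, URL record, separator newline, then the raw content.
    return b"\n" + url_record + b"\n" + body


if __name__ == "__main__":
    # Sample values are placeholders, not taken from a real crawl.
    doc = build_arc_doc("http://example.com/", "192.0.2.1",
                        "19691231175959", "text/plain", b"hello world")
    print(doc)
```

Joining the fields with a single space keeps the length from running into the content type, which is easy to get wrong when the pieces are printed as separate list items.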