[Grub-dev] C# Grubng 0.1 and few questions

seth seth at untethered.org
Mon Jan 28 02:15:01 UTC 2008


On Jan 27, 2008, at 5:47 PM, Balinny wrote:
> It seems everything with gz extension is accepted on  /arc/
>
> $ putarchive hotbits soap.grub.org /arcs/FooBar.gz
> HTTP/1.1 200
> Date: Sun, 27 Jan 2008 22:46:46 GMT
> Server: Apache/1.3.28 (Unix)
> Connection: close
> Content-Type: text/html
>
> <HEAD><TITLE>OK</TITLE></HEAD><H1>Content Accepted</H1>
>
>
> (hotbits file are 2048 random bytes from
> http://fourmilab.ch/hotbits/generate.html)
>
> However, I'm unable to upload
> Balinny.XXXXdde0b5c6490a7a229ae4c7516d962933be64.arc.gz :S
>
> The file is uploaded, I can download it and it's ok, but the server
> never answers, so the uploader stalls waiting for the reply.
> Behaviour is consistent with that content (which seems corrupted, like
> working only /sometimes/, I'll need to recheck that).
>

Jer has been prodding me to comment on what's going on with the grub  
back end work all week.  I kept pushing back saying I wanted something  
to 'show' before I said anything.  But...given recent comments, here's  
some insight into what's going on.

The first script that accepted .arc.gz files from users was written  
very quickly.  All the script did was store these files to disk; later  
we were going to parse out the content, verify hashes, stuff like  
that.  So really all a 200 response from PUTing told you was: our  
disks worked :)

This week I started tinkering with the script that accepts the files.
Now, after it saves your arc file to disk, it opens it and attempts
to separate out each URL and insert them into an HBase table.  If you
were really paying attention this week you would have noticed that the
time it takes to post a file has been steadily increasing (it should now
take ~17-25 seconds) as this second part of the script attempts to take
apart your file and insert it as rows into HBase.
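
For anyone curious what "separating out each URL" might look like, here
is a rough sketch, not the actual server script.  It assumes the uploads
follow the Internet Archive ARC version 1 record layout (a header line of
"URL IP date content-type length" followed by that many bytes of body);
the post does not say which layout Grub uses, and the function name
iter_arc_records is made up for illustration.

```python
import gzip
import io
from typing import Iterator, Tuple


def iter_arc_records(raw_gz: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs from a gzip-compressed ARC file.

    Assumes ARC v1 headers: the first field is the URL and the last
    field is the body length in bytes.
    """
    with gzip.open(io.BytesIO(raw_gz), "rb") as fh:
        while True:
            header = fh.readline()
            if not header:
                break                      # end of file
            if not header.strip():
                continue                   # blank line separating records
            fields = header.decode("latin-1").split()
            url, length = fields[0], int(fields[-1])
            body = fh.read(length)         # read exactly the declared length
            yield url, body
```

Each pair could then become one row in a table keyed by URL.  Under this
assumed layout the leading file-description record is yielded as well, so
a real loader would probably skip it.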

We are still storing the original file that was submitted to us.
HBase is turning out to be a bit...less stable than we'd like at
times.  More than once I've done simple things like 'truncate table'
or 'alter table add column' and had the entire table disappear on
me...for no explainable reason.  Once I even had the 'meta' table
become corrupt and I had to basically reset the entire database
(losing all tables).  And a good example would be right now:
apparently in the last day or so one of the region servers stopped
working correctly and the master didn't attempt to relocate it.  So,
any interaction with rows that belonged to that region simply hung.
After some research into this error, I think I can keep this one from
happening again.

One thing I really need to clear up is the responses this script  
gives.  As I said, the first thing the script does is *save* your  
file, then it attempts to load it into hbase.  So, if you transmitted  
all the bytes, we likely have the file and got stalled in our  
interaction with hbase.  Right now if you get a 500 error from the  
server *after* transmitting the file, I can almost assure you it's a  
problem with the hbase bits and that we did get and store the file.
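
One way the responses could be made less ambiguous is to report the two
steps separately.  The sketch below is only an illustration of the
"save first, then index" ordering described above; handle_upload and the
index callback are hypothetical names, and the status codes and messages
are assumptions, not what the server currently returns.

```python
import os
from typing import Callable, Tuple


def handle_upload(name: str, body: bytes, store_dir: str,
                  index: Callable[[str], int]) -> Tuple[int, str]:
    """Save the upload first, then try to index it.

    `index` stands in for the (hypothetical) HBase loader; it takes the
    saved path and returns the number of URLs it inserted.
    """
    path = os.path.join(store_dir, os.path.basename(name))
    with open(path, "wb") as fh:
        fh.write(body)              # step 1: the file is on disk first
    try:
        rows = index(path)          # step 2: may fail independently of step 1
    except Exception as exc:
        # The file is already stored, so say so rather than a bare 500.
        return 500, f"stored {len(body)} bytes; indexing failed: {exc}"
    return 200, f"stored {len(body)} bytes; indexed {rows} URLs"
```

Note that a try/except only covers loader errors.  A loader that hangs
(as with the stuck region server) would still need a timeout around the
index call.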

Hopefully in the next week you'll start seeing some kind of document  
that is kicked back that indicates how many of your URLs were accepted  
into hbase.  I also hope to be adding some code to make injection of  
data a little more difficult (basically, rate limiting how many times  
in one day we can see the same document come in...and verifying the  
hash).  I also want to get a script up that'll let you type in a URL  
and see the cached versions we have for that URL.  But, all of these  
actions require that I add some missing functions to HBase's REST API,  
and I've been avoiding doing that all weekend :).
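
The rate limiting and hash check could look something like the sketch
below.  Everything specific here is an assumption: the post gives no
daily limit (3 is a placeholder), does not say which hash is used (SHA-1
is a guess), and the SubmissionGate class with its in-memory counter is
hypothetical; a real version would presumably keep the counts in a
shared store.

```python
import hashlib
from collections import defaultdict
from datetime import date

MAX_PER_DAY = 3  # placeholder; the real limit is not stated


class SubmissionGate:
    """Reject documents whose hash is wrong or that repeat too often in a day."""

    def __init__(self, max_per_day: int = MAX_PER_DAY):
        self.max_per_day = max_per_day
        self._seen = defaultdict(int)   # (day, digest) -> times accepted

    def accept(self, body: bytes, claimed_sha1: str, today: date) -> bool:
        digest = hashlib.sha1(body).hexdigest()
        if digest != claimed_sha1.lower():
            return False                # content does not match the claimed hash
        key = (today, digest)
        if self._seen[key] >= self.max_per_day:
            return False                # same document seen too often today
        self._seen[key] += 1
        return True
```

Keying the counter on (day, digest) means the limit resets each day
without any explicit cleanup logic, at the cost of old keys piling up
until they are pruned.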

For now though if you're getting 500 errors or the upload hangs you  
can know who to blame.  Feel free to ping me in IRC if you really want  
(I go by untethered), and curse the systems administrator that decided  
to avoid systems admin work by writing code.

>
>> And one thing - maybe funny, maybe weird - but wiki projects (wikipedia,
>> wikinews, wiktionary, wikibooks) have banned the Grub client in
>> robots.txt - this works even on the C# client (it reports wiki links as
>> Not Found, but these sites can be viewed with a normal browser - this
>> looks like a ban for Grub)
>>
>> Bartek
> They aren't "Not found", but Forbidden (incidentally, it has been done
> at squid level, robots.txt is not mandatory)
>
> GET /wiki/Main_Page HTTP/1.0
> Host: en.wikipedia.org
> User-Agent: Grub WU1
>
> HTTP/1.0 403 Forbidden
> Server: squid/2.6.STABLE18
> Date: Sun, 27 Jan 2008 23:43:30 GMT
> Content-Type: text/html
> Content-Length: 50773
> Expires: Sun, 27 Jan 2008 23:43:30 GMT
> X-Squid-Error: ERR_ACCESS_DENIED 0
> X-Cache: MISS from knsq28.knams.wikimedia.org
> X-Cache-Lookup: NONE from knsq28.knams.wikimedia.org:80
> Via: 1.0 knsq28.knams.wikimedia.org:80 (squid/2.6.STABLE18)
> Connection: close

We've got two things that are getting us on everybody's bad list.
One: the old grub was very bad and very abusive to systems and there
are lots of robots.txt files (and web admins) that haven't forgotten
that.  So, we have a lot of old karma to overcome.  Two: there are
technical reasons why the old grub was very bad: it's very hard to get
a distributed crawler to behave, especially when it comes to sites that
say things like 'don't download any faster than x URLs/sec.'  From
where we sit, after we hand out a work unit we have no clue when you'll
actually fetch that page, making it very hard to obey such rate
limiting commands in robots.txt.  For now grubng is not giving out
*any* URLs from sites that have rate limiting statements in their
robots.txt of less than x URLs/sec (sorry, I don't remember off the
top of my head what x is, and I can't find jer's script either).
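
As a rough idea of what such a filter can look like (this is not jer's
script, which I haven't seen), the sketch below uses Python's standard
urllib.robotparser to read Crawl-delay and Request-rate lines.  The
threshold of 1.0 URLs/sec, the agent name "grub" and the function names
are all placeholders, since the real x is not given above.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

MIN_RATE = 1.0  # URLs/sec; placeholder for the unknown threshold x


def allowed_rate(robots_txt: str, agent: str) -> Optional[float]:
    """Return the max URLs/sec robots.txt permits for agent, or None if unlimited."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    rates = []
    delay = rp.crawl_delay(agent)          # seconds between requests, or None
    if delay:
        rates.append(1.0 / delay)
    rr = rp.request_rate(agent)            # named tuple (requests, seconds), or None
    if rr and rr.seconds:
        rates.append(rr.requests / rr.seconds)
    return min(rates) if rates else None


def may_hand_out(robots_txt: str, agent: str = "grub",
                 min_rate: float = MIN_RATE) -> bool:
    """True if the site sets no rate limit, or one at or above min_rate."""
    rate = allowed_rate(robots_txt, agent)
    return rate is None or rate >= min_rate
```

This only looks at the rate limiting lines; whether a given URL is
disallowed outright is a separate check (can_fetch in the same module).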

But, in the long run we don't see this as too big of a problem.  We're
hoping to eventually evolve grub into jer's plans for atlas; this
means that eventually we see a day where high traffic sites will have
a way to feed *us* their new and updated pages, or an API where we can
subscribe to their changes.  Basically, a better way of getting
updates than trolling the whole web site hoping to stumble over
something new.

So there Jer, I posted something.

seth


