[Grub-dev] C# Grubng 0.1 and few questions
seth
seth at untethered.org
Mon Jan 28 02:15:01 UTC 2008
On Jan 27, 2008, at 5:47 PM, Balinny wrote:
> It seems everything with a .gz extension is accepted on /arcs/
>
> $ putarchive hotbits soap.grub.org /arcs/FooBar.gz
> HTTP/1.1 200
> Date: Sun, 27 Jan 2008 22:46:46 GMT
> Server: Apache/1.3.28 (Unix)
> Connection: close
> Content-Type: text/html
>
> <HEAD><TITLE>OK</TITLE></HEAD><H1>Content Accepted</H1>
>
>
> (the hotbits file is 2048 random bytes from
> http://fourmilab.ch/hotbits/generate.html)
>
> However, i'm unable to upload
> Balinny.XXXXdde0b5c6490a7a229ae4c7516d962933be64.arc.gz :S
>
> The file is uploaded, i can download it and it's ok, but the server
> never answers, so the uploader stalls waiting for the reply.
> Behaviour is consistent with that content (which seems corrupted, like
> working only /sometimes/, i'll need to recheck that).
>
Jer has been prodding me to comment on what's going on with the grub
back end work all week. I kept pushing back saying I wanted something
to 'show' before I said anything. But...given recent comments, here's
some insight into what's going on.
The first script that accepted .arc.gz files from users was written
very quickly. All the script did was store these files to disk; later
we were going to parse out the content, verify hashes, stuff like
that. So really all a 200 response from PUTing told you was: our
disks worked :)
This week I started tinkering with the script that accepts the files.
Now, after it saves your arc file to disk, it opens it and attempts
to separate out each URL and insert them into an HBase table. If you
were really paying attention this week you would have noticed that the
time it takes to post a file has been steadily increasing (it should
now take ~17-25 seconds) as this second part of the script attempts to
take apart your file and insert it as rows into HBase.
We are still storing the original file that was submitted to us.
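
The store-then-load order described above can be sketched roughly as
follows. This is only an illustration, not the actual script: the
function name accept_upload, the load_into_table callable and the
status messages are all hypothetical.

```python
import gzip
from pathlib import Path
from typing import Callable


def accept_upload(name: str, payload: bytes, store_dir: Path,
                  load_into_table: Callable[[bytes], int]) -> tuple[int, str]:
    """Store the upload first, then try to load it into the table.

    load_into_table is a hypothetical callable that takes the
    decompressed arc data and returns the number of rows inserted.
    """
    # Step 1: always keep the original file, exactly as submitted.
    target = store_dir / Path(name).name  # drop any directory part
    target.write_bytes(payload)

    # Step 2: best effort. A failure here does not undo step 1.
    try:
        rows = load_into_table(gzip.decompress(payload))
    except Exception as exc:  # corrupt gzip, table backend down, ...
        return 500, f"stored {target.name}, but loading failed: {exc}"
    return 200, f"stored {target.name}, loaded {rows} rows"
```

With this ordering a failed load still leaves the submitted file on
disk, so it can be re-parsed later without asking for a new upload.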
HBase is turning out to be a bit...less stable than we'd like at
times. More than once I've done simple things like 'truncate table'
or 'alter table add column' and had the entire table disappear on
me...for no explainable reason. Once I even had the 'meta' table
become corrupt and I had to basically reset the entire database
(losing all tables). A good example would be right now, when
apparently in the last day or so one of the region servers stopped
working correctly and the master didn't attempt to relocate it. So,
any interaction with rows that belonged to that region simply hung.
After some research into this error, I think I can keep this one from
happening again.
One thing I really need to clear up is the responses this script
gives. As I said, the first thing the script does is *save* your
file, then it attempts to load it into hbase. So, if you transmitted
all the bytes, we likely have the file and got stalled in our
interaction with hbase. Right now if you get a 500 error from the
server *after* transmitting the file, I can almost assure you it's a
problem with the hbase bits and that we did get and store the file.
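
From the client side, the rule of thumb above could be written down
like this. It is a sketch under assumptions: the function name and
the outcome labels are made up for illustration, and treating any
other status as a rejection is a guess, not something the server
documents.

```python
from typing import Optional


def upload_outcome(bytes_sent: int, total_bytes: int,
                   status: Optional[int]) -> str:
    """Guess what happened to an upload.

    status is the HTTP status code, or None if the server never
    replied (the client timed out waiting).
    """
    if status == 200:
        return "accepted"
    if bytes_sent < total_bytes:
        return "retry"            # the body never fully arrived
    if status is None or status >= 500:
        return "probably-stored"  # file likely on disk; table load stalled
    return "rejected"             # assumption: anything else is a refusal
```

An uploader that gets "probably-stored" would most likely do better
to skip re-sending the same file right away.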
Hopefully in the next week you'll start seeing some kind of document
that is kicked back that indicates how many of your URLs were accepted
into hbase. I also hope to be adding some code to make injection of
data a little more difficult (basically, rate limiting how many times
in one day we can see the same document come in...and verifying the
hash). I also want to get a script up that'll let you type in a URL
and see the cached versions we have for that URL. But, all of these
actions require that I add some missing functions to HBase's REST API,
and I've been avoiding doing that all weekend :).
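
The planned injection checks (verify the hash, limit how often the
same document is accepted per day) might look something like the
sketch below. Everything specific in it is an assumption: the class
name, the use of SHA-1, the limit of 3 per day, and keeping the
counters in memory rather than in a table.

```python
import hashlib
from collections import defaultdict
from datetime import date


class SubmissionGuard:
    """Reject documents with a bad hash or too many repeats per day."""

    def __init__(self, max_per_day: int = 3):  # 3 is an arbitrary example
        self.max_per_day = max_per_day
        self._seen = defaultdict(int)  # (day, digest) -> times accepted

    def check(self, payload: bytes, claimed_digest: str, day: date) -> bool:
        # Verify the hash the submitter claims (SHA-1 is an assumption).
        digest = hashlib.sha1(payload).hexdigest()
        if digest != claimed_digest.lower():
            return False
        # Rate limit: the same document only so many times per day.
        key = (day, digest)
        if self._seen[key] >= self.max_per_day:
            return False
        self._seen[key] += 1
        return True
```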
For now though, if you're getting 500 errors or the upload hangs, you
know who to blame. Feel free to ping me in IRC if you really want
(I go by untethered), and curse the systems administrator that decided
to avoid systems admin work by writing code.
>
>> And one thing - maybe funny, maybe weird - but the wiki projects
>> (wikipedia, wikinews, wiktionary, wikibooks) have banned the Grub
>> client in robots.txt - this works even on the C# client (it reports
>> wiki links as Not Found, but these sites can be viewed in a normal
>> browser - this looks like a ban for Grub)
>>
>> Bartek
> They aren't "Not found", but Forbidden (incidentally, it has been done
> at squid level, robots.txt is not mandatory)
>
> GET /wiki/Main_Page HTTP/1.0
> Host: en.wikipedia.org
> User-Agent: Grub WU1
>
> HTTP/1.0 403 Forbidden
> Server: squid/2.6.STABLE18
> Date: Sun, 27 Jan 2008 23:43:30 GMT
> Content-Type: text/html
> Content-Length: 50773
> Expires: Sun, 27 Jan 2008 23:43:30 GMT
> X-Squid-Error: ERR_ACCESS_DENIED 0
> X-Cache: MISS from knsq28.knams.wikimedia.org
> X-Cache-Lookup: NONE from knsq28.knams.wikimedia.org:80
> Via: 1.0 knsq28.knams.wikimedia.org:80 (squid/2.6.STABLE18)
> Connection: close
We've got two things that are getting us on everybody's bad list.
One: the old grub was very bad and very abusive to systems and there
are lots of robots.txt (and web admins) that haven't forgotten that.
So, we have a lot of old karma to overcome. Two: there are technical
reasons why the old grub was very bad. It's very hard to get a
distributed crawler to behave, especially when it comes to sites that
say things like 'don't download any faster than x URLs/sec.' From
where we sit, after we hand out a work unit we have no clue when you'll
actually fetch that page, making it very hard to obey such rate
limiting commands in robots.txt. For now grubng is not giving out
*any* URLs from sites that have rate limiting statements in their
robots.txt of less than x URLs/sec (sorry, I don't remember off the
top of my head what x is, and I can't find jer's script either).
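
A filter of the kind described above could be sketched with Python's
standard urllib.robotparser. This is not jer's script: the function
names are hypothetical, and min_rate stands in for the x that the
post does not state.

```python
from urllib.robotparser import RobotFileParser


def allowed_rate(robots_txt: str, user_agent: str):
    """Return the URLs/sec the site allows this agent, or None if unstated."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    rates = []
    delay = parser.crawl_delay(user_agent)    # seconds between fetches
    if delay:
        rates.append(1.0 / delay)
    req = parser.request_rate(user_agent)     # (requests, seconds) or None
    if req and req.seconds:
        rates.append(req.requests / req.seconds)
    # If both directives are present, honour the stricter one.
    return min(rates) if rates else None


def may_hand_out(robots_txt: str, user_agent: str, min_rate: float) -> bool:
    """True if URLs from this site may go into work units."""
    rate = allowed_rate(robots_txt, user_agent)
    return rate is None or rate >= min_rate
```

Skipping such sites entirely sidesteps the timing problem: the server
never has to know when a client will actually fetch the page.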
But, in the long run we don't see this as too big of a problem. We're
hoping to eventually evolve grub into jer's plans for atlas. This
means that eventually we see a day where high traffic sites will have
a way to feed *us* their new and updated pages, or an API where we can
subscribe to their changes. Basically, a better way of getting
updates than trolling the whole web site hoping to stumble over
something new.
So there Jer, I posted something.
seth