Archive for November, 2008
So, I promised some documentation “soon” for littleS3. That was two months ago. Well, I have finally made good: I have just published a “Getting Started” wiki page on the project site. So far, the document provides some background on the project components, explains how to deploy littleS3 to an application server, and describes what the configuration files “configure” (sample configuration files are available in the project download section).
I would still like to add some samples of how to use the system to create buckets, add objects, and so on. The usage is very similar to what the Amazon S3 Developer Guide describes for the REST API, but there is a bit of a trick since you are running your own application server: in addition to the host name, you may need to include a context path (a servlet notion) in the REST URIs.
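Until those samples make it into the wiki, here is a rough, hypothetical illustration of the idea. It assumes littleS3 is deployed at the context path /littleS3 on a local server at port 8080 (both placeholders for your own setup), uses browser-style fetch calls, and leaves out any authentication headers your configuration may require:

// Hypothetical base URL: the host name plus the servlet context path of the littleS3 deployment.
const base = "http://localhost:8080/littleS3";

async function demo() {
  // Create a bucket: PUT <host>/<context-path>/<bucket>
  // (against Amazon itself this would simply be PUT <host>/<bucket>)
  let res = await fetch(`${base}/mybucket`, { method: "PUT" });
  console.log("create bucket:", res.status);

  // Add an object: PUT <host>/<context-path>/<bucket>/<key>
  res = await fetch(`${base}/mybucket/hello.txt`, {
    method: "PUT",
    body: "Hello from littleS3",
  });
  console.log("add object:", res.status);
}

demo();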
Google recently announced that they were able to sort 1 terabyte (TB) of data in 68 seconds using 1,000 computers. The previous record was 209 seconds on 910 computers. I was impressed by this because I recently read about MapReduce and have been studying some of Google’s papers about the Google File System; Google used both MapReduce and the Google File System to attain this sorting record. But, being Google, they figured that since 1 TB went so well, why not try sorting 1 petabyte (PB)? (A petabyte is a thousand terabytes.) Google sorted 1 PB in six hours and two minutes using 4,000 computers.
Why does Google care about sorting? One reason may be that their primary revenue source is advertising, and they have access to massive amounts of data submitted by their end users in the form of search queries. The more efficiently Google can crunch this information, the better they can target their advertising to users, resulting in more revenue. And Google can use the same data for other purposes too, like predicting flu outbreaks.
I have been very impressed by what I have been reading about MapReduce and the Google File System. These sorting results help demonstrate how efficient their infrastructure is. I particularly like how they use commodity computers to achieve these results. I know that coordinating multiple nodes can get tricky very quickly, but their techniques seem to be designed from the ground up to run across many machines, and with that mindset they can better manage and utilize their collective computing resources.
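For anyone who has not looked at the papers, the heart of MapReduce is a small programming model: a map function turns input records into (key, value) pairs, a shuffle groups the pairs by key, and a reduce function folds each group into a result. Here is a toy, single-process word-count sketch of that model (my own illustration, not Google's code); the real system runs the same two user-written functions across thousands of commodity machines, with the Google File System holding the data:

// Toy, single-process illustration of the MapReduce programming model.
function map(doc) {
  // Emit a (word, 1) pair for every word in the document.
  return doc.toLowerCase().split(/\W+/).filter(Boolean).map(word => [word, 1]);
}

function reduce(word, counts) {
  // Sum the partial counts emitted for one word.
  return counts.reduce((total, n) => total + n, 0);
}

// The "shuffle" phase groups intermediate pairs by key; then each group is reduced.
const docs = ["the quick brown fox", "the lazy brown dog"];
const grouped = {};
for (const doc of docs) {
  for (const [word, one] of map(doc)) {
    (grouped[word] = grouped[word] || []).push(one);
  }
}
const wordCounts = {};
for (const word in grouped) {
  wordCounts[word] = reduce(word, grouped[word]);
}
console.log(wordCounts); // { the: 2, quick: 1, brown: 2, fox: 1, lazy: 1, dog: 1 }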
During the day, I work professionally on a web application called Pearson Access. Today we determined that one of our Pearson Access customers needs a user administration process that requires them to always select a certain user role when creating a new user. The system doesn’t auto-check the role, so the person creating the account will need to be trained to check it.
But, being a geek, I got to thinking: what if a Greasemonkey script in Firefox could automatically check the role when creating a new user account? So, I wrote such a script: autoselectrole.user.js.
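For anyone curious what such a script looks like, here is a hypothetical sketch rather than the actual autoselectrole.user.js; the @include URL and the roleCheckbox id are placeholders for whatever the real user-creation page uses:

// ==UserScript==
// @name          Auto Select Role (sketch)
// @namespace     http://example.com/greasemonkey
// @description   Automatically checks a role checkbox on the new-user page
// @include       http://example.com/createUser*
// ==/UserScript==

// Hypothetical element id; the real script keys off whatever id or name
// the role checkbox has on the actual page.
var roleCheckbox = document.getElementById("roleCheckbox");
if (roleCheckbox && !roleCheckbox.checked) {
  roleCheckbox.checked = true;
}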