Posts Tagged ‘mapreduce’

Google can sort

Saturday, November 22nd, 2008

Google recently announced that they were able to sort 1 terabyte (TB) of data in 68 seconds using 1,000 computers. The previous record was 209 seconds on 910 computers. I was impressed by this because I recently read about MapReduce and have been studying some of Google’s papers about the Google File System. Google used both MapReduce and the Google File System to attain this sorting record. But, being Google, they figured that since 1 TB went so well, why not try sorting 1 petabyte (PB)? (A petabyte is a thousand terabytes.) Google was able to sort 1 PB in six hours and two minutes using 4,000 computers.
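For a rough sense of scale, here is my own back-of-envelope arithmetic using just the figures above (and treating a petabyte as 1,000 terabytes, as in the announcement):

# Back-of-envelope throughput from the figures in Google's announcement.
# 1 TB = 1,000 GB and 1 PB = 1,000,000 GB (decimal units, as in the post).
tb_total_gbps = 1000 / 68                         # 1 TB in 68 s on 1,000 machines
pb_total_gbps = 1000000 / (6 * 3600 + 2 * 60)     # 1 PB in 6 h 2 min on 4,000 machines
print(f"1 TB run: {tb_total_gbps:.1f} GB/s total, {tb_total_gbps * 1000 / 1000:.1f} MB/s per machine")
print(f"1 PB run: {pb_total_gbps:.1f} GB/s total, {pb_total_gbps * 1000 / 4000:.1f} MB/s per machine")

That works out to roughly 15 GB/s of aggregate sorting throughput for the terabyte run and around 46 GB/s for the petabyte run; the per-machine rate is a bit lower on the larger cluster, presumably because of the extra coordination and data movement.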

Why does Google care about sorting? One reason may be that their primary revenue source is advertising, and they have access to massive amounts of data submitted by their end users in the form of search queries. The more efficiently Google can crunch this information, the better they can target their advertising to users, resulting in more revenue. And Google can use their data for other purposes too, like predicting flu outbreaks.

I have been very impressed by what I have been reading about MapReduce and the Google File System. These sorting results help prove how efficient their infrastructure is. I particularly like how they use commodity computers to achieve these results. I know that coordinating multiple nodes can get tricky very quickly, but their techniques seem to be designed from the ground up to use multiple nodes. And with this mindset, they can more effectively manage and utilize their collective computing resources.
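To make the MapReduce angle a bit more concrete, here is a toy, single-process sketch of how a MapReduce-style sort can work: the map step tags each key with a range bucket, a shuffle groups keys by bucket, and the reduce step sorts each bucket, so the buckets concatenated in order come out globally sorted. The function names and the in-memory shuffle are my own invention for illustration; this is not Google’s code, nor their actual partitioning scheme, nor anything close to a distributed implementation.

# Toy, single-process sketch of a MapReduce-style sort.
# Real MapReduce runs the map and reduce tasks on many machines over GFS;
# here everything is in memory and all names are made up for illustration.
from collections import defaultdict

NUM_REDUCERS = 4

def map_phase(records):
    """Tag each key with a range bucket (a stand-in for the partition function)."""
    for key in records:
        letter = max(0, min(ord(key[0].lower()) - ord('a'), 25))
        yield letter * NUM_REDUCERS // 26, key

def shuffle(mapped):
    """Group intermediate pairs by bucket, as the framework would."""
    buckets = defaultdict(list)
    for bucket, key in mapped:
        buckets[bucket].append(key)
    return buckets

def reduce_phase(buckets):
    """Sort each bucket; concatenating the buckets in order gives a global sort."""
    output = []
    for bucket in sorted(buckets):
        output.extend(sorted(buckets[bucket]))
    return output

if __name__ == "__main__":
    words = ["zebra", "apple", "mango", "kiwi", "banana", "quince", "fig"]
    print(reduce_phase(shuffle(map_phase(words))))
    # ['apple', 'banana', 'fig', 'kiwi', 'mango', 'quince', 'zebra']

The part I find appealing is that each reduce task only ever sees its own slice of the keys, so scaling to a bigger input is, conceptually at least, a matter of adding more commodity machines.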

What I’m reading: locks!

Friday, October 10th, 2008

I have been reading some of the papers published by the Google engineers. It started with Bigtable: A Distributed Storage System for Structured Data. I am not sure how I started. The Official Google Blog posted a link announcing their new Technology RoundTable series. I watched the “MapReduce” discussion, where the engineers talked about Bigtable and how it is used in MapReduce. This led me to look for more information about Bigtable, as I was looking for information on distributed “communication” techniques to enhance the littles3 implementation. (The current littles3 architecture is very simple and only supports one node. It works, but doesn’t do any cool things like scaling storage or tolerating faults.) I had heard Bigtable discussed in different technical blog settings, but I had no idea that there was a paper from two years ago that described the Bigtable system. (I guess I don’t read the technical CS journals like I should. I may have to become more active in IEEE.)

While reading the paper (I did find it very readable. Okay, I am a computer geek. Fair warning.) I noticed that Bigtable, which is a highly scalable distributed database (not relational), uses a “lock service” called Chubby. What is a “lock service”? Well, the paper The Chubby Lock Service for Loosely-Coupled Distributed Systems will tell you. I am currently reading this paper. (Again, this is from 2006! Where have I been?) Mike Burrows, the paper’s author, sprinkles humor into a computer science paper discussing Paxos, “a family of protocols for solving consensus in a network of unreliable processors”. What I found interesting is how the “lock service” is used to share information in a highly distributed system. The Bigtable implementation is a client of the “lock service” and uses it to elect a leader; the leader is the node that acquires the lock, and only one node can hold it. The “lock service” can also store small amounts of information, like metadata or configuration information, that a client application can read from the “lock service”.
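To make the leader-election idea concrete for myself, here is a toy, single-process sketch. The LockService class and its methods are invented for illustration and are nothing like Chubby’s actual client interface, which the paper describes as looking more like a small file system with advisory locks, sessions, and leases.

import threading

# Toy sketch of electing a leader through a lock service.
# Everything here (LockService, try_acquire, ...) is invented for illustration;
# a real lock service is itself a replicated, fault-tolerant system.
class LockService:
    def __init__(self):
        self._mutex = threading.Lock()
        self._holder = None
        self._data = {}          # small metadata, e.g. who the current leader is

    def try_acquire(self, node_id):
        """Grant the lock to the first caller; everyone else is refused."""
        with self._mutex:
            if self._holder is None:
                self._holder = node_id
                return True
            return False

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)

def start_node(node_id, service):
    if service.try_acquire(node_id):
        # The node holding the lock is the leader; it advertises itself by
        # storing a small piece of metadata in the lock service.
        service.write("leader", node_id)
        print(f"{node_id}: I am the leader")
    else:
        print(f"{node_id}: following {service.read('leader')}")

if __name__ == "__main__":
    service = LockService()
    for node_id in ["node-1", "node-2", "node-3"]:
        start_node(node_id, service)   # node-1 wins the lock; the rest follow

What I like about the pattern is that the election and the “where do I find the leader” metadata live in the same small, well-known place.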

Next up is the paper Paxos Made Live – An Engineering Perspective. This paper provides some details on how the Google team implemented Chubby, some of the history of the previous implementation, and some of the issues they discovered while implementing the Paxos algorithm.
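For my own notes, here is a rough, in-memory sketch of the single-decree Paxos message flow (prepare/promise, then accept/accepted) as I understand it from the papers. It leaves out everything that makes Paxos Made Live interesting: networking, disk logging, failures, leader leases, and chaining many instances of the protocol together.

# Rough, in-memory sketch of single-decree Paxos: choose one value, once.
# No networking, failures, or persistence; just the two phases of the protocol.
class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised so far
        self.accepted_n = -1     # proposal number of the accepted value
        self.accepted_v = None   # the accepted value, if any

    def prepare(self, n):
        """Phase 1b: promise to ignore proposals numbered lower than n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, value):
        """Phase 2b: accept the value unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = value
            return True
        return False

def propose(acceptors, n, value):
    """A proposer tries to get a value chosen by a majority of acceptors."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: collect promises, and note any value already accepted.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None
    highest = max(granted, key=lambda p: p[0])
    if highest[1] is not None:
        value = highest[1]       # must re-propose the previously accepted value

    # Phase 2: the value is chosen once a majority accepts it.
    accepts = sum(a.accept(n, value) for a in acceptors)
    return value if accepts >= majority else None

if __name__ == "__main__":
    acceptors = [Acceptor() for _ in range(5)]
    print(propose(acceptors, n=1, value="node-1 is leader"))   # node-1 is leader
    print(propose(acceptors, n=2, value="node-2 is leader"))   # still node-1 is leader

The second call is the interesting one: because a majority already accepted a value, the later proposal is forced to re-propose that same value, which is the safety property that lets a group of unreliable machines agree on a single answer.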

Together, these papers provide some details of how Google has implemented highly distributed systems. So far, the information about Paxos has been very enlightening. And I am impressed with the way in which a “lock service” is used to coordinate communication and direct cooperation in an automated distributed network. It seems that they have created simple building blocks that work together, sometimes in unexpected ways, to make a complex system.