Would You Like To Go Fast? Really Fast?
I’m talking about software rather than roller coasters. If roller coasters are what you’re into, I recommend Kingda Ka, which you’ll find at Six Flags in Jackson, New Jersey. As for software, the first thing to know is that parallel processing is the only way to go if lightning fast is what you want.
OK, I admit that if you write in assembler on a zSeries mainframe and you’re really good at nuts-and-bolts coding, then you can build an application that has screeching tires. But in the end, someone writing in Java and running in parallel will overtake you at some point if they keep on adding CPUs.
Hadoop MapReduce
Google made MapReduce famous and there is a good deal of enthusiasm for MapReduce used in combination with Hadoop. I’ll do a posting in a few days that goes into more detail on how Hadoop MapReduce works, but in outline this is what it does:
- It has a distributed file system (HDFS) which is built to hold files of similarly structured records. Nothing special about it, except that it’s fault tolerant and can scale to petabytes. Actually that is special, but never mind.
- It can utilize thousands of CPUs in parallel. Many existing implementations actually do. The largest at the moment consists of a grid of about 5000 CPUs.
But can it scream? Well yes indeed, it can. Google invented MapReduce and uses it to search the World Wide Web. The fact that coherent search results spanning billions of web pages can appear in your browser in 0.2 to 0.4 seconds amounts to screamingly fast in my book. Admittedly 1000 CPUs are involved and all the data’s in memory, but still, I challenge anyone to do better. So does Google.
But what if the data comes from disk? Well, Hadoop, which is an open source implementation of MapReduce, has been benchmarked using the MalStone B-10 Benchmark. It processed 10 billion records (just under a terabyte of data), applying a fairly simple algorithm, in 14 hours using nothing more than a 20-node cluster of 4-core CPUs. (Try this link malgen.googlecode.com/files/malstone-TR-09-01.pdf to get a copy of the benchmark.) Of course things go slower when you involve those pesky disks, but 14 hours is still fast.
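To put those benchmark figures in perspective, here is a rough back-of-the-envelope calculation of the throughput they imply. It uses only the numbers quoted above (10 billion records, 14 hours, 20 nodes, 4 cores per node); the variable names are mine, and it assumes every core was busy for the whole run, which the benchmark summary does not confirm.

```python
# Figures quoted in the text for the MalStone B-10 run on Hadoop.
RECORDS = 10_000_000_000   # 10 billion records
HOURS = 14                 # elapsed time
NODES = 20                 # cluster size
CORES_PER_NODE = 4         # 4-core CPUs

elapsed_seconds = HOURS * 3600            # 14 hours expressed in seconds
total_cores = NODES * CORES_PER_NODE      # cores available across the cluster

# Cluster-wide throughput in records per second.
records_per_second = RECORDS / elapsed_seconds

# Assumption: the work was spread evenly over every core for the whole run.
records_per_second_per_core = records_per_second / total_cores

print(f"Cluster:  {records_per_second:,.0f} records/s")
print(f"Per core: {records_per_second_per_core:,.0f} records/s")
```

That works out to roughly 200,000 records a second for the cluster as a whole, or about 2,500 records a second per core if the load really was spread evenly.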
Actually Hadoop may be able to do slightly better than that in terms of resource usage if not time, since it doesn’t yet properly exploit multi-core. However, there is another limitation with Hadoop MapReduce.
MapReduce is a mechanism for achieving massive parallelism by data partitioning (splitting the data up) and sharing the load accordingly. No one’s sure where the limits to its scalability are, but there’s no obvious bottleneck with thousands of CPUs. If you have a very big heap of similarly structured data, like a fact table in a large data warehouse, say, it’s playing on home territory. It’s good at what it does best.
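The partition-and-share idea can be sketched in a few lines of Python. This is a toy illustration only, not Hadoop's API: the function names, the tiny fact table and the use of a thread pool to stand in for cluster nodes are all my own assumptions. Each worker sums sales by region for its own slice of the data (the map step), and the partial results are then merged into one answer (the reduce step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, n_parts):
    """Split the records into n_parts slices by dealing them out round-robin."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_partition(records):
    """Map step: total the amounts per region within one partition only."""
    totals = Counter()
    for region, amount in records:
        totals[region] += amount
    return totals


def merge_totals(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def run_job(records, n_workers=4):
    """Partition the data, map each slice on its own worker, then reduce."""
    parts = partition(records, n_workers)
    # Hypothetical stand-in for cluster nodes: one thread per partition.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_partition, parts))
    return reduce(merge_totals, partials, Counter())


# A made-up miniature "fact table" of (region, sale amount) rows.
FACTS = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]

totals = run_job(FACTS, n_workers=2)
print(dict(totals))
```

The point to notice is that no worker needs to see any other worker's slice until the final merge, which is why adding more workers (or CPUs) keeps paying off when the records are all similarly structured.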
But if you don’t have that kind of problem, then Hadoop MapReduce is like playing golf with a single club.
So what’s the alternative?