The Why and How of Parallel Programming
Why should we care about Parallel Programming?
That’s an easy one: 2009 saw the introduction of the 8-way chips and 2010 saw their proliferation. Soon enough, 16-way chips will surely follow on servers, although on the desktop Intel will be putting graphics cores on the chip rather than upping the number of standard processor cores.
And all of that is fine and dandy as long as you can use the power that Intel and its competitors are delivering. This year, for example, you can buy yourself a fairly cheap server with 32 cores or even 48 cores. But can you use the cores?
Unless you are running parallel software, the answer is going to be; “err, a little, maybe.”
Sure, you can configure some virtual machines on those cores and that will at least make them execute instructions, but virtualization has its limits. First of all, for every virtual partition you have yet-another-instance-of-an-OS which is inefficient replication/duplication no matter how you try to excuse it. Secondly, the more virtual instances of anything you run on a server the more complicated it becomes to try and recover when the server fails, or even just one part of it fails. And you also have the problem that you need to leave headroom for every virtual partition, to allow for peaks in traffic.
I’m not coming down on virtualization technology here, it’s useful, powerful and necessary. But it has limits.
The reason we should care about parallel processing is that all the chips are multicore now and if we can’t write programs that use them then:
- We’re wasting money
- The chip industry is going in the wrong direction
How To Program in Parallel: Hadoop
It’s a long story, so much so that I wrote a white paper (sponsored by Pervasive Software) on the topic. If you want a copy you can get one here. Pervasive are part of the solution, of course, which is why they sponsored the paper. Amazingly, there are only two (!) primary alternatives for parallel programming:
- Hadoop
- Pervasive’s DataRush
Let’s start by describing Hadoop. Hadoop gets its name from the beloved stuffed elephant of Doug Cutting’s daughter – Doug Cutting being the originator of Hadoop. Technically, Hadoop is a framework for developing applications using Google’s MapReduce technique for dealing with large heaps of data at speed. Think Hadoop, think petabytes.
Hadoop is designed to exploit hundreds or even thousands of servers for a single application. It consists of a fully redundant files system over which sits a fully redundant version of MapReduce, constructed so that if any server fails the program will complete. If you’re wondering why you’d care about a server failing, that’s simple arithmetic. If you have 4 hundred servers with a Mean Time Between Failure (MTBF) of one year, then you’ll get more than one server failure per day, on average. If you spread a program over a large grid of servers, especially one that might run for hours, you need full fault tolerance.
This makes Hadoop a little slower than it otherwise might be, but as it can utilize the power of hundreds of servers, that doesn’t matter too much when it comes to speed. With MapReduce, you’re constrained by the way the Map and Reduce framework operates to follow a specific logical approach, but the approach works fine for most things that involve processing large volumes of similar information – especially log files or click streams.
Hadoop is Open Source. It is a project under the auspices of the Apache Software Foundation.
How To Program in Parallel: DataRush
DataRush and Hadoop are, in the main, complementary rather than competitive. You use Hadoop over large server populations, whereas DataRush is built primarily for the single server with multiple cores – its sweet spot being servers with multiple chips – for example, 4 cpus each with 8 cores. It can scale beyond the single node, so at some point, it overlaps with Hadoop usage. However in most circumstances it’s easy to decide which product is more appropriate.
As indicated in the diagram there are two approaches that can be taken to parallelizing an application:
Data partitioning: This is where you simply split up the data to be processed across the available cpu cores. As long as you get the balance right, with 4 cores it will execute about 4 times as fast, plus you have to allow time for segmenting the data and merging it afterward.
Process partitioning: This is where you split up the process between multiple cores and pass the data output from one process partition to the next process partition. Again, using this approach, with 4 cores it will execute about 4 times as fast.
Hadoop is based on the first of these strategies whereas DataRush can employ either. DataRush provides a framework for planning the workflow according to the available resources. Aside from the fact that it can make applications run at blistering speed, experience has shown that programmers can adjust to this style of programming quickly. They require just a week or two training to become productive. Within the DataRush framework, programming is done in Java.
The surprising fact is that, right now, these are the only two choices available. More alternatives will inevitably emerge.















great article!!! I am in the process of learning about Hadoop! So this piece is very timely. I do agree with your opening paragraphs on the fact that parallel processing means very little if programs are not developed to use it!!
I have looked at Pervasive in the past, and I recently came across another vendor, Zircon Computing, http://www.zircomp.com, at the useR conference in Maryland last week. They appear to offer another framework for parallel programming based on C++ (unlike Pervasive, which is based on Java).
I’m a bit confused by your assertion: “there are only two (!) primary alternatives for parallel programming.” Ada, Erlang, Haskell, and High Performance Fortran have offered easy shared memory and distributed memory parallelism for over 15 years. Libraries like POSIX Threads (pthreads), MPI, SHMEM, and PVM extended parallel processing support to mainstream languages like C, C++ and Java for over 10 years. Java also has their native Java Parallel Processing Framework. Newer languages like Python and R can use NetWorkSpaces. Microsoft introduced native parallelism in their .NET framework this year. There are even languages (e.g., CUDA and OpenCL) that extend parallel processing across multiple processor architectures simultaneously so programmers can leverage embedded processors and GPUs in video cards.
Is your topic limited to large dataset problems using a shared-nothing parallel model implemented using a Map-Reduce technique? If so, then Google MapReduce, Greenplum, GridGain, Hive, Holumbus, Mars, Misco, Plasma MapReduce, Qizmt, Sector/Sphere, and Twister are suitable comparisons.
I have to agree with Daniel, there are FAR more than two ways of parallel programming. There are other approaches to so-called shared-nothing models. Indeed shared-nothing programming models have been around for at least thirty years as have shared-nothing hardware systems.
However Hadoop and Datarush offer models which may, at least for the time being, be an acceptable way of programming of large, multi-core systems for a certain (albeit fairly large and well-heeled) sector of the market. Pervasive’s Datarush doesn’t presuppose much about the underlying architecture at all, just that it can run Java. Hadoop is slightly different but makes little assumption about the underlying hardware. Both are in fact highly viable approaches to the issue of building high throughput, data intensive, commercial applications.
What they aren’t is true programming systems.
The issue with most of the technologies cited by Denis is that they are largely for specialists. They are very popular in a variety of incarnations in areas such as HPC (where we are talking about specialised areas such as automotive and aeronautic engineering, big science and similar areas); where they have largely failed to penetrate is mainstream applications. It is because of the expediencies of the hardware industry that processor architectures have been dominated by shared memory architectures. However most manufacturers are looking at shared-memory and some have developed or acquired substantial expertise in that direction.
For technical reasons it seems unlikely that the shared memory scenario can persist and that ultimately shared-nothing will come to dominate. When that happens environments like the Datarush libraries and Hadoop will still be useful but there will be a need for technologies that will bring current technologies and their derivatives, most likely the latter, to the general purpose user.