Why Should We Care About Parallel Processing?
Are you kidding me? What kind of question is that?
Why We Should Care
Moore’s Law used to work via the combination of 2 mechanisms:
- Increasing the clock rate of the chip
- Reducing the size of the circuit on the chip, thus allowing more transistors to be added to the chip
In about 2004/2005 the first mechanism blew. Chip speeds at the level of 4 to 5 gigahertz presented a severe barrier because the chips ran too hot (and consumed a good deal of electricity in the process). In fact they were too hot for mobile devices because they would suck a battery bone dry and make the device uncomfortably hot in the process. Chip engineers are still working at the edge of this barrier, trying to push it back a little, but nobody expects more than a further 20%-30% improvement in clock speed.
This means that, for the foreseeable future, chip improvement will come through further miniaturization. This in turn means that chip manufacturers (Intel, AMD, IBM, et al) will simply add more processor cores to the chips as the years go by. They have to do something to make previous chips obsolescent or else the market for chips will stagnate. That’s the way the biscuit breaks.
Multicore Multiplication
So the cores on the chip are multiplying like amoebas. Look, there’s one core. No, two! No, four! Jeeze, now there’s eight! … And the perception is that Moore’s Law continues to deliver its “twofor” deal as time goes by. Except there is a problem here. We already got burned once when we examined our data centers and discovered that we were running servers at 6%-utilization-and-points-south. In fact, that made VMware what it is today – a company that gets you back the CPU cycles you couldn’t be bothered to use.
With multicore we will get burned again unless we use the power that the multiple cores deliver. Of course we can get access to some of this CPU power via virtualization. But virtualization is not actually the best way to exploit the power. Parallel processing is. We need to write programs that use all the cores at once.
Nobody Taught Me to Write Parallel Programs
Most programming languages are designed to specify a serial flow of logic and, to be honest, as human beings we tend to think in serial logical flow. “Add the water to the flour, yeast and salt. Add sugar if you’re an American. Knead it into dough. Let it rest for a while. Divide it into loaves. Put it in the oven. Bake it.” And in the process of making bread you have to do the steps in that order. Do it in the wrong order and you won’t end up with bread.
Many flows of logic are like that, but quite a few are not. In programming, when we come across a repeatable process we “spin a loop.” For example, read through the people-table and find all people named “Jim.” Well we don’t have to iterate, we could just as easily process it in parallel. That’s a small example – a nano example if you like – but there are many large examples. Most of Business Intelligence, from data cleansing to data mining, is an iterative selection, followed by an iterative process, ending in a list of some kind (or even a graph of some kind). It has the potential to be very parallel.
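To make the “Jim” example concrete, here is a minimal sketch in Python of the same search done in parallel rather than in one loop. The table layout (a list of dicts with a "first_name" key) and the function names are assumptions made up for illustration. It also uses a thread pool purely to show the shape of the idea; in standard CPython, threads will not speed up a CPU-bound scan like this, and you would probably swap in a process pool or a parallel database engine to get real gains.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat


def chunks(rows, n):
    """Split rows into at most n contiguous slices of roughly equal size."""
    size = max(1, -(-len(rows) // n))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def match_chunk(rows, first_name):
    """The old serial loop, applied to just one slice of the table."""
    return [row for row in rows if row["first_name"] == first_name]


def find_people(rows, first_name, workers=4):
    """Scan each slice on its own worker, then stitch the results together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(match_chunk, chunks(rows, workers), repeat(first_name))
        # pool.map yields results in slice order, so table order is preserved.
        return [row for part in partials for row in part]


# Hypothetical people-table, for illustration only.
people = [
    {"id": 1, "first_name": "Jim"},
    {"id": 2, "first_name": "Ann"},
    {"id": 3, "first_name": "Jim"},
    {"id": 4, "first_name": "Raj"},
    {"id": 5, "first_name": "Sue"},
]
jims = find_people(people, "Jim")
```

The point is that no slice needs to know anything about any other slice, which is exactly what makes this kind of selection easy to spread across cores.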
So here’s what I’m expecting to happen…
The amoeba-like multiplication of cores will enable much faster BI through the use of parallel programming.
It has the potential to change the way that we architect systems, because right now BI and data integration are a kind of serial process: from ETL to Data Warehouse to Data Mart to BI App.
This should be a parallel process, not a serial one.















I agree, and you are right to emphasize parallel programming and not just parallelism.
First, the processor manufacturers have done little to dispel the widely-held misunderstanding that multi-core automatically brings the benefits of parallelism. It does for certain low-level operating system functions, but not for the procedural program logic of things like list processing or log parsing. For that sort of thing, new programming is required.
Second, Oracle, DB2 and other database systems are themselves “parallelized” such that indigenous operations occurring entirely within the database automatically gain performance benefits as a result. But, in areas like ETL, data warehousing and mining, data integration, and the like, the data sources and sinks are seldom of a single brand or in a single system. To get the benefits of parallelism in these contexts, new programming is required.
But parallel programming has in itself been a big challenge. Raw coding of parallel programs in standard languages is not easy. Elusive bugs, race conditions and other dire events are very hard to avoid. Alternatively, doing parallel programming with unfamiliar, specialized languages involves a steep learning curve and brings unique systems integration challenges.
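For anyone who has not met a race condition, here is a small sketch in Python of the classic case: several threads bumping a shared counter. The Counter class and the run helper are invented for illustration. The unprotected version is a read-modify-write that two threads can interleave, silently losing updates; whether it actually misbehaves on a given run depends on the interpreter version and on timing, which is precisely what makes such bugs elusive. The locked version is always correct.

```python
import threading


class Counter:
    """A shared counter with one unprotected and one lock-protected update."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read, add, write back with no protection: two threads may read the
        # same old value, and one of the two updates is then lost.
        self.value += 1

    def safe_increment(self):
        # The lock makes the read-modify-write a single indivisible step.
        with self._lock:
            self.value += 1


def run(increment, threads=8, per_thread=10_000):
    """Call increment() per_thread times from each of several threads."""
    def work():
        for _ in range(per_thread):
            increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()


counter = Counter()
run(counter.safe_increment)
# counter.value is now exactly threads * per_thread, on every run.
```

The fix is one line, but it only helps if you remember to apply it to every shared update in the program, and each lock you add gives back a little of the parallelism you were after.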
It seems that one of your “10 companies to watch in 2010”, Pervasive Software, is running this problem to ground with their DataRush product, which provides a highly “data-aware” parallel Java framework and run-time engine for high performance ETL, data mining and predictive analytics. Their benchmarks are smokin’ and it looks like it is pretty easy to program. Worth a look for anyone with a need for speed.
I agree, parallel processing has been hidden by the processor vendors up to now. But the real issues are coming – if the speed ratchet is to continue then the only way that it will happen is by increasing the number of cores on a die and possibly building 3D stacks too. That will have the effect of changing the environment in which programs run beyond recognition.
There are only two ways as far as I can see that parallelism* will actually work: either **everyone** learns to program in parallel, or there is some way of “hiding” the hardware from the user. A sort of “virtualisation” (of the virtual machine kind) but much more advanced and much more sophisticated. BUT while the former is unlikely to happen for any number of reasons, even if we go down the latter route, in order to be able to use multicores properly we are going to have to change our mindsets and learn to think in ways that are new.
Of course there will be tools to help, but that won’t obviate the need to change our approach to how we think about programming. Pervasive DataRush is a tool that has already shown what parallelism can do for you just by thinking about the program and restructuring from a conceptual point of view. It also goes a long way towards doing some virtualisation, although not really in the sense that I meant in the last para. The speed-ups that it has delivered already are remarkable (thousand-fold and better), but these are in part because of the nature of the problems that they have addressed.
I think that what you are looking for, Tim, is tools. They do exist, and to some extent DataRush is one such, but at present toolkits are only nibbling at the edge. DBs are naturally parallel to some degree or another; to be more precise, DB usage is largely parallel and the apps that sit on top can be replicated to run on multiple cores. A large class of other problems are more difficult to parallelise. The big issue comes with people’s assumption that “Oh those processor guys will give me a solution to handle parallelism”. WRONG. So far we have been looking at the problem for well over thirty years and no-one has come up with an approach that will lead to high quality code. It is one of the big problems in computing. There are people who claim to be able to unroll any loop and substitute parallel code, but even where they do, the code that they come up with is far from optimal in one sense or another.
Thinking differently requires different frameworks to work in, so I disagree about languages: I think that the provision of new languages may be a component of any true transition to parallelism.
I very much agree with you both about BI (writ broad). The interesting thing about parallel this time around is that last time, in say the 1980s, BI didn’t really exist on the scale that we have it now, far less predictive analytics and data mining. The DB companies are principally responsible for that.
Now here is a conundrum: could it be that the pressure from real commercial applications such as BI will this time round turn out to be the driver of tools and technologies for a parallel revolution? The market is clearly much larger than any meaningfully addressed by parallelism before.
*NOTE: Before anyone says “What about Cloud?”, this isn’t the same as cloud computing. This isn’t “the same sort” of parallelism. This is on-chip parallelism where we are talking on-chip access and on-chip comms between hundreds of cores. The issues are different.
Peter, I am with you in the belief that for parallel *computing* to deliver on its theoretical promises, we must all, to borrow an old Apple ad phrase, “think different”. And, as I love “nosebleed” programming languages as much as anybody – the first one I learned was APL and the second was LISP – I cannot disagree that different thinking must be accompanied by different programming notions, and I am very happy to lately see some renewed vigor in parallel programming research at Carnegie Mellon, IBM and elsewhere. There are many categories of computing challenges, including advanced mathematics, encryption, and others, that will require all you prescribe and more to benefit from parallelism.
You are also quite right to point out that I was indeed talking about *applied* parallelism and particularly in the realm of ETL, Data Mining and Predictive Analytics. These are problems where even an admittedly primordial tools-oriented approach like DataRush that gets good results should be considered a great leap forward for parallelism at large, for a few reasons. It gets a large number of commercial programmers, vendors, analysts, and journalists learning and thinking about parallel computing. It gets a large number of companies spending money on parallel computing solutions. And, it renews everybody’s resolve to address a broader range of applications and to seek more fundamental technical solutions, i.e. parallel languages and algorithms. If, in the process, Pervasive bloodies the noses of their larger, less innovative rivals, all the better for its bottom line and for the satisfaction of those of us who root for the underdog.