Greenplum: Does Big Database Matter?
I remember when Teradata sounded like a very futuristic name for a database company (would any company ever store a whole terabyte?). Not any more. I went to the Apple Store last week and bought a 1 TB Time Capsule. I’ve got over half a terabyte of data spread across 3 computers, much of it duplicated, so a terabyte will be fine for a while. In a year or two, laptops that still have spinning disks will probably come with a terabyte of storage.
Moore’s Law had CPU power doubling every 18 months, which meant that computer power multiplied by a factor of roughly 10 every 6 years. There was a kind of parallel unnamed database law that marched in step with that, except that the factor was, remarkably, 1000 not 10. It went like this:
- In 1990 megabyte DBMS became manageable
- In 1996 gigabyte DBMS became manageable
- In 2002 terabyte DBMS became manageable
- In 2008 petabyte DBMS became manageable
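To put rough numbers on that comparison, here is a minimal sketch of the arithmetic. It assumes strict exponential growth, which neither rule of thumb really guarantees, and the function names are made up for illustration. Strict 18-month doubling works out to 16x over six years, the same order of magnitude as the "factor of 10" quoted above, while 1000x over six years implies a doubling period of only about seven months.

```python
import math

MONTHS_PER_STEP = 6 * 12  # each step in the list above spans six years


def growth_factor(doubling_months: float, months: float = MONTHS_PER_STEP) -> float:
    """Growth multiple after `months` if capacity doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


def implied_doubling_months(factor: float, months: float = MONTHS_PER_STEP) -> float:
    """Doubling period that would produce `factor` growth over `months`."""
    return months / math.log2(factor)


# Moore's Law as stated in the text: doubling every 18 months.
cpu_factor = growth_factor(18)

# The unnamed database law: 1000x every six years.
db_doubling = implied_doubling_months(1000)

print(f"CPU growth over 6 years at 18-month doubling: {cpu_factor:.0f}x")
print(f"Doubling period implied by 1000x in 6 years: {db_doubling:.1f} months")
```

In other words, for the database pattern to hold, manageable data size has to double more than twice as fast as CPU power does.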
Greenplum, which briefed me recently on its database product, fits this pattern and has been brought to market right on cue. Greenplum has been architected for the petabyte world and can load data at more than 4.5 terabytes per hour, which is neat given that it doesn’t just write the data to disk but actually organizes it to enable fast queries. Of course, it employs parallelism to do it, but if you’ve ever dug deep into database performance you know there’s a limited number of techniques you can use to tame “big data”, and Greenplum uses them all: parallelism indeed, but also bitmap indexes, sophisticated caching, statistical optimization and so on.
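For a sense of scale, here is a back-of-the-envelope sketch of what that load rate means for a full petabyte. It assumes decimal units (1 PB = 1000 TB) and a rate sustained for the whole load, neither of which is stated in the briefing, so treat the result as an illustration rather than a benchmark.

```python
TB_PER_PB = 1000             # assumption: decimal units, 1 PB = 1000 TB
LOAD_RATE_TB_PER_HOUR = 4.5  # load speed quoted above, treated as sustained


def load_hours(size_tb: float, rate_tb_per_hour: float = LOAD_RATE_TB_PER_HOUR) -> float:
    """Hours needed to load `size_tb` terabytes at a constant rate."""
    return size_tb / rate_tb_per_hour


hours = load_hours(TB_PER_PB)
days = hours / 24

print(f"Loading 1 PB at {LOAD_RATE_TB_PER_HOUR} TB/hour: {hours:.0f} hours (~{days:.1f} days)")
```

Since the quoted figure is a floor ("above 4.5 terabytes per hour"), a little over nine days is an upper bound under these assumptions.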
One of the reasons Greenplum sparks my interest is that it is based on PostgreSQL, a database that has carved out a unique position in the open source database world. If you weren’t aware, Postgres was given away free to the world by the University of California, Berkeley, and was the forerunner of Michael Stonebraker’s Illustra, which was acquired by Informix, which was in turn acquired by IBM. As relationally structured database engines go, it’s versatile. It was used by EnterpriseDB to create an enterprise-capable Oracle clone and it is used by Greenplum to build a massively parallel engine that can have analytical functions embedded close to the data. It’s a proven and dependable engine.
If you’re wondering which companies roll up such huge amounts of data that they need a petabyte scale engine, then think in terms of dot coms and telcos – because they do. That’s where many of the big databases grow nowadays, assuming you exclude the ones that are crammed with music, photos or video. And it also turns out that it’s worth analyzing the data (billions of calls and clicks in the main) rather than deleting it.
If you’re wondering which products Greenplum competes with, they are Teradata, Netezza, Neoview (from HP), DATAllegro and one or two others. The fact that there are so many vendors in this space tells you that the need for big BI databases is as real as it ever was. The petabyte databases are out there.

Parallelism on big data is definitely the trend. I actually think big players like Teradata and Neoview have antiquated architectures and the Greenplum model is the wave of the future.
DATAllegro, ParAccel, Vertica, and newcomer Aster Data are other scalable databases to watch. I wonder how Oracle will adapt to this brave new world…
To answer the Oracle question, it will surely buy one of the players. IBM too, I guess.