Vertica: Revolution in the Database World?

As an analyst, I cut my teeth writing about database. In the 1990s the database market was hugely active with a mass of products; Adabas, Allbase,  Datacomm DB, DB2, Empress, IDMS,   Informix, Ingres, Interbase, Oracle, Progress, RDB, Sybase, Unify, and quite a few more. Object databases were introduced in 1993 or so, and then the object relational database emerged. Microsoft got into the game by licensing the Sybase code. Database benchmarking contests consumed column inches of IT news and then Very Large Database (VLDB) became a focus and Teradata, Red Brick and Tandem Non-stop SQL got air-time.

It was a very active market and then, all of a sudden, it wasn’t. Some products atrophied, companies ate other companies and eventually the marketing contest was decided. Three very big players: Oracle, IBM (DB2) and Microsoft (MS SQL Server) became dominant and most other products scratched out a living if they could. That’s what happens with lively software markets, eventually they settle down with a few large vendors sharing the spoils. The only development that disturbed the comfortable hegemony of the big three was the gradual success of MySQL, which was Open Source and optimized for serving web pages. It dominated a small market that soon became a big market.

Vertica and Verticality

If you put specialist database applications to one side, you can view the database world as consisting of 3 big markets:

  1. Commodity Database: For undemanding database applications pretty much any viable rdbms will do. May as well stick with Oracle, DB2 or SQL Server, whichever your company uses most.
  2. Large OLTP Databases: These are for very heavy transaction workloads. The big 3 are used here, but also some older products, IDMS and Datacomm for example, run big OLTP systems.
  3. Data Warehouse: These databases hold masses of data (in the terabyte area). Oracle, DB2 and SQL Server are used in such applications but so are more specialist products including Teradata, Green Plum, NeoView, Netezza and one or two others.

As far as Data Warehouse is concerned, there is now a new kid in town, that goes by the name of Vertica. The team that created Vertica is led by the man who can claim to be the most prolific database author in history, Prof Michael Stonebraker. The list of databases he has worked on  includes; Ingres, Postgres, Illustra, Cohera, StreamBase and now Vertica. When something new comes from the Stonebraker stable it is usually worth looking at – and Vertica is definitely worth looking at.

The simplest way to explain why Vertica is compelling is to describe the thinking behind it.

  • First of all, the vast majority of data warehouses are built using a simple snowflake schema. The important point about such a database is that the vast majority of the data (95% +) is held in a single “fact table”. The fact table will hold the normal transactions of the business. For banking that would be account transactions, for a telco it’s telephone calls, for a retailer its individual items purchased and so on. Nearly all the transactions run against the data warehosue will involve pulling data from the fact table.
  • Vertica is a database that focuses almost entirely on pulling data from very very big tables. It optimizes queries that do precisely that. It does so by organizing the data into columns and processing on a column by column basis. I could explain why this is dramatically faster than the way relational databases typically process database tables, but there is not the space here. Take my word for it, it flies like a bullet from a gun. Depending on circumstances, it can be 100 times faster.
  • There are products like those I mentioned (Teradata, NeoView, etc.) that do the data warehouse job faster than Oracle/DB2/SQL Server, but not 100 times faster. 100 times faster is “two orders of magnitude” and that is huge. It means that a query that previously took 3 hours, now takes about 100 seconds. For data warehouse that’s game changing.

Vertica Characteristics

In actual fact, there’s nothing that Vertica is doing that is new in terms of database techniques. It uses column processing (not new), it compresses columns (not new), it is highly parallel (not new), it is optimized for a particular workload (not new), it’s shared nothing (not new), it works on commodity hardware (not new), it is entirely self-tuning (not new). However, when you put all of this not-newness together you get data warehouse characteristics that are distinctly new. Here’s a list:

  1. Requires no DBAs
  2. Self scaling (to scale up, it kinda scales out)
  3. Always on (loading is concurrent with processing)
  4. Near real-time data (indeed so, if compared to anything else)
  5. Configurable to be fully fault tolerant
  6. Very much faster
  7. Considerably reduced cost
  8. Can work in the cloud

The last of these characteristics is doubly worthy of comment and it is the most exraordinary aspect of Vertica. I have to confess that, although I eventually expected data warehouse to be offered as a cloud-based capability, I never expected it any time soon. Vertica is partnering with Amazon to deliver it now – at a price of $2000 per month for half a terabyte of data.

If you know anything about the costs of a data warehouse then you will think I’ve misquoted the price, but I haven’t. You really can commence building a highly scalable always available data warehouse and have it up and running in a day or two (if you’ve got the data ready) for a few thousand dollars.

If that isn’t revolutionary, then what is?

  1. July 7th, 2008 at 08:55 | #1

    Hi, I work for Vertica, and here are a few comments on the comments…

    Regarding the “No DBA” claim…Vertica’s architecture greatly reduces the time DBAs need to spend getting it running and keeping it fast (via all sorts of automatic installation, tuning, physical design, replication, failover, recovery features as Robin describes). DBAs ARE still required (as LewisC mentions) to focus on data security, backup and integrating new data into the database. Vertica spares DBAs from the design, tuning and recovery tedium so they can focus on higher value added work.

    Regarding Niko’s orders of magnitude math…Good catch! There are examples of >2 orders of magnitude performance out there. One customer, NetworkIP (a US-based prepaid telecom services provider) described (in a webinar) a query that formerly took 12 hours to run and now takes 10 seconds…depending on the workload, customers will find these exceptional boosts in query performance (for a given query), but for the overall query workload (averaged across tens of different queries a given DBMS is handling), the query performance increase typically ranges from 50x to 200x relative to a row-oriented DBMS.

    Lastly, regarding Adi’s comment about whether a column DBMS performs as fast when processing more than just a few columns…A column database reads just the columns referenced by the query, which reduces disk IO and speeds up query handling (versus an index-less row db, which must scan the whole table). But if the query is referencing every column in the table (a rare query for most data warehouses), this benefit was negated in early column dbms implementations. The Vertica architecture has much more than column-orientation going for it…data compression (typically up to 90%, and Vertica can query data in its compressed form) further reduces disk IO, and as Robin mentioned in his comment, it’s shared-nothing architecture and automatic data replication spreads the query and loading work across multiple servers working in parallel. So even when reading a relatively large % of the table, Vertica performance stays fast.

    As LewisC mentioned, the Vertica Architecture white paper (found on the vertica home page) describes how all this works in detail.

    This is an interesting thread–thanks all.

Comment pages
1 2 1024
  1. July 23rd, 2008 at 07:57 | #1
  2. September 12th, 2008 at 13:42 | #2
  3. September 25th, 2008 at 14:35 | #3