Friday, 9 December 2011

PostgreSQL - enums vs varchar

Work-related topic today. oVirt stores Java enum values as integers in the database. Java 1.5 enums are great tools in the language, but if you accidentally re-order them, the ordinals change, which is tragic for the records persisted in the database.

One solution is to add a final integer field to your enum, set it from a private constructor, add a getter, and use it in your database persistence code. Ah yes, that is a lot of code to write and maintain... and basically we are lazy, just like anyone else. [Anyway, this is the solution oVirt actually uses today.]
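
A minimal sketch of that approach (the bird names come from the queries later in this post; the ids and the fromId helper are made up for illustration):

```java
public enum BirdType {
    // explicit ids go to the database; re-ordering the constants
    // no longer changes what gets persisted
    EAGLE(0),
    CUCKOO(1),
    SPARROW(2);

    private final int id;

    private BirdType(int id) {
        this.id = id;
    }

    public int getId() {
        return id;
    }

    // inverse lookup for reading rows back
    public static BirdType fromId(int id) {
        for (BirdType t : values()) {
            if (t.id == id) {
                return t;
            }
        }
        throw new IllegalArgumentException("unknown BirdType id: " + id);
    }
}
```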

Another possible solution, supported for example by JPA, is to persist your enum as varchar, which is great, but varchars are bigger than integers. It also makes renaming enum constants difficult; on the other hand, names are more human-friendly than numbers, so you probably do not want to rename them very often anyway.
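
Under the hood, JPA's @Enumerated(EnumType.STRING) boils down to writing name() and reading it back with valueOf() — a self-contained sketch (the BirdType enum and the helpers are illustrative, not oVirt code):

```java
public class EnumAsVarchar {
    enum BirdType { EAGLE, CUCKOO, SPARROW }

    // what gets written into the varchar column
    static String toColumn(BirdType t) {
        return t.name();
    }

    // reading the row back; throws IllegalArgumentException if the
    // constant was renamed in the meantime -- this is the fragile part
    static BirdType fromColumn(String s) {
        return BirdType.valueOf(s);
    }

    public static void main(String[] args) {
        System.out.println(toColumn(BirdType.CUCKOO)); // CUCKOO
        System.out.println(fromColumn("EAGLE"));       // EAGLE
    }
}
```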

So lots of guys came up with the question: why not use database enums? Enums are a very unique solution in PostgreSQL. [Relational databases are full of unique solutions, though fortunately not as much as NoSQL databases; at least they somewhat agree on SQL syntax :-) ] PostgreSQL's enum is cool: it takes only 4 bytes, it can be read and written as character data, and it can also save your application from invalid data, e.g. a 'rde' accidentally inserted as a color.
When extending an enum, you can add new elements with 'ALTER TYPE ... ADD VALUE'. However, this command is a new feature in the latest and greatest [and therefore not that widespread] version 9.1; it is not available in PostgreSQL 9.0. [This could be a problem for oVirt, since it targets PostgreSQL 8.4.]
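
For reference, this is roughly what creating and extending such a type looks like (the type and value names mirror the birdtype examples later in the post; the table and the 'owl' value are made up):

```sql
-- enum types exist since PostgreSQL 8.3
CREATE TYPE birdtype AS ENUM ('eagle', 'cuckoo', 'sparrow');

CREATE TABLE vi (
    id       integer,
    birdtype birdtype
);

-- 9.1 only: extend the type in place
ALTER TYPE birdtype ADD VALUE 'owl';
```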

Ok, enough talking, let's see the data.

Insert performance

As you can see, with enums you can win a nice 10-15% on performance, but this is an edge case: in this benchmark I inserted 100,000 records in a single transaction, which is very rare in real applications.
When using multiple transactions, the difference disappears, since the transaction overhead is significant.

Disk use

Sure, this is why we came to enums in the first place. There is no significant difference in index size, since the varchar values are not stored in the index. However, the smaller storage requirement is still great: the database server can cache more useful information in RAM, which results in faster responses if the database is big. I mean, bigger than your RAM. [Not sure if this is a significant advantage; oVirt does not need a huge database.]

Update performance

Again, enum beats varchar by roughly 10 percent. [This is a relatively small win for oVirt, but e.g. updating the status of virtual machines is a very frequent operation.]


Select performance

Select is another very frequent operation. As mentioned, enums take less space, so more data can fit in memory.

Let's see some query plans, the first one is for the varchar column that has an index on it.

temp=# EXPLAIN ANALYZE select * from vi where birdtype = 'cuckoo';
                                                        QUERY PLAN                                                        
 Bitmap Heap Scan on vi  (cost=7.44..22.59 rows=412 width=10) (actual time=0.032..0.032 rows=0 loops=1)
   Recheck Cond: ((birdtype)::text = 'cuckoo'::text)
   ->  Bitmap Index Scan on vi_birdtype_idx  (cost=0.00..7.34 rows=412 width=0) (actual time=0.029..0.029 rows=0 loops=1)
         Index Cond: ((birdtype)::text = 'cuckoo'::text)
 Total runtime: 1.790 ms
(5 rows)

And then the same with the enum column.

temp=# EXPLAIN ANALYZE select * from vi where birdtype = 'eagle';
                                                            QUERY PLAN                                                            
 Bitmap Heap Scan on vi  (cost=707.02..1998.61 rows=19967 width=10) (actual time=2.465..7.674 rows=20000 loops=1)
   Recheck Cond: ((birdtype)::text = 'eagle'::text)
   ->  Bitmap Index Scan on vi_birdtype_idx  (cost=0.00..702.03 rows=19967 width=0) (actual time=2.378..2.378 rows=20000 loops=1)
         Index Cond: ((birdtype)::text = 'eagle'::text)
 Total runtime: 8.738 ms
(5 rows)

No significant difference, because the index size and efficiency are the same.

Saturday, 3 December 2011

Cassandra on a pile of wreck

NoSQL databases traded ACID compliance for speed and scalability, and I was very interested in how good a deal that is for mere mortals like me.

I have been experimenting with Cassandra for a while. Version 1.0 was released recently (probably a month ago) and bugfix releases followed quickly, so the current release is now 1.0.5.

A special challenge is the server farm I have: no uniform servers. Actually, you could call them a pile of wreck; they would not even qualify as desktop-class, but that does not matter. There is a trick to balance the servers right.

  • dummywarhead - the first Cassandra node, the seed of the cluster, running Fedora 15. Its hardware: a 512 GB SATA hard disk, 2 GB DDR3 RAM and a dual-core AMD CPU.
  • maroon - the only branded server in my farm, an IBM xSeries 226 with a hyper-threaded 3 GHz Xeon processor, 1 GB DDR2 RAM and an old 250 GB SATA hard drive. Runs Ubuntu. (Yeah, I was lazy and just left it running Ubuntu; it does not really matter from a performance point of view.)
    I really like the IBM chassis, but the server generates terrible noise under load.
    As it turned out, I can expect 50% of dummywarhead's CPU performance from this server, and it has exactly half as much memory.
  • Switch: TP-LINK 5-port gigabit switch - this should not be a bottleneck in the system. Cat-6 cables, gigabit network interfaces installed in all of the nodes.
  • Load generator: Lenovo T520 Tankpad: 4 GB ram, 4 cores. (Red Hat work laptop)
My test is very simple: in each iteration I write 100,000 records to Cassandra, and then I read 100,000 random records from all the data I have inserted so far. I tried to avoid tricks in the setup, but I had to make a few in order to get nice results.
  • Since I do not do updates, the read-repair chance is set to zero.
  • I abandoned Cassandra's Thrift client after the first few tests and moved to Hector. Cassandra's client seemed unable to deal with dead connections, database topology discovery, etc. Hector has no problem with any of these.
    Cassandra's client also seems to have_a_problem with javaNamingConventions, which was a bit confusing.
  • It seems that Cassandra (up to 1.0.5) assumes that hosts are uniform, so when a new node joins the network, it balances the database for equal load automatically. In order to get a good result out of the test, I had to update the tokens before starting the load test. The logic is quite simple: you distribute the tokens between 0 and 2^127, and this determines the load on the servers. Since I assumed that dummywarhead can take twice as much load as maroon, I generated token 56713727820156410577229101238628035242 for dummywarhead and 113427455640312821154458202477256070484 for maroon. This puts 66% of the load on the first node and 33% on the second.
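
The token arithmetic from the last point can be sketched like this (the weighting helper is mine, not a Cassandra API; a node owns the ring slice ending at its token, so weights 2:1 reproduce the 66%/33% split — the first token comes out exactly as the dummywarhead token above, the second within one of the maroon token due to rounding):

```java
import java.math.BigInteger;
import java.util.Arrays;

public class TokenCalc {

    // RandomPartitioner token space: 0 .. 2^127
    static final BigInteger RING = BigInteger.ONE.shiftLeft(127);

    // weights: relative capacity of each node, in ring order.
    // Node i owns the slice of the ring that ends at its token,
    // so it ends up with weights[i] / sum(weights) of the data.
    static BigInteger[] tokensForWeights(int[] weights) {
        int total = Arrays.stream(weights).sum();
        BigInteger[] tokens = new BigInteger[weights.length];
        // start the cumulative sum so that node 0 owns the slice
        // wrapping around token 0
        int cum = total - weights[0];
        for (int i = 0; i < weights.length; i++) {
            if (i > 0) {
                cum += weights[i];
            }
            tokens[i] = RING.multiply(BigInteger.valueOf(cum))
                            .divide(BigInteger.valueOf(total))
                            .mod(RING);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // dummywarhead is assumed to be twice as strong as maroon
        for (BigInteger t : tokensForWeights(new int[] {2, 1})) {
            System.out.println(t);
        }
    }
}
```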

On this chart you see a single-node test. The first thing you notice is that the read times (red) get a little slower until about halfway through the test, then go unstable, and by the end of the test they are 10 times slower than at the halfway point. This is because of the amount of memory in the test server: when it ran out of memory, it had to fetch data from disk more and more often in each round.
The second interesting thing here is that the writes are so nice and stable.

For the second test I let Cassandra balance the database the way she wanted: I started up both dummywarhead and maroon, and started the test after maroon had successfully joined the ring.
As you can see, in this case the collapse of read operations happened much later and did not reach the 200-second limit that the previous single-node test did. However, it is still far from being half the single-node response time...
While watching the load on the servers, maroon soon started to do heavy disk reads, since it ran out of RAM. The other server, dummywarhead, did not do disk reads at all until the end of the test.
I made the first two charts easy to compare. The third chart still belongs to the second experiment: I could not resist making some modifications, so I used the nodetool move command to re-align the load according to the real capabilities of the servers. So this is the same test, continued.

What you see here is that the response time went crazy for a while when I started to re-align the cluster: a huge peak. Then, when the move operation finished, the response time dropped again. The sharp-eyed may notice that the write operations also got about 20% faster after the re-alignment.

So far so good... but.

Joining nodes to the cluster under load takes a lot of time. Really a lot, even with small databases. I am sure adding one more server to a 100-node cluster is no problem, but adding a second node to a single-node cluster really hurts. And then it rebalances the cluster to the (bad) defaults, which drops performance. So at the moment, running Cassandra on non-uniform nodes seems to be difficult. Plus, scaling out did not immediately make things better; it dropped performance for a while.
The Cassandra cookbook says that the move operation actually removes the node and re-adds it to the cluster with the new token. This may have changed since Cassandra 0.5; as far as I can tell, it did not remove the node, it only put it into the 'Moving' state for a while, so it is not that heavy an operation.

One can play a lot with this software, it has some interesting concepts.