Friday, 9 December 2011

PostgreSQL - enums vs varchar

Work-related topic today. oVirt stores Java enum values as integers in the database. Java 1.5 enums are a great tool in the language, but if you accidentally reorder them, the ordinals change, which is tragic for the records persisted in the database.
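To make the reordering problem concrete, here is a minimal sketch. ColorV1 and ColorV2 are made-up names that stand for the same enum before and after somebody sorted the constants alphabetically; they are not taken from oVirt.

```java
// Hypothetical enum as it looked when the rows were written.
enum ColorV1 { RED, GREEN, BLUE }

// The same enum after an innocent-looking alphabetical reordering.
enum ColorV2 { BLUE, GREEN, RED }

class OrdinalDemo {
    public static void main(String[] args) {
        // The old code persisted RED as its ordinal.
        int stored = ColorV1.RED.ordinal();
        // The new code decodes the same integer with the reordered enum.
        ColorV2 decoded = ColorV2.values()[stored];
        System.out.println(stored + " now decodes to " + decoded);
    }
}
```

The stored integer did not change, but its meaning did: what was written as RED is read back as BLUE.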

One solution is to add a final integer field to your enum, set it from a private constructor, add a getter, and use it in your database persistence code. Ah yes, this is a lot of code to write and maintain... and basically we are just lazy, like anyone else. [Anyway, this solution is actually used by oVirt today.]
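Roughly, that pattern looks like the sketch below. VmStatus, its constants, the codes and the fromCode helper are all invented for illustration; this is not the actual oVirt code.

```java
// Hypothetical enum: the constants are deliberately not declared in code order,
// to show that the persisted code no longer depends on the ordinal.
enum VmStatus {
    PAUSED(2),
    DOWN(0),
    UP(1);

    // The value that goes into the database column.
    private final int code;

    private VmStatus(int code) {
        this.code = code;
    }

    public int getCode() {
        return code;
    }

    // Reverse lookup used when reading a row back.
    public static VmStatus fromCode(int code) {
        for (VmStatus status : values()) {
            if (status.code == code) {
                return status;
            }
        }
        throw new IllegalArgumentException("Unknown VmStatus code: " + code);
    }
}
```

The persistence code would write status.getCode() and read with VmStatus.fromCode(value), so reordering the constants is harmless as long as nobody touches the numbers.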

Another possible solution, supported for example by JPA, is to persist your enum as a varchar. That is great, since names are more human-friendly than numbers, but varchars are bigger than integers. It also makes it kind of difficult to rename anything in the enum, because the old names are already stored in the rows, so you may not want to do that very often.
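Without any framework, the name-based mapping boils down to name() and valueOf(); in JPA you would get the same effect by annotating the field with @Enumerated(EnumType.STRING). The sketch below uses plain Java only, and BirdType and NameMapping are hypothetical names.

```java
// Hypothetical enum for illustration.
enum BirdType { EAGLE, CUCKOO, SPARROW }

class NameMapping {
    // Value written to the varchar column.
    static String toDb(BirdType type) {
        return type.name();
    }

    // Value read back from the varchar column.
    // Throws IllegalArgumentException if the stored name no longer exists.
    static BirdType fromDb(String stored) {
        return BirdType.valueOf(stored);
    }
}
```

This also shows why renaming hurts: if a constant is renamed in the code, every row that still holds the old name makes fromDb throw until the data is migrated.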

So lots of guys came up with the question: why not database enums? Enum types are a rather database-specific solution. [Relational databases are full of vendor-specific solutions, but fortunately not as many as NoSQL databases; at least they somewhat agree on SQL syntax :-) ] PostgreSQL has an enum type. Enum is cool: a value takes only 4 bytes, it can be read and written as character data, and it can also save your application from invalid data, e.g. a 'rde' inserted accidentally as a color.
When extending an enum, you can add new values with 'ALTER TYPE ... ADD VALUE'; removing existing values is not supported. However, this command is a new feature in the latest and greatest [therefore not that widespread] version 9.1, and it is not available in PostgreSQL 9.0. [This could be a problem for oVirt, since it targets PostgreSQL 8.4]

Ok, enough talking, let's see the data.

Insert performance


As you can see, with enums you can win a nice 10-15% on performance, but this is an edge case. In this benchmark, I inserted 100,000 records in a single transaction, which is very rare in applications.
When using multiple transactions, the difference disappears, since the transaction overhead is significant.

Disk use



Sure, this is why we came to enums in the first place. There is no significant difference in the index size in this test; short strings and 4-byte enum values probably end up with the same index entry size after alignment padding. However, the smaller table storage is still great: the database server can cache more useful data in RAM, which results in faster responses if the database is big. I mean, bigger than your RAM. [Not sure if this is a significant advantage, oVirt does not need a huge database]

Update performance



Again, enum beats varchar by roughly 10 percent. [This is a relatively small win for oVirt, but e.g. updating statuses on Virtual Machines is a very frequent operation.]

Select

Select is another very frequent operation. As mentioned, enums take less space, so more data can fit in memory.

Let's see some query plans. The first one is for the varchar column, which has an index on it.

temp=# EXPLAIN ANALYZE select * from vi where birdtype = 'cuckoo';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on vi  (cost=7.44..22.59 rows=412 width=10) (actual time=0.032..0.032 rows=0 loops=1)
   Recheck Cond: ((birdtype)::text = 'cuckoo'::text)
   ->  Bitmap Index Scan on vi_birdtype_idx  (cost=0.00..7.34 rows=412 width=0) (actual time=0.029..0.029 rows=0 loops=1)
         Index Cond: ((birdtype)::text = 'cuckoo'::text)
 Total runtime: 1.790 ms
(5 rows)


And then the same with the enum column.




temp=# EXPLAIN ANALYZE select * from vi where birdtype = 'eagle';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on vi  (cost=707.02..1998.61 rows=19967 width=10) (actual time=2.465..7.674 rows=20000 loops=1)
   Recheck Cond: ((birdtype)::text = 'eagle'::text)
   ->  Bitmap Index Scan on vi_birdtype_idx  (cost=0.00..702.03 rows=19967 width=0) (actual time=2.378..2.378 rows=20000 loops=1)
         Index Cond: ((birdtype)::text = 'eagle'::text)
 Total runtime: 8.738 ms
(5 rows)



No significant difference, because the index size and efficiency are the same.

2 comments:

  1. Nice explanation! I'll use enum from now on.

  2. The google sheets links appear to be broken.
