Saturday 3 December 2011

Cassandra on a pile of wreck

NoSQL databases trade ACID compliance for speed and scalability, and I was very interested in how good a deal that is for mere mortals like me.

I have been experimenting with Cassandra for a while. Version 1.0 was released recently (about a month ago) and bugfix releases followed quickly, so the current release is now 1.0.5.

A special challenge is the server farm I have: the servers are not uniform. Actually, you could call them a pile of wrecks; they would not even qualify as desktop-class, but that does not matter. There is a trick to balancing the servers right.

  • dummywarhead - the first Cassandra node and the seed of the cluster, running Fedora 15. Its hardware: a 512 GB SATA hard disk, 2 GB of DDR3 RAM and a dual-core AMD CPU.
  • maroon - the only branded server in the farm, an IBM xSeries 226 with a hyper-threaded 3 GHz Xeon processor, 1 GB of DDR2 RAM and an old 250 GB SATA hard drive. It runs Ubuntu. (Yeah, I was lazy and just left Ubuntu on it; it does not really matter from a performance point of view.)
    I really like the IBM chassis, but the server generates terrible noise under load.
    As it turned out, I can expect about 50% of dummywarhead's CPU performance from this server, and it has exactly half as much memory.
  • Switch: a TP-LINK 5-port gigabit switch - this should not be a bottleneck in the system. Cat 6 cables and gigabit network interfaces in all of the nodes.
  • Load generator: a Lenovo T520 Tankpad with 4 GB of RAM and 4 cores (my Red Hat work laptop).
My test is very simple: in each iteration I write 100,000 records to Cassandra, and then I read 100,000 random records from all the data I have inserted so far (a sketch of the client loop is below, after the list). I tried to avoid tricks with the setup, but I had to use a few in order to get decent results.
  • Since I do not do updates, the read repair chance is set to zero.
  • I abandoned Cassandra's Thrift client after the first few tests and moved to Hector. Cassandra's client seemed unable to deal with dead connections, cluster topology discovery and so on; Hector has no problem with any of these.
    Cassandra's client also seems to have_a_problem with javaNamingConventions, which was a bit confusing.
  • It seems that Cassandra (up to 1.0.5) assumes that the hosts are uniform, so when a new node joins the ring, it automatically balances the database for equal load. In order to get a good result out of the test, I had to update the tokens before starting the load test. The logic is quite simple: you distribute the node tokens between 0 and 2^127, and that determines how the keys, and thus the load, are split between the servers. Since I assumed that dummywarhead can take twice as much load as maroon, I generated token 56713727820156410577229101238628035242 for dummywarhead and 113427455640312821154458202477256070484 for maroon. This puts 66% of the load on the first node and 33% on the second (the arithmetic is sketched below).
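
For reference, here is a minimal sketch of that token arithmetic. The 2:1 capacity ratio is my own estimate of the hardware, not something Cassandra measures:

    import java.math.BigInteger;

    // Token arithmetic sketch: with the RandomPartitioner the ring covers
    // 0..2^127, and a node owns the key range from the previous node's token
    // (exclusive) up to its own token (inclusive), wrapping around the ring.
    public class WeightedTokens {
        public static void main(String[] args) {
            BigInteger ring = BigInteger.valueOf(2).pow(127);

            // boundaries at one third and two thirds of the ring
            BigInteger oneThird = ring.divide(BigInteger.valueOf(3));
            BigInteger twoThirds = oneThird.multiply(BigInteger.valueOf(2));

            // dummywarhead gets the token at 1/3: its range wraps around from
            // maroon's token (2/3) back to 1/3, i.e. two thirds of the keys
            System.out.println("dummywarhead: " + oneThird);
            // maroon gets the token at 2/3: its range is (1/3, 2/3], one third
            System.out.println("maroon:       " + twoThirds);
        }
    }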

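The client loop itself looked roughly like this. It is a simplified Hector sketch, not the exact test code; the keyspace, column family and column names are made up for illustration, and the timing and reporting are left out:

    import java.util.Random;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.HColumn;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;
    import me.prettyprint.hector.api.query.ColumnQuery;

    // Simplified sketch of one test iteration: write 100,000 rows, then read
    // 100,000 random rows out of everything written so far. The keyspace,
    // column family and column names here are illustrative.
    public class LoadTest {

        private static final int BATCH = 100000;

        public static void main(String[] args) {
            StringSerializer ss = StringSerializer.get();
            Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "dummywarhead:9160");
            Keyspace keyspace = HFactory.createKeyspace("LoadTest", cluster);
            Mutator<String> mutator = HFactory.createMutator(keyspace, ss);
            Random random = new Random();

            long totalRows = 0;
            for (int iteration = 0; iteration < 100; iteration++) {
                // write phase: insert BATCH new rows, one by one
                for (int i = 0; i < BATCH; i++) {
                    mutator.insert("row-" + (totalRows + i), "Records",
                            HFactory.createStringColumn("payload", "some test data"));
                }
                totalRows += BATCH;

                // read phase: fetch BATCH random rows from all data written so far
                for (int i = 0; i < BATCH; i++) {
                    String key = "row-" + (long) (random.nextDouble() * totalRows);
                    ColumnQuery<String, String, String> query =
                            HFactory.createColumnQuery(keyspace, ss, ss, ss);
                    query.setColumnFamily("Records").setKey(key).setName("payload");
                    HColumn<String, String> column = query.execute().get();
                    // response times were recorded here; omitted from the sketch
                }
            }
        }
    }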

On this chart you see a single-node test. The first thing you notice is that the read times (red) get a little slower up to the halfway point of the test, then go unstable, and by the end of the test they are 10 times slower than at the halfway point. This is because of the amount of memory in the test server: once it ran out of memory, it had to go to disk more and more often in each round.
The second interesting thing here is that the writes are so nice and stable.


In this second test I let Cassandra balance the database the way she wanted. I started up both dummywarhead and maroon, and started the test after maroon had successfully joined the ring.
As you can see, in this case the collapse of read operations happened much later and did not reach the 200-second mark that the previous single-node test did. However, the response times are still far from being half as long...
While watching the load on the servers, I saw that maroon soon started doing heavy disk reads, since it had run out of RAM. The other server, dummywarhead, did not have to read from disk at all until the end of the test.
I made the first two charts so that it is easy to compare the two. The third chart still belongs to the second experiment: I could not resist making some modifications, and I used the nodetool move command to re-align the load according to the real capabilities of the servers. So this is the same test, continued.


What you see here is that the response times got crazy long for a while when I started to re-align the cluster: a huge peak. Then, when the move operations finished, the response times dropped again. The sharp-eyed may notice that the write operations also got about 20% faster after the re-alignment.

So far so good... but.

Joining nodes to the cluster under load takes a lot of time. Really a lot, even with small databases. I am sure adding one more server to a 100-node cluster is no problem, but adding a second node to a single-node cluster really hurts. And then it rebalances the cluster to the (bad) defaults, which drops performance. So at the moment, running Cassandra on non-uniform nodes seems to be difficult. Plus, scaling out did not immediately make things better; it dropped performance for a while.
The Cassandra cookbook says that the move operation actually removes the node and re-adds it to the cluster with the new token. This may have changed since Cassandra 0.5, because it is not that heavy an operation; as far as I can tell, it did not remove the node, it only put it in the 'Moving' state for a while.

One can play a lot with this software; it has some interesting concepts.
