A few days ago I asked a few colleagues at work whether, given the choice, they would pick JSON or XML as the format for a REST API. Everyone responded in favor of JSON. What is surprising is that they were all Java developers, not a single JavaScript guy among them. Maybe it is too early to say this, but it seems that after the initial success of XML in the previous decade and its stagnation in the last few years, XML as a format is slowly declining, while more compact, simple, schemaless formats are on the rise: JSON and YAML.
JSON and YAML implementations are just as simple as their formats; there are plenty to choose from.
Question: How do JSON parsers compare to XML parsers from a performance perspective?
I wrote sample inputs in XML and JSON, with basically the same data. I made three input files: the first is 'traditional' XML with no attributes at all, the second is XML using some attributes, and the third is the JSON version. On the XML side I used only the APIs packaged into the JDK. This is a bit unfair because the alternative XML APIs may be a little faster than the default, but a test I made a few years ago showed they were not so different, so I was more interested in JSON. For JSON, I used 5 different parsers: json.org, jackson, jsonlib, jsonj and gson.
Even the slowest JSON parser is faster than the JDK's XML parsers, and the format is more compact too; it looks like a win. Jackson is much faster than the rest. Its website is right: it is very fast.
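For a feel of why Jackson is so pleasant to benchmark with, here is a minimal sketch of its tree-model API (shown with Jackson 2 package names; the 1.x API of the time lived under org.codehaus.jackson). The document and field names are made up for illustration:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonDemo {
    public static void main(String[] args) throws Exception {
        // A tiny made-up JSON document, just for illustration.
        String json = "{\"title\":\"test\",\"count\":2}";
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);          // parse into a tree model
        System.out.println(root.get("title").asText()); // prints "test"
        System.out.println(root.get("count").asInt());  // prints 2
    }
}
```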
Test source code: here.
Sunday, 6 November 2011
Saturday, 29 October 2011
turning simple into difficult: date formats
SimpleDateFormat is a nice little class that helps programmers turn a date into a string and vice versa. It is very simple to use and has been packaged with standard Java since 1.2, probably... which was a very long time ago. And of course we use it a lot.
On the other hand, SimpleDateFormat is kind of a pain in the back because it is not thread-safe and is never going to be. I have always wondered why; it seems like a terribly wrong design decision, and it makes things a bit difficult. Most of the time you just want to create a final static SimpleDateFormat field with your preferred format. If you do this, your code is probably already broken: under high load it will produce stupid results and you will not know why (unless you use PMD with the proper rules). If your class needs to be thread-safe (most of them do), the best thing to do is to create a new instance of SimpleDateFormat whenever you need it, as a local variable in a method.
What is painful about this? Its performance sucks.
This is what everyone knows. Let's see how to deal with the problem:
- Workaround :-) You can have a final static SimpleDateFormat object and, any time you want to use it, clone it and use the clone to do the job. Funny, right? This is a theoretical solution. Do not use it! The Java culture chose a completely different approach, and year after year thousands of Java developers get a heart attack after finding a final static SimpleDateFormat field in the code.
- Commons Lang has been around for a good while as well. Life would be hell without the Apache Commons projects. Commons Lang provides some support (FastDateFormat) to deal with date formatting. Unfortunately it does not support parsing, only formatting :-( but even that is more than nothing.
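For the record, the clone workaround looks roughly like this (the pattern and class name are just examples):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CloneWorkaround {
    // Shared template; never used directly, because SimpleDateFormat is not thread-safe.
    private static final SimpleDateFormat TEMPLATE = new SimpleDateFormat("yyyy-MM-dd");

    public static String format(Date date) {
        // Each caller works on its own private copy of the format.
        SimpleDateFormat copy = (SimpleDateFormat) TEMPLATE.clone();
        return copy.format(date);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date())); // today's date, e.g. 2011-10-29
    }
}
```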
Now of course the choice depends on your taste, but I thought I will help you with a short performance benchmark on the topic:
new SimpleDateFormat 100000 times : 170 ms
SimpleDateFormat clone 100000 times : 68 ms
DateFormatUtils 100000 times : 62 ms
FastDateFormat 100000 times : 55 ms
Update: the two lines below are based on Zsombor's ThreadLocal idea. Wow, this must be some mistake!
SimpleDateFormat from pre-created ThreadLocal 100000 times : 29 ms
SimpleDateFormat in ThreadLocal check each time 100000 times : 28 ms
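The ThreadLocal trick from the fastest lines above can be sketched like this (Java 6 style, pattern just an example): each thread lazily creates its own private SimpleDateFormat once and then reuses it, so there is no sharing and no locking.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class ThreadLocalFormat {
    // One SimpleDateFormat per thread: created once, reused on every call.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd");
                }
            };

    public static String format(Date date) {
        return FORMAT.get().format(date); // safe: each thread has its own instance
    }
}
```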
For me it is surprising again and again how badly SimpleDateFormat performs when you create it over and over. More than half of the processing time goes into just creating the SimpleDateFormat object, an enormous waste of resources. Once it is created, it is very quick.
Another funny link for jarvana to find out how many packages have a FastDateFormat. I can't decide if it is tragic or funny...
Test code: here.
java version: sun jdk 1.6.22
OS: linux (fedora 15)
Hardware: Lenovo T520 with Intel core i7 vpro (thank you Red Hat :) )
Friday, 12 August 2011
...and...
...and I am back.
I moved to Brno, Czech Republic and now I am working for Red Hat.
Since I am not good at Czech and most people in Brno are not good at English, it took a few weeks, but now I have an internet connection at home and I will continue my performance checks soon. Recently I have been playing with Cassandra and Infinispan. Cool stuff :-)
Wednesday, 15 June 2011
Serialization speed
Serialization is probably the simplest way to send data to an output stream. The three most typical use cases:
- Session replication in servlet containers
- A simple way to save information to disk
- RMI uses it heavily
Of course session replication is kind of important and it would be nice if it were quick. Therefore, I was very interested in the cost of serialization in general, with small data structures and big ones alike.
Serialization is smart: it does not serialize the same object twice, it just sends a reference to an already-serialized object. This is the mechanism that allows deserialization to reconstruct the same object graph on the other side of the stream. The small drawback of this smart behavior is that if you send lots of data through the ObjectOutputStream, you should call reset() from time to time, otherwise the ObjectOutputStream will hold a reference to every object you have serialized and sooner or later you get an OOM. The reset() method clears the ObjectOutputStream's object/ID tracking, and subsequent calls to writeObject will actually serialize the data again. So in my tests this method is called after each serialization. Well, there are a couple of ways you can shoot yourself in the foot with serialization :-)
And for the IO stream, I implemented something similar to commons-io's NullOutputStream, since it is the serialization speed I want to measure, not the IO speed. Here comes the google eye-candy:
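Before the numbers: the measurement setup described above, reset() after each write plus a do-nothing output stream, can be sketched like this (a simplified stand-in for the actual test code; the data and counts are illustrative):

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

public class SerializationDemo {
    // Counts but discards everything written, so only serialization cost is measured.
    static class NullOutputStream extends OutputStream {
        long bytesWritten;
        @Override public void write(int b) { bytesWritten++; }
        @Override public void write(byte[] b, int off, int len) { bytesWritten += len; }
    }

    // Serializes the same list 'times' times, resetting the stream after each write.
    static long serialize(List<String> data, int times) throws IOException {
        NullOutputStream sink = new NullOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(sink);
        for (int i = 0; i < times; i++) {
            out.writeObject(data);
            // Without reset() the stream keeps a reference to every object written
            // so far, and a long-running sender eventually runs out of memory.
            out.reset();
        }
        out.close();
        return sink.bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        List<String> data = new ArrayList<String>();
        for (int i = 0; i < 1000; i++) data.add("item-" + i);
        System.out.println(serialize(data, 100) + " bytes serialized");
    }
}
```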
So... what this tells me is that when you have problems with session replication, your problem is not serialization; it is more likely the IO speed. While serialization produced 600 MB in just 4 seconds, a gigabit network with a TCP/IP connection can transport roughly 60-70 MB/sec, so it will need at least 10 seconds to transport that data.
If a component is slow, use it sparingly :-)
Java version: 64 bit Sun/Oracle JDK 1.6
OS: Linux (fedora)
Thursday, 19 May 2011
The distribution of java.lang.String hash values
I believe most programs use String keys for HashMaps, because Strings are:
- immutable - mutable objects can do tricks in HashMaps, and that is not funny when you spend a night debugging them
- final - you cannot override it with some smart code
- and of course Strings represent most of the data stored in business apps (names, filesystem paths, etc.)
So, my question is... how do plain English Wikipedia article titles play with the String hash algorithm? Is it random like white noise, or can we see some interesting hills and valleys?
Also, what output do other languages generate? E.g. Hungarian has some special extra characters compared to English, and Chinese writing has some 47,000 extra characters.
Let's see: a Java int is a signed 32-bit value. The minimum value is -2147483648, the maximum is 2147483647, and there are 8472583 articles in the latest English Wikipedia dump. Therefore there are roughly 506 times as many possible hash values as Wikipedia articles, at least for English.
I am not going to create 2^32 counters, one for each possible hash; I will just take hash/1000000 and put the counts in a HashMap<Integer, Integer>. (Why not, Integer is also fine as a hash key, and I love the simple implementation of Integer.hashCode.) So on average there should be a little under two thousand in each counter, if the distribution is even.
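The bucketing described above can be sketched like this (a simplified stand-in; the real run iterated over the Wikipedia dump titles rather than a hard-coded array):

```java
import java.util.HashMap;
import java.util.Map;

public class HashBuckets {
    // Buckets String hash values by millions and counts how many fall in each bucket.
    static Map<Integer, Integer> countBuckets(String[] titles) {
        Map<Integer, Integer> counters = new HashMap<Integer, Integer>();
        for (String title : titles) {
            int bucket = title.hashCode() / 1000000;
            Integer old = counters.get(bucket);
            counters.put(bucket, old == null ? 1 : old + 1);
        }
        return counters;
    }

    public static void main(String[] args) {
        // Stand-in titles, just for illustration.
        String[] titles = {"Java", "Hash", "Wikipedia"};
        System.out.println(countBuckets(titles));
    }
}
```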
Sorry for the static chart instead of an interactive one; the interactive version would seriously kill your browser... So here it comes.
[Figure: The distribution of English Wikipedia title hash values]
[Figure: The distribution of Hungarian Wikipedia title hash values]
[Figure: The distribution of Chinese Wikipedia title hash values]
Now, there are some similarities. It looks more like a long-tail model than an even distribution: there is a huge number of hash values between zero and a million, but most of the hash values are elsewhere.
There are some peaks that the Hungarian and English wikipedias share, and they both have some unique peaks. The Chinese one does not have these peaks.
So my conclusion is that the distribution of hash keys somewhat depends on what you use them for, but it is not bad anyway.
I have always wanted to know whether it is really worth it for String to cache the hash value in the 'hash' field, or whether it would make more sense to at least make it transient.
Update for my question above: serialization handles String objects specially; they are not serialized like other beans, therefore the hash field is not included. The hash value caching is still an interesting question; I will look into it.
Saturday, 14 May 2011
Volatile
Volatile is a rarely used keyword in Java. If you have never used it, don't worry, you are almost certainly right! All it does is force the program to read the value from main memory rather than a cache, so you can be sure you get a fresh value, at least at the moment you read it. It performs somewhat better than a synchronized block since it does not lock. However, you run into trouble if you also want to write the value back, because even a ++ operation is not atomic: it is a read, a calculation and a write, so the probability of a wrong result is high.
I was interested in the following questions:
- How much slower is the access to volatile fields?
- What is the cost of synchronization?
- What is the probability of the bad result coming from wrong multi-threading code?
- How does the synchronization compare to modern alternatives like STM?
So I wrote a very simple test code to answer all four questions, a Counter interface with 4 implementations:
- Simple counter
- Synchronized counter
- Counter with volatile field
- STM counter with Multiverse
The test code starts a number of threads and shares a single counter among them. Each thread calls hit() on the counter exactly 1 million times and then terminates. The test waits for all the threads to finish and then checks the Counter: it should be exactly a million times the number of threads. And of course, we have two BAD implementations here (numbers 1 and 3), where I only wanted to know how wrong they are.
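A sketch of what such a Counter and two of its implementations might look like (names are illustrative, not the actual test code; the broken volatile variant and the correct synchronized one are shown):

```java
public class Counters {
    interface Counter { void hit(); long get(); }

    // Broken under concurrency: ++ is read-modify-write, volatile does not make it atomic.
    static class VolatileCounter implements Counter {
        private volatile long count;
        public void hit() { count++; }
        public long get() { return count; }
    }

    // Correct: every access goes through a lock on this instance.
    static class SynchronizedCounter implements Counter {
        private long count;
        public synchronized void hit() { count++; }
        public synchronized long get() { return count; }
    }

    public static void main(String[] args) throws InterruptedException {
        final Counter counter = new SynchronizedCounter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 1000000; j++) counter.hit();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter.get()); // 4000000 with the synchronized variant
    }
}
```

Swapping in VolatileCounter (or a plain field) makes the final count come out short, which is exactly the error rate the test measures.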
Test environment: the usual dual-core AMD nuclear submarine with 2 GB 1066 MHz memory, Linux and Java 1.6
Conclusions
- Access to volatile fields is, of course, much slower than access to normal fields. The results say it is about 4 times slower, but I believe this also depends on your RAM speed. It is actually not much better than a synchronized block.
- The cost of synchronization looks high indeed if you just look at those lines, but not high at all once you know that the "simple" and "volatile" solutions produce wrong results.
- The probability of a bad result coming from wrong concurrency code is huge. If you frequently update something from multiple threads, you need to think about synchronization. Admittedly, this test really is an edge case.
- From the first moment I heard of software transactional memory, I loved the idea. It sounds just great. But in this test it does not perform well, at least not on 2 cores, which is something the wiki page mentions as well. It would be nice to run it on a 4-core or 8-core computer just to see how the figures change, but my guess is that it would not improve much, because it needs to do way too many rollbacks. Optimistic locking should perform better on 4+ cores, where the probability of collision is relatively small. This is not really a fair test for STM; it needs a smarter one.
About the error rates: my first impression was that the volatile solution produces even more errors than the simple one, but now I am not quite sure. Anyway, they are both just wrong.
Think before Thread.start()!
Friday, 22 April 2011
JMS and ActiveMQ's configurations
JMS is kind of an important and mature piece of the Java server-side architecture. It is used in quite a lot of situations:
- You want to make sure the message will be received but you do not want to wait until it is received :)
- You want to send a message to multiple targets
- You do not need an answer immediately and you do not want to overload the target system
Also, in some of these cases it is critical to have the message persisted, while in others it really isn't. For example, typical business messages require persistence: even if the server fails, after recovering the disks you still have the message and sooner or later it will be processed.
A distributed ehcache synchronization, on the other hand, really does not need any persistence; if a message is lost because of a power outage, speed and throughput matter much more. To be honest I would use RMI for ehcache synchronization rather than JMS, but RMI is not a good choice when you have firewalls between your servers. Most software developers do not have a choice anyway: JMS is the standard for all kinds of messaging, it is part of JEE/J2EE and most other platforms, it is everywhere.
The market for JMS products has narrowed down to a few big players and some agile small ones. The big players are Apache ActiveMQ and IBM's WebSphere MQ. If you look at the Google Trends graphs of the two products, WebSphere is in slow decline while ActiveMQ has a stable hit rate in Google search. ActiveMQ has generated more activity since 2009, while WebSphere MQ keeps generating money for IBM.
So, if you want a mainstream JMS server, ActiveMQ is the only choice. I would never recommend IBM software for anyone :-)
The small (but agile) players...
ActiveMQ is blue; RabbitMQ is red and shows nice activity, especially since it is backed by SpringSource/VMware; and HornetQ is the yellow one.
So I was very interested in ActiveMQ and how hard we can kick it before it becomes a bottleneck in the system. I used the default configuration and a single-threaded sender and receiver to send and receive text messages, and it constantly and reliably did not let through more than 25 messages per second. This was a bad experience, of course; one would expect much more than that. So I looked into it: the default configuration runs with KahaDB and forces each message to be synced to disk. I disabled the disk sync and it increased the speed to 300 messages/sec. That is closer to what I would like to see :-)
Another persistence option is JDBC. I used the default Derby just for comparison and it produced 100 messages/sec, still not that bad; it would make sense to give it a try with MySQL and PostgreSQL. With no persistence at all: 700 messages/second. Not bad...
The default configuration is safe for everyone, but its performance is not that impressive. Some applications do not need this safe operation, and you can use other storage options that perform much better. Now, if you have the two different needs in one application, it seems that you cannot configure each queue to behave differently: either all of them are forced to write to disk or none of them are synced. So it seems to me that in such cases it is better to have two differently configured ActiveMQ instances for the two different requirements.
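For reference, the disk-sync trade-off described above is controlled in the broker configuration (activemq.xml). A sketch along these lines, assuming the 5.x KahaDB adapter's enableJournalDiskSyncs attribute; the broker name and directory are just examples:

```xml
<!-- Trades durability for speed: messages are no longer fsync'd on every write. -->
<broker xmlns="http://activemq.apache.org/schema/core" brokerName="fastBroker">
  <persistenceAdapter>
    <kahaDB directory="${activemq.data}/kahadb" enableJournalDiskSyncs="false"/>
  </persistenceAdapter>
</broker>
```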