Thursday, 19 May 2011

The distribution of java.lang.String hash values

I believe most programs use String keys for HashMap's, because Strings are:
  • immutable - mutable objects can do tricks in HashMaps and this is not funny when you spend a night debugging them
  • final - you can not override it with some smart code
  • and of course Strings represent most of the data stored in business apps (names, filesystem paths, etc)
So, my question is... How does plain english wikipedia article titles play with the String hash algorithm? Is it random like the white-noise or can we see some interesting hills and walleys
Also, what outputs do other languages generate? E.g. hungarian language has some special extra characters compared to english, chinese writing has some 47.000 extra characters.

Let's see: A java int is a signed 32-bit value. The minimal value is -2147483648, the maximum is 2147483647 and there are 8472583 articles in the latest english wikipedia dump. Therefore there are roughly 506 times as much hash keys as wikipedia articles, at least for english.

I am not going to create 2^32 counters for each possible hash, I will just take the hash/1000000 and put them in a HashMap<Integer, Integer>. (Why not, Integer is also ok for hash key, I love the simple implementation of Integer.hash). So in average there should be a little more than two in each counter, if the distribution is even.
Sorry for the static chart instead of the interactive one, the interactive would seriously kill your browser... So here it comes.
The distribution of english wikipedia title hash values

The distribution of the hungarian wikipedia title hash values
The distribution of the chinese wikipedia title has values

Now there are some similarities. It seems more like a long tail model than an even distribution, there are huge amount of hash values between zero and a million, but the most of the hash values are elsewhere.
There are some peaks that both hungarian and english wikipedia shares, and they both have some unique  peaks. The chinese does not have these peaks.

So my conclusion is that the distribution of hash keys somewhat depends on what you use it for, but it is not bad anyway.

I always wanted to know if it is really worth for String to cache the hash value in the 'hash' field, or wouldn't it make more sense to make it transient at least?

Update for my stupid question above: serialization handles String objects specially, they are not serialized as other beans, therefore the hash field is not included. The hash value caching is still an interesting question, I will look into it.

Saturday, 14 May 2011


Volatile is a rarely used keyword in Java. If you have never used it, don't worry, you are almost certainly right! All it does is that it enforces the program to read the value from the RAM rather than using a cache, so you can be sure that you got the fresh value at least at the moment when you read it. It has a somewhat better performance than a synchronize block since it does not lock. However, you run into trouble if you also want to write the data back, because even a ++ operation is non atomic, it is a read, a calculation and a write, therefore the probability of a wrong result is high.

I was interested in the following questions:

  1. How much slower is the access to volatile fields?
  2. What is the cost of synchronization?
  3. What is the probability of the bad result coming from wrong multi-threading code?
  4. How does the synchronization compare to modern alternatives like STM?
So I wrote a very simple test code to answer all four questions, a Counter interface with 4 implementations:
  1. Simple counter
  2. Synchronized counter
  3. Counter with volatile field
  4. STM counter with Multiverse
The test code starts a number of threads and shares a single counter with them. All threads call hit() on the counter exactly 1 million times and then terminates. The test waits for all the threads to finish and then checks the Counter. It should be exactly a million times the number of threads. And of course, we have two BAD implementation here (number 1 and 3), where I only wanted to know how wrong they are.

Test environment: the usual dual-core AMD nuclear submarine with 2 GB 1066 Hz memory, Linux and java 1.6



  1. Access to volatile fields is much slower of course, than just normal fields. The results say it is about 4 times that slow, but I believe this also depends on your RAM speed. It is actually not much better than a synchronized block.
  2. The cost of synchronization is high indeed if you just look at those lines, but not high at all if you know that the "simple" and "volatile" solutions produce wrong results.
  3. The probability of bad result coming from wrong concurrency code is huge. If you frequently update something from multiple threads, you need to think about synchronization. Well, this test really is an edge case, but never mind.
  4. From the first moment when I heard of software transactional memory, I love the idea. It sounds just great. But in this test it does not perform great, at least not on 2 cores, but this is something the wiki page mentioned as well. It would be a nice to run it on a 4-core or 8-core computers just to see how the figures change, but my guess is that it does not improve, because it needs to do way too many rollbacks. Optimistic locking should perform better on 4+ cores when the probability of collission is relatively small. This is not actually a fair test for STM, it really needs a smarter one.

About the error rates: My first impression was that the volatile solution produces even higher error than the simple one, but now I am not quite sure. But anyway, they are both just wrong.
Think before Thread.start()!