Thursday 19 May 2011

The distribution of java.lang.String hash values

I believe most programs use String keys for HashMap's, because Strings are:
  • immutable - mutable objects can do tricks in HashMaps and this is not funny when you spend a night debugging them
  • final - you can not override it with some smart code
  • and of course Strings represent most of the data stored in business apps (names, filesystem paths, etc)
So, my question is... How does plain english wikipedia article titles play with the String hash algorithm? Is it random like the white-noise or can we see some interesting hills and walleys
Also, what outputs do other languages generate? E.g. hungarian language has some special extra characters compared to english, chinese writing has some 47.000 extra characters.

Let's see: A java int is a signed 32-bit value. The minimal value is -2147483648, the maximum is 2147483647 and there are 8472583 articles in the latest english wikipedia dump. Therefore there are roughly 506 times as much hash keys as wikipedia articles, at least for english.

I am not going to create 2^32 counters for each possible hash, I will just take the hash/1000000 and put them in a HashMap<Integer, Integer>. (Why not, Integer is also ok for hash key, I love the simple implementation of Integer.hash). So in average there should be a little more than two in each counter, if the distribution is even.
Sorry for the static chart instead of the interactive one, the interactive would seriously kill your browser... So here it comes.
The distribution of english wikipedia title hash values

The distribution of the hungarian wikipedia title hash values
The distribution of the chinese wikipedia title has values

Now there are some similarities. It seems more like a long tail model than an even distribution, there are huge amount of hash values between zero and a million, but the most of the hash values are elsewhere.
There are some peaks that both hungarian and english wikipedia shares, and they both have some unique  peaks. The chinese does not have these peaks.

So my conclusion is that the distribution of hash keys somewhat depends on what you use it for, but it is not bad anyway.

I always wanted to know if it is really worth for String to cache the hash value in the 'hash' field, or wouldn't it make more sense to make it transient at least?

Update for my stupid question above: serialization handles String objects specially, they are not serialized as other beans, therefore the hash field is not included. The hash value caching is still an interesting question, I will look into it.

No comments:

Post a Comment