Friday 9 December 2011

PostgreSQL - enums vs varchar

Work-related topic today. oVirt stores Java enum values as integers in the database. Java 1.5 enums are great tools in the language, but if you accidentally re-order the constants, their ordinals change, which is tragic for the records already persisted in the database.

One solution is to add a final integer field to your enum, set it from a private constructor, add a getter, and use it in your database persistence code - see the sketch below. Ah yes, that is a lot of code to be written and maintained... and basically we are just lazy, just like anyone else. [Anyway, this solution is actually used by oVirt today.]
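A minimal sketch of that pattern, with a hypothetical BirdType enum (the bird names reappear in the benchmark below):

public enum BirdType {
    CUCKOO(0),
    EAGLE(1); // new constants get new numbers; existing numbers never change

    private final int dbValue;

    private BirdType(int dbValue) {
        this.dbValue = dbValue;
    }

    public int getDbValue() {
        return dbValue;
    }

    // reverse lookup when reading rows back
    public static BirdType forDbValue(int value) {
        for (BirdType t : values()) {
            if (t.dbValue == value) {
                return t;
            }
        }
        throw new IllegalArgumentException("unknown value: " + value);
    }
}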

Another possible solution, supported for example by JPA, is to persist your enum as varchar. This is great, but varchars are bigger than integers, and renaming an enum constant becomes difficult. On the other hand, names are more human-friendly than numbers, so you probably will not want to rename them very often anyway.
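With JPA this is a one-liner; a sketch with a hypothetical entity reusing the BirdType enum from above (the constant's name() is what ends up in the varchar column):

import javax.persistence.Entity;
import javax.persistence.Enumerated;
import javax.persistence.EnumType;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

@Entity
public class Bird {

    @Id
    @GeneratedValue
    private Long id;

    // stored as varchar: the constant's name, e.g. 'EAGLE'
    @Enumerated(EnumType.STRING)
    private BirdType birdtype;
}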

So lots of guys came up with the question: why not use database enums? Enums are a vendor-specific feature of PostgreSQL. [Relational databases are full of vendor-specific features, though fortunately not as much as NoSQL databases - at least they somewhat agree on SQL syntax :-) ] PostgreSQL has enums. The enum type is cool: it takes only 4 bytes, it can be read and written as character data, and it can also save your application from invalid data, e.g. an 'rde' accidentally inserted as a color.
When extending an enum, you can add new elements with ALTER TYPE ... ADD VALUE (removing values is not supported). However, this command is a new feature of the latest and greatest [and therefore not that widespread] version 9.1; it is not available in PostgreSQL 9.0. [This could be a problem for oVirt, since it targets PostgreSQL 8.4.]
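For illustration, here is a rough JDBC sketch of creating and writing such a column from Java. The type, table and column names are my guesses based on the query plans below, and passing the value with Types.OTHER is one common way to make the PostgreSQL driver accept a string as an enum literal:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;
import java.sql.Types;

public class EnumColumnDemo {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/temp", "postgres", "secret"); // placeholder credentials
        Statement st = con.createStatement();
        st.execute("CREATE TYPE birdtype AS ENUM ('cuckoo', 'eagle')");
        st.execute("CREATE TABLE vi (id serial PRIMARY KEY, birdtype birdtype)");
        st.close();

        PreparedStatement ps = con.prepareStatement("INSERT INTO vi (birdtype) VALUES (?)");
        ps.setObject(1, "eagle", Types.OTHER); // let the driver send the string as an enum literal
        ps.executeUpdate();
        ps.close();
        con.close();
    }
}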

Ok, enough talking, let's see the data.

Insert performance


As you can see, with enums you can win a nice 10-15% on performance, but this is an edge case: in this benchmark I inserted 100,000 records in a single transaction, which is very rare in applications.
When using multiple transactions, the difference disappears, since the transaction overhead is significant.
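For reference, the shape of the insert benchmark as I would sketch it - a plain JDBC loop in one transaction (my reconstruction, not the exact test code; it reuses the vi table from the sketch above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

public class InsertBenchmark {

    // inserts 100,000 rows in a single transaction, roughly as described above
    static long run(Connection con) throws SQLException {
        con.setAutoCommit(false);
        PreparedStatement ps = con.prepareStatement("INSERT INTO vi (birdtype) VALUES (?)");
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            ps.setObject(1, (i % 2 == 0) ? "cuckoo" : "eagle", Types.OTHER);
            ps.executeUpdate();
        }
        con.commit(); // a single commit; committing per row would drown the enum/varchar difference
        ps.close();
        return System.currentTimeMillis() - start;
    }
}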

Disk use



Sure, this is why we came to enums in the first place. There is no significant difference in the index size; with such short values, the per-entry overhead of the index dominates anyway. However, the smaller storage requirement is still great: the database server can cache more useful information in RAM, which results in faster responses if the database is big - I mean, bigger than your RAM. [Not sure if this is a significant advantage, oVirt does not need a huge database.]

Update performance



Again, enum beats varchar by roughly 10 percent. [This is a relatively small win for oVirt, but e.g. updating statuses of Virtual Machines is a very frequent operation.]

Select

Select is another very frequent operation. As mentioned, enums take less space, so more data can fit in memory.

Let's see some query plans; the first one is for the varchar column, which has an index on it.

temp=# EXPLAIN ANALYZE select * from vi where birdtype = 'cuckoo';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on vi  (cost=7.44..22.59 rows=412 width=10) (actual time=0.032..0.032 rows=0 loops=1)
   Recheck Cond: ((birdtype)::text = 'cuckoo'::text)
   ->  Bitmap Index Scan on vi_birdtype_idx  (cost=0.00..7.34 rows=412 width=0) (actual time=0.029..0.029 rows=0 loops=1)
         Index Cond: ((birdtype)::text = 'cuckoo'::text)
 Total runtime: 1.790 ms
(5 rows)


And then the same with the enum column.




temp=# EXPLAIN ANALYZE select * from vi where birdtype = 'eagle';
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on vi  (cost=707.02..1998.61 rows=19967 width=10) (actual time=2.465..7.674 rows=20000 loops=1)
   Recheck Cond: ((birdtype)::text = 'eagle'::text)
   ->  Bitmap Index Scan on vi_birdtype_idx  (cost=0.00..702.03 rows=19967 width=0) (actual time=2.378..2.378 rows=20000 loops=1)
         Index Cond: ((birdtype)::text = 'eagle'::text)
 Total runtime: 8.738 ms
(5 rows)



No significant difference, because the index size and efficiency are the same.

Saturday 3 December 2011

Cassandra on a pile of wreck

NoSQL databases traded ACID compliance for speed and scalability, and I was very interested in how good a deal that is for mere mortals like me.

I have been experimenting with Cassandra for a while. Version 1.0 was released recently (probably a month ago) and bugfix releases followed quickly, so the current release is now 1.0.5.

A special challenge is the server farm I have: no uniform servers. Actually, you could call them a pile of wreck - they would not even qualify as desktop-class - but that does not matter. There is a trick to balance the servers right.

  • dummywarhead - the first Cassandra node and the seed of the cluster, running Fedora 15. Its hardware: a 512 GB SATA hard disk, 2 GB of DDR3 RAM and a dual-core AMD CPU.
  • maroon - the only branded server in my farm, an IBM xSeries 226 with a hyper-threaded 3 GHz Xeon processor, 1 GB of DDR2 RAM and an old 250 GB SATA hard drive. Runs Ubuntu. (Yeah, I was lazy and just left it running Ubuntu; it does not really matter from a performance point of view.)
    I really like the IBM chassis, but the server generates terrible noise under load.
    As it turned out, I can expect 50% of dummywarhead's CPU performance from this server, and it has exactly half as much memory.
  • Switch: a TP-LINK 5-port gigabit switch - this should not be a bottleneck in the system. Cat-6 cables and gigabit network interfaces are installed in all of the nodes.
  • Load generator: Lenovo T520 Tankpad: 4 GB RAM, 4 cores. (Red Hat work laptop)
My test is very simple: in each iteration I write 100,000 records to Cassandra, and then I read 100,000 random records from all the data I have inserted so far. I tried to avoid tricks in the setup, but I had to use a few in order to get nice results.
  • Since I do not do updates, the read repair chance is set to zero.
  • I abandoned Cassandra's Thrift client after the first few tests and moved to Hector. Cassandra's client seemed to be unable to deal with dead connections, database topology discovery, etc. Hector has no problem with any of these.
    Cassandra's client also seemed to have_a_problem with javaNamingConventions, which was a bit confusing.
  • It seems that Cassandra (up to 1.0.5) assumes that hosts are uniform, so when a new node joins the network, it automatically balances the database for equal load. In order to get a good result out of the test, I had to update the tokens before starting the load test. The logic is quite simple: you just have to distribute the keys between 0 and 2^127, and this determines the load on the servers. Since I assumed that dummywarhead can take twice as much load as maroon, I generated token 56713727820156410577229101238628035242 for dummywarhead and 113427455640312821154458202477256070484 for maroon. This puts 66% of the load on the first and 33% of the load on the second node - see the sketch below.
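The token arithmetic, as a small sketch (the 2:1 weighting is my assumption about the two machines; the printed numbers are the tokens quoted above):

import java.math.BigInteger;

public class TokenCalculator {
    public static void main(String[] args) {
        // the token space goes from 0 to 2^127, as described above
        BigInteger ring = BigInteger.ONE.shiftLeft(127);
        BigInteger third = ring.divide(BigInteger.valueOf(3));

        // a node owns the arc from the previous node's token up to its own token,
        // so placing dummywarhead at 1/3 and maroon at 2/3 of the ring gives
        // dummywarhead the wrapped-around 2/3 arc and maroon the remaining 1/3
        BigInteger dummywarhead = third;
        BigInteger maroon = third.multiply(BigInteger.valueOf(2));

        System.out.println("dummywarhead: " + dummywarhead);
        System.out.println("maroon:       " + maroon);
    }
}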


On this chart you see a single-node test. The first thing you notice is that the read times (red) get a little slower until about the half of the test, then they go unstable, and by the end of the test they are 10 times slower than at the halfway point. This is because of the amount of memory in the test server: once it ran out of memory, it had to go to disk more and more often in each round.
The second interesting thing here is that the writes are so nice and stable.


In this second test I let Cassandra balance the database the way she wanted. I started up both dummywarhead and maroon, and started the test after maroon had successfully joined the ring.
As you can see, in this case the collapse of read operations happened much later and did not reach the 200-second limit that the previous single-node test did. However, it is still far from being half as much...
Watching the load on the servers, maroon soon started to do heavy disk reads since it ran out of RAM. The other server, dummywarhead, did not start reading from disk at all until the end of the test.
I made the first two charts so that it is easy for you to compare the two. The third chart still belongs to the second experiment: I could not resist making some modifications, and I used the nodetool move command to re-align the load according to the real capabilities of the servers. So this is the same test, continued.


What you see here is that the response time got crazy long for a while when I started to re-align the cluster - it is a huge peak. And then, when the operation finished, the response time dropped again. The sharp-eyed may notice that the write operations also got 20% faster after the re-alignment.

So far so good... but.

Joining nodes to the cluster under load takes a lot of time. Really a lot, even with small databases. I am sure adding one more server to a 100-node cluster is no problem, but adding a second node to a single-node cluster really hurts. And then it rebalances the cluster to the (bad) defaults, which drops performance. So at the moment, running Cassandra on non-uniform nodes seems to be difficult. Plus, scaling out did not immediately make things better; it dropped performance for a while.
The Cassandra cookbook says that the move operation actually removes the node and re-adds it to the cluster with the new token. This may have changed since Cassandra 0.5; it is no longer that heavy an operation. As far as I can tell, it did not remove the node, it only put it into the 'Moving' state for a while.

One can play a lot with this software, it has some interesting concepts.

Sunday 6 November 2011

JSON vs XML

A few days ago I asked a few guys at work: if they were given the choice, would they pick JSON or XML as the format for a REST API? Everyone responded in favor of JSON. What is surprising in this is that they were all Java developers, not a single JavaScript guy. Maybe it is too early to say this, but it seems that after the initial success of XML in the previous decade and its stagnation in the last few years, XML as a format is slowly declining, while more compact, simple, schemaless formats are on the rise: JSON and YAML.
JSON and YAML implementations are just as simple as their formats, and there are very many of them to choose from.

Question: How do JSON parsers compare to XML parsers from performance perspective?

I wrote sample inputs in XML and JSON - the same data, basically. For XML I made two input files: the first is 'traditional' XML with no attributes at all, the second uses some attributes. The third file is the JSON version. I used very few XML APIs, only the ones packaged into the JDK; this is a bit unfair because the alternative XML APIs may be a little faster than the default ones, but I made a test a few years ago that showed they are not that different, so I was more interested in JSON. For JSON, I used 5 different parsers: json.org, jackson, jsonlib, jsonj and gson.
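For a taste of why JSON parsing is so pleasant, here is a minimal Jackson sketch. I am using today's Jackson 2.x package names (the 1.x versions used at the time lived under org.codehaus.jackson), and the sample document is made up:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonDemo {
    public static void main(String[] args) throws Exception {
        String json = "{\"birds\":[{\"name\":\"cuckoo\"},{\"name\":\"eagle\"}]}";
        ObjectMapper mapper = new ObjectMapper(); // reusable and thread-safe once configured
        JsonNode root = mapper.readTree(json);
        System.out.println(root.get("birds").get(0).get("name").asText()); // prints "cuckoo"
    }
}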



Even the slowest JSON parser is faster than the XML parsers of the JDK, and the format is more compact as well, so it looks like a win. Jackson is much faster than the rest. Its website is right: it is very fast.

Test source code: here.

Saturday 29 October 2011

turning simple into difficult: date formats

SimpleDateFormat is a nice little class that helps programmers turn a date into a string and vice versa. It is very simple to use and has been packaged with standard Java since 1.2 probably... that was a very long time ago. And of course we use it a lot.
On the other hand, SimpleDateFormat is kind of a pain in the back because it is not thread-safe and never will be. I have always wondered why; it seems like a terribly wrong design decision, and it makes things a bit difficult. Most of the time you just want to create a final static SimpleDateFormat field with your preferred format. If you do this, your code is probably already broken: under high load it will produce stupid results and you will not know why. (Unless you use PMD with the proper rules.) If your class needs to be thread-safe - most of them do - the best thing to do is to create a new instance of SimpleDateFormat when you need it, as a local variable in a method.
What is painful with this? Its performance sucks.

This is what everyone knows. Let's see how to deal with the problem:
  1. Workaround :-)
    You can have a final static SimpleDateFormat object, and anytime you want to use it, you clone it and use the clone to do the job. Funny, right?
    This is a theoretical solution. Do not use it! The Java culture chose a completely different approach, and year by year thousands of Java developers get a heart attack after finding a final static SimpleDateFormat field in the code.
  2. Commons Lang has been around for a good while as well. Life would be hell without the Apache Commons projects. Commons Lang provides some support (FastDateFormat) for date formatting. Unfortunately it does not support parsing, only formatting :-( but even that is more than nothing.

Now of course the choice depends on your taste, but I thought I would help you with a short performance benchmark on the topic:

new SimpleDateFormat 100000 times : 170 ms
SimpleDateFormat clone  100000 times : 68 ms
DateFormatUtils 100000 times : 62 ms
FastDateFormat 100000 times : 55 ms
Update: the two lines below were added based on Zsombor's ThreadLocal idea. Wow, this must be some mistake!

SimpleDateFormat form pre-created ThreadLocal 100000 times : 29 ms
SimpleDateFormat in ThreadLocal check each time 100000 times : 28 ms
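A sketch of the ThreadLocal variant - my guess at what these two measurements look like in code:

import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFormatHolder {

    // one SimpleDateFormat per thread: no sharing, no synchronization, no repeated construction
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                }
            };

    public static String format(Date date) {
        return FORMAT.get().format(date);
    }
}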

It surprises me again and again how badly SimpleDateFormat performs when you create it over and over. More than half of the processing is just creating the SimpleDateFormat object - an enormous waste of resources. Once it is created, it is very quick.

Another funny link: search jarvana to find out how many packages have a FastDateFormat. I can't decide if it is tragic or funny...

Test code: here.
java version: sun jdk 1.6.22
OS: linux (fedora 15)
Hardware: Lenovo T520 with Intel core i7 vpro (thank you Red Hat :) )

Friday 12 August 2011

...and...

...and I am back.
I moved to Brno, Czech Republic and now I am working for Red Hat.
Since I am not good at Czech and most people in Brno are not good at English, it took a few weeks, but now I have an internet connection at home and I will continue my performance checks soon. Recently I have been playing with Cassandra and Infinispan. Cool stuff :-)

Wednesday 15 June 2011

Serialization speed

Serialization is probably the simplest way to send data to an output stream. The three most typical use cases:
  • Session replication in servlet containers
  • A simple way to save information to disk
  • RMI uses it heavily
Of course session replication is kind of important and it would be nice if it was quick. Therefore, I was very interested in the cost of serialization in general, with small data structures and bigger ones alike.

Serialization is smart: it does not serialize the same object twice, it just sends a reference to an already serialized object. This is the mechanism that allows deserialization to reconstruct the same object graph on the other side of the stream. The small drawback of this smart behavior is that if you push lots of data through an ObjectOutputStream, you should call reset() from time to time, otherwise the ObjectOutputStream holds a reference to every object you have serialized and sooner or later you get an OOM. The reset method clears the ObjectOutputStream's object/ID tracking, and subsequent calls to writeObject will actually serialize the data again. So in my tests this method is called after each serialization. Well, there are a couple of ways you can shoot yourself in the foot with serialization :-)
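The shape of such a loop, as a rough sketch (my reconstruction, not the exact test code; the do-nothing output stream is the one described in the next paragraph):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

public class SerializationLoop {

    // a do-nothing sink, so only the serialization cost is measured, not the IO
    static class NullOutputStream extends OutputStream {
        @Override
        public void write(int b) throws IOException {
            // discard everything
        }
    }

    static void serializeMany(Iterable<? extends Serializable> items) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(new NullOutputStream());
        for (Serializable item : items) {
            oos.writeObject(item);
            oos.reset(); // forget the back-references, otherwise memory use grows until OOM
        }
        oos.close();
    }
}
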
And for the IO stream, I implemented something similar to commons-io's NullOutputStream, since it is the serialization speed I want to measure, not the IO speed. Here comes the Google eye-candy:





So... what this tells me is that when you have problems with session replication, your problem is not the serialization; it is more likely the IO speed. While serialization produced 600 MB in just 4 seconds, a gigabit network with a TCP/IP connection can transport roughly 60-70 MB/sec, so it will need at least 10 seconds to transport that much data.
If a component is slow, use it sparingly :-)

Java version: 64 bit Sun/Oracle JDK 1.6
OS: Linux (fedora)

Thursday 19 May 2011

The distribution of java.lang.String hash values

I believe most programs use String keys for HashMaps, because Strings are:
  • immutable - mutable objects can do tricks in HashMaps and this is not funny when you spend a night debugging them
  • final - you can not override it with some smart code
  • and of course Strings represent most of the data stored in business apps (names, filesystem paths, etc)
So, my question is... How do plain English Wikipedia article titles play with the String hash algorithm? Is it random like white noise, or can we see some interesting hills and valleys?
Also, what output do other languages generate? E.g. the Hungarian language has some special extra characters compared to English, and Chinese writing has some 47,000 extra characters.

Let's see: a Java int is a signed 32-bit value. The minimum value is -2147483648, the maximum is 2147483647, and there are 8,472,583 articles in the latest English Wikipedia dump. Therefore there are roughly 506 times as many possible hash values as Wikipedia articles, at least for English.

I am not going to create 2^32 counters, one for each possible hash; I will just take hash/1000000 and count those values in a HashMap<Integer, Integer>. (Why not - Integer is also fine as a hash key, and I love the simple implementation of Integer.hashCode.) So on average there should be roughly two thousand titles in each counter, if the distribution is even.
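The counting itself is tiny - a sketch of the bucketing described above:

import java.util.HashMap;
import java.util.Map;

public class HashDistribution {

    // counts how many titles fall into each one-million-wide slice of the int range
    static Map<Integer, Integer> bucketCounts(Iterable<String> titles) {
        Map<Integer, Integer> counters = new HashMap<Integer, Integer>();
        for (String title : titles) {
            int bucket = title.hashCode() / 1000000;
            Integer count = counters.get(bucket);
            counters.put(bucket, count == null ? 1 : count + 1);
        }
        return counters;
    }
}
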
Sorry for the static chart instead of the interactive one - the interactive one would seriously kill your browser... So here it comes.
The distribution of english wikipedia title hash values

The distribution of the hungarian wikipedia title hash values
The distribution of the chinese wikipedia title hash values

Now there are some similarities. It looks more like a long-tail model than an even distribution: there is a huge number of hash values between zero and a million, but most of the hash values are elsewhere.
There are some peaks that both the Hungarian and the English Wikipedia share, and they both have some unique peaks. The Chinese one does not have these peaks.

So my conclusion is that the distribution of hash values somewhat depends on what you hash, but it is not bad anyway.

I have always wanted to know whether it is really worth it for String to cache the hash value in the 'hash' field, or whether it would make more sense to at least make it transient.

Update for my stupid question above: serialization handles String objects specially; they are not serialized like other beans, therefore the hash field is not included. The hash value caching is still an interesting question, I will look into it.

Saturday 14 May 2011

Volatile

Volatile is a rarely used keyword in Java. If you have never used it, don't worry, you are almost certainly right! All it does is force the program to read the value from main memory rather than using a cached copy, so you can be sure that you get a fresh value, at least at the moment when you read it. It performs somewhat better than a synchronized block since it does not lock. However, you run into trouble if you also want to write the data back, because even a ++ operation is not atomic - it is a read, a calculation and a write - therefore the probability of a wrong result is high.

I was interested in the following questions:

  1. How much slower is the access to volatile fields?
  2. What is the cost of synchronization?
  3. What is the probability of the bad result coming from wrong multi-threading code?
  4. How does the synchronization compare to modern alternatives like STM?
So I wrote a very simple test code to answer all four questions, a Counter interface with 4 implementations:
  1. Simple counter
  2. Synchronized counter
  3. Counter with volatile field
  4. STM counter with Multiverse
The test code starts a number of threads and shares a single counter with them. All threads call hit() on the counter exactly 1 million times and then terminate. The test waits for all the threads to finish and then checks the Counter: it should be exactly a million times the number of threads. And of course, we have two BAD implementations here (numbers 1 and 3), where I only wanted to know how wrong they are.
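The two most interesting implementations, sketched from the description above (the real code is linked below):

interface Counter {
    void hit();
    long get();
}

// wrong on purpose: volatile guarantees visibility, but counter++ is still read-modify-write
class VolatileCounter implements Counter {
    private volatile long counter;
    public void hit() { counter++; }
    public long get() { return counter; }
}

// correct: the lock makes the whole read-modify-write atomic
class SynchronizedCounter implements Counter {
    private long counter;
    public synchronized void hit() { counter++; }
    public synchronized long get() { return counter; }
}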

Test environment: the usual dual-core AMD nuclear submarine with 2 GB of 1066 MHz memory, Linux and Java 1.6

Code: https://dummywarhead.googlecode.com/hg/volatilepenalty/




Conclusions

  1. Access to volatile fields is of course much slower than access to normal fields. The results say it is about 4 times slower, but I believe this also depends on your RAM speed. It is actually not much better than a synchronized block.
  2. The cost of synchronization is high indeed if you just look at those lines, but not high at all once you know that the "simple" and "volatile" solutions produce wrong results.
  3. The probability of a bad result coming from wrong concurrency code is huge. If you frequently update something from multiple threads, you need to think about synchronization. Well, this test really is an edge case, but never mind.
  4. From the first moment I heard of software transactional memory, I have loved the idea. It sounds just great. But in this test it does not perform great, at least not on 2 cores, though this is something the wiki page mentions as well. It would be nice to run it on a 4-core or 8-core computer just to see how the figures change, but my guess is that it does not improve, because it needs to do way too many rollbacks. Optimistic locking should perform better on 4+ cores when the probability of collision is relatively small. This is not actually a fair test for STM; it really needs a smarter one.

About the error rates: my first impression was that the volatile solution produces an even higher error rate than the simple one, but now I am not quite sure. Anyway, they are both just wrong.
Think before Thread.start()!

Friday 22 April 2011

JMS and ActiveMQ's configurations

JMS is kind of an important and mature piece of the Java server-side architecture. It is used in quite a lot of situations.
  • You want to make sure the message will be received but you do not want to wait until it is received :)
  • You want to send a message to multiple targets
  • You do not need an answer immediately and you do not want to overload the target system
Also, in some of these cases it is critical to have the message persisted, while in other cases it really is not. For example, typical business messages require persistence: even if the server fails, after recovering the disks you still have the message, and sooner or later it will be processed.
Distributed ehcache synchronization, on the other hand, really does not need any persistence. If a message is lost because of a power outage, it does not matter; speed and throughput are much more important than persistence. To be honest, I would use RMI for ehcache synchronization rather than JMS, but RMI is not a good choice when you have firewalls between your servers. Most software developers do not have a choice anyway: JMS is the standard for all kinds of messaging, it is part of JEE/J2EE and most other platforms, it is everywhere.

The market for JMS products has narrowed down to a few big players and some agile small ones. The big players are Apache ActiveMQ and IBM's WebSphere MQ. If you look at the Google Trends graphs of the two products, WebSphere is in slow decline while ActiveMQ has a stable hit rate in Google searches. ActiveMQ has generated more activity since 2009, while WebSphere MQ keeps generating money for IBM.


So, if you want a mainstream JMS server, ActiveMQ is the only choice. I would never recommend IBM software to anyone :-)

The small (but agile) players...

ActiveMQ is blue; RabbitMQ is red and shows nice activity, especially since it is backed by SpringSource/VMware; and HornetQ is the yellow one.

So I was very interested in ActiveMQ and in how hard we can kick it before it becomes a bottleneck in the system. I used the default configuration and a single-threaded sender and receiver to send and receive text messages, and it constantly and reliably did not let through more than 25 messages per second. This was a bad experience of course; one would expect much more than that. So I looked into it: the default configuration runs with KahaDB and forces every message to be synced to disk. I disabled the disk sync and it increased the speed to 300 messages/sec. That is closer to what I would like to see :-)
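The sender side of such a test is only a few lines; a rough sketch with the standard JMS API (broker URL and queue name are placeholders, and the receiver is analogous):

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsSendThroughput {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("test.queue"); // placeholder queue name
        MessageProducer producer = session.createProducer(queue);

        int count = 10000;
        long start = System.currentTimeMillis();
        for (int i = 0; i < count; i++) {
            producer.send(session.createTextMessage("message " + i));
        }
        long elapsed = Math.max(1, System.currentTimeMillis() - start);
        System.out.println(count * 1000L / elapsed + " messages/sec sent");
        connection.close();
    }
}
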
Another persistence option is JDBC. I used the default Derby just for comparison and it produced 100 messages/sec - still not that bad; it would make sense to give it a try with MySQL and PostgreSQL. With no persistence at all: 700 messages/second. Not bad...

The default configuration is meant to be good for everyone: it is safe, but its performance is not that impressive. Some applications do not need this safety and can use other storage options that perform much better. Now if you have the two different needs in one application, it seems that you cannot configure each queue to behave differently: either all of them are forced to sync to disk or none of them are. So it seems to me that in such cases it is better to run two differently configured ActiveMQ brokers for the two different requirements.

Thursday 31 March 2011

Playing with RMI

RMI is a popular RPC solution that has shipped with the JDK since 1.1; however, its popularity faded with the dawn of the web service stacks. RMI is kind of old school: it never played nice with firewalls and proxies, and it has never been a good choice for system integration, but it is a simple and well-known solution for pure-Java RPC, and from that point of view it is still kicking.

I was interested in the following aspects of using RMI:
  • Response time - how many remote procedure calls can my code do in a second?
  • Scalability - how does the response time change when I use multiple threads?
  • How does the response time change with the argument size?
Test Hardware

The hardware I used is not some high-end server: it is my desktop computer and my previous desktop computer. The RMI server runs on a dual-core AMD box; the RMI client is a less powerful old 1800 MHz AMD machine. The network between the client and the server is gigabit ethernet with a switch. Switches are known to increase network latency.
Both computers run relatively up-to-date Linux versions (Fedora). The test computers generate an unpleasant noise level, and I had to wait for the opportunity to finish this test.

There were issues during the test. The old AMD server suddenly kernel-crashed under high load. I do not know yet if this is a hardware failure or a kernel problem; I will repeat the test with another test client.

Test method

The client opens a single connection to the RMI server, starts N threads, and each thread sends over a byte array that the server sends back. Very simple test.
I measured the time needed to do 10000 calls. If there are 3 threads, then each thread does 10000/3=3333 calls. This is not quite accurate, but who cares about that last call.
Code to be shared...
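Until the code is shared, here is a rough sketch of what such an echo test looks like (single-threaded variant; the interface, host and binding names are hypothetical):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;

// the remote interface: the server simply echoes the byte array back
interface Echo extends Remote {
    byte[] echo(byte[] data) throws RemoteException;
}

public class EchoClient {
    public static void main(String[] args) throws Exception {
        Registry registry = LocateRegistry.getRegistry("server-host"); // placeholder host name
        Echo echo = (Echo) registry.lookup("echo");                    // placeholder binding name
        byte[] payload = new byte[1024];

        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            echo.echo(payload);
        }
        System.out.println("10000 calls took " + (System.currentTimeMillis() - start) + " ms");
    }
}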

Results

To be honest, RMI was a positive surprise for me - very nice results from such an old-timer.



The network latency

Today's server farms are usually built with gigabit networks or better. 10-gigabit networks are getting more and more popular as well; however, they are not yet available in the hardware store of mere mortals. Therefore I cannot benchmark a 10-gigabit network, but I repeated the same test after scaling my gigabit network down to 100 Mb/sec and adding some extra network latency by plugging the test server into a 10/100 router. So now the IP packets pass through both the gigabit switch and the 10/100 router. And the results are lovely.



Conclusions

  1. The network latency is significant if you are in a hurry, but if you use several threads, you can work around it. The more network latency you have, the more threads you need...
    Anyway, if you need a good response time, do not waste your processing time on any kind of RPC :)
  2. Sending twice as much data does not take twice as much time, at least as long as we are dealing with small records up to 8192 bytes. This is likely because of the TCP and ethernet overhead, which is very significant with small amounts of transferred data. Actually, the difference is very small, so it makes sense to be smart and send more data with one call rather than doing several small transfers. This is where the Facade design pattern can help.
    As you can see, once the transferred data grows over a certain size, the response time starts to grow accordingly.

Thursday 10 March 2011

The good old StringBuffer vs StringBuilder

Everyone (or at least almost everyone) knows that StringBuffer is synchronized and therefore slower than StringBuilder. How do they compare? It depends on a lot of things, e.g. if you append character by character, you will find that StringBuilder is much faster because it avoids synchronization. If you call append just a couple of times with relatively big strings, the difference will be very small.

However, there is another factor here. You can specify an initial capacity for both StringBuilder and StringBuffer, and if you don't (and you do not pass an initial string either), the capacity will be set to 16. Not much, but at least it is not wasting memory :) When they run out of space while appending, both roughly double the capacity by allocating a new array and arraycopy-ing the old content. This is somewhere you can gain a little speed again: if you have a guess for the length of the produced string, you can avoid at least the first few re-allocations and array copies.
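In code, the difference is just the constructor argument - a small sketch of the two variants compared in the chart below:

public class BuilderCapacity {
    public static void main(String[] args) {
        // grows from the default capacity of 16, re-allocating and copying along the way
        StringBuilder growing = new StringBuilder();
        // starts with enough room for the final string, so no re-allocation is needed
        StringBuilder preSized = new StringBuilder(1024);
        for (int i = 0; i < 1024; i++) {
            growing.append('x');
            preSized.append('x');
        }
        System.out.println(growing.length() + " " + preSized.length());
    }
}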

Let's see how much it matters...

As you can see, there is a huge difference between StringBuffer and StringBuilder, since this test calls append very many times with very short strings. There is another difference between the pre-allocated version and the one slowly growing from 16: this test constructed a 1024-character string, so with a good initial capacity the pre-allocated version saved 6 re-allocations. And there is the difference - with a good guess, you can still save a lot of processing time.

Now let's rewrite the code: instead of creating the long string by appending single characters, we can use bigger strings. The bigger the appended strings, the smaller the difference caused by synchronization, and at some point pre-allocating the space brings more benefit than avoiding synchronization, which could be important. The chart below was generated using 64-character strings to construct the same final size.



JVM: 64-bit server JVM 1.6.0_22-b04

Wednesday 9 March 2011

Starting up...

Hi,

I am Laszlo, .* developer at Duct-tape Solutions Inc. I have been working in the information technology industry for about 11 years and have been using Java on a daily basis for probably 10 of them. This is my new blog about Java tools, focusing on performance. Topics that I would like to cover:
  • Performance considerations of java core classes
  • Comparison of standard implementations, e.g. Tomcat versus Jetty, ActiveMQ versus HornetQ
  • Architecture - architecture is where most projects go wrong
  • Scaling out: Clustering and computation models - e.g. hadoop versus terracotta
  • Scaling up: Concurrency, memory, garbage collector and other JVM parameter tuning

Guidelines for posts:
  • I will share the test code for each post. I will have a mercurial repository at google code
  • I will use maven as build tool whenever possible, just to keep things simple.
  • Graphs
  • Description of the hardware and software environment. OS, java version, java runtime parameters, hardware components and so on...
Anyway, this is my first blog in English, sorry about the grammar mistakes. I hope you will enjoy it!