Sunday, 26 April 2015

Just a little VM tuning: Memory and CPU saving with KVM + KSM

This topic will be somewhat unusual from a java junkie like me, but hopefully interesting for those who are into cloud computing and virtualization. To make it easier to follow for everyone, I will start from quite far back; please just skip ahead if none of this is new to you, there may be some interesting pieces of information later on.

The basics

This may not be anything new for you; feel free to skip ahead to the hypothesis if you know Linux and virtual memory handling.

Virtual Memory

Modern computers break memory up into pages. When your program reads or writes a memory address, that address belongs to a page and is translated through the page table to a physical address.

This is so-called virtual memory, and it is what makes swapping possible: the OS can move some pages out of memory to a larger and cheaper storage (typically a disk). When a page is referenced that is not in memory, the hardware generates an interrupt and the OS takes over, loads the page back and returns control to the program. But that is not all it enables...

Linux has a small built-in module called Kernel Samepage Merging, or KSM for short. It was written by the same developer who wrote KVM, very likely with KVM in mind, but any other system can benefit.
I'd recommend reading the documentation shipped with the kernel, but in a nutshell this is what it does:
  1. Periodically checks memory pages
  2. If two identical pages are found, they are merged and marked copy-on-write (COW) - this is because KSM has no idea what a page is used for, it just merges whatever it finds
So if you have two VMs, both running the same OS, then most pages of the kernel and programs can be shared between them and they will never know. This can save a significant amount of memory and allows big memory overcommit in virtualized environments, if you accept the price:
  1. KSM takes CPU time. If you have a lot of memory, it will take a lot of CPU time.
  2. It basically does not know when to stop, it just keeps running, so additional software such as ksmtuned is used to manage it.
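As an aside, KSM reports what it is doing through counters in sysfs under /sys/kernel/mm/ksm, so you can watch it work. A minimal Java sketch for reading them (file names are the ones from the kernel documentation; on a machine without KSM the reads simply fail and -1 is returned):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class KsmStats {
    private static final Path KSM_DIR = Paths.get("/sys/kernel/mm/ksm");

    /** Reads a single KSM counter from sysfs; returns -1 if unavailable. */
    static long stat(String name) {
        try {
            return Long.parseLong(Files.readAllLines(KSM_DIR.resolve(name)).get(0).trim());
        } catch (IOException | RuntimeException e) {
            return -1; // not Linux, KSM not built in, or no permission
        }
    }

    public static void main(String[] args) {
        // pages_sharing / pages_shared is the sharing ratio; a high ratio
        // means KSM is earning its CPU time.
        System.out.println("pages_shared  = " + stat("pages_shared"));
        System.out.println("pages_sharing = " + stat("pages_sharing"));
        System.out.println("full_scans    = " + stat("full_scans"));
    }
}
```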


While CPUs became faster and faster until the second half of the 2000s, memory speed did not really keep up, so CPUs started to use ever-growing caches. The cache sits in the CPU and is very quick, but its size is still limited: even Xeon CPUs have around 10 MB of cache, and typical desktop CPUs have 1-2 MB.

The hypothesis

Since the cache is small, switching to another VM in a virtualized environment should cause some performance loss, since the cached kernel pages of VM1 need to be replaced with the actually identical pages of VM2.
The second part of the idea is that KSM could eliminate that performance loss. When pages are shared between the operating systems of the VMs, a cache miss is less likely after another VM takes over the CPU.
Therefore, once pages are merged with KSM and ksmd is turned off, switching between different VMs should be less expensive and response times should improve.

KSM could not only be a memory-saver, but also a CPU-saver.


To test the idea, I prepared 12 web server VMs and one load balancer, all running the Fedora 20 operating system. The web servers run Apache httpd, the load balancer runs HAProxy, with more or less default settings. Each VM has 256 MB RAM and a single CPU.

The test host is an Intel NUC D34010WYK with Core i3 CPU (important factors for the test: hyperthreading is enabled, cache size is 3MB) and 2x 8 GB DDR3-1600 RAM.

Nice little box, they could have called it Intel NUKE :)

To generate load, I use the simple Apache benchmark (ab) command-line utility from my laptop. The client machine is not really relevant; my laptop is a wreck, which is perfect motivation to speed-optimize software.
Load command:
ab -n 100000 -c 8

(This is the small "Powered by Fedora" banner)


Comments, conclusions

The results with more than 4 VMs seem to support the theory, but I was surprised to see the performance loss when the number of VMs was less than 4. I do not have an explanation for that yet.
I suspect that the fall of the no-KSM (blue) curve shows the increase in cache misses; it flattens out after 10 VMs, since by then cache misses are so frequent that they cannot get much more frequent.

On a different CPU and memory you will get different values, and different OSes and programs will also shift the numbers, so the intersection may be somewhere else, but the shape of the curves should be similar.


I think it would be interesting to repeat the test with hyperthreading turned off and see how the curve changes.

Monday, 29 December 2014

Performance comparison of JAX-RS 2.0 implementations

I was looking for a short comparison of JAX-RS 2.0 implementations and I could not find one that compared their response times, so I had to do it myself.

I'm sure some will argue that speed is not the most important characteristic of a REST framework, and things like stability, maturity, advanced features matter more. Well, they may or may not be right, but in any case performance is an important aspect.

Tested frameworks

While surely I missed someone's pet JAX-RS framework, I think I have selected the three mainstream ones:
  • CXF - I have used it for several years and have always been a happy user
  • Jersey - The reference implementation from Sun/Oracle
  • Resteasy - The JBoss implementation of JAX-RS

Test method

Project layout
  • I implemented a single Hello service annotated with JAX-RS annotations; it lives in a separate maven module and is used by the framework-specific webapps.
  • All web modules are configured with Jackson 2
  • For simplicity, the services are registered in Spring, although Jersey does not allow this the way Resteasy and CXF do
  • All tests are performed using Jetty 9.2.2.
  • For warm-up, I gave each test 1M requests on a single thread.
  • The test tool was the Apache httpd benchmark utility 'ab'. Each measurement was run separately with 100,000 requests and 1, 2, 4, 8 and 16 concurrent threads.
  • Source code of the tests available on github.
  • All tests performed on the same laptop. Fedora 20, Oracle Java 1.7.0_45, AMD E2-1800 wreck
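The methodology above (single-threaded warm-up, then a sweep of concurrency levels) can be sketched in plain Java. This is a hypothetical stand-in harness, not the actual test code; it benchmarks the JDK's built-in HttpServer instead of Jetty and the real frameworks:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MiniBench {

    /** Fires 'requests' GETs at 'url' from 'threads' workers, returns req/s. */
    static double run(String url, int requests, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(requests);
        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
                    try (InputStream in = c.getInputStream()) {
                        while (in.read() != -1) { /* drain the response */ }
                    }
                } catch (Exception ignored) {
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return requests / ((System.nanoTime() - start) / 1e9);
    }

    public static void main(String[] args) throws Exception {
        // Tiny built-in server standing in for Jetty + the JAX-RS webapp.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/hello", ex -> {
            byte[] body = "{\"message\":\"Hello world!\"}".getBytes("UTF-8");
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        server.setExecutor(Executors.newFixedThreadPool(4));
        server.start();
        String url = "http://localhost:" + server.getAddress().getPort() + "/hello";
        run(url, 1000, 1); // warm-up on a single thread
        for (int threads : new int[] {1, 2, 4, 8}) {
            System.out.printf("%2d threads: %.0f req/s%n", threads, run(url, 2000, threads));
        }
        server.stop(0);
    }
}
```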



I was kind of surprised to see how much better Resteasy performed, so I looked under the hood of the frameworks to figure out what they do when I hit the URL. After a few hours of digging I concluded that most of the difference is the abstraction layer. CXF is not only for REST but also for JAX-WS, it can consume messages over JMS and so on, and that abstraction layer takes its toll.

I also implemented a very thin experimental JAX-RS 2.0 implementation to see how that works, and in the same tests it outperformed the mainstream frameworks by roughly 2000 requests/second. I would not dare to replace CXF with it :-D but it is interesting to think about.

Sunday, 11 May 2014

Hello Babel!

Lucas van Valckenborch's painting
Even though the new Java 8 is out with long-expected features like lambda expressions (closures, for simplicity), millions of software developers are looking for a better Java.

There are tons of different reasons why you may consider switching to another language; performance is just one of them. I was interested in a very basic aspect of performance: how quickly the code compiled from the source is able to create a string.

The tested languages are:
  • Java 1.7
  • Scala 2.10.0
  • Kotlin 0.7.271
  • Groovy 2.3.0
  • Clojure 1.3.10
I really miss Ceylon from the list, but it does not have a Maven plugin and is not even published to Maven Central or any public repository, so I skipped it.

And the results are...

Well, single-threaded performance and simple String handling may not be a good reason to switch at the moment. As you see, Java far outperformed the other languages. If you look into the bytecode generated by Scala, Groovy and Clojure, the reason is obvious: it does so many things that it simply cannot be fast.
Kotlin performed only about half as fast as the code compiled from Java, and there the problem is a bit harder to spot. So let me highlight it...

       0: new           #19                 // class java/lang/StringBuilder
       3: dup          
       4: invokespecial #23                 // Method java/lang/StringBuilder."<init>":()V
       7: ldc           #25                 // String Hello
       9: invokevirtual #29                 // Method java/lang/StringBuilder.append:(Ljava/lang/Object;)Ljava/lang/StringBuilder;
      12: aload_1      
      13: invokevirtual #29                 // Method java/lang/StringBuilder.append:(Ljava/lang/Object;)Ljava/lang/StringBuilder;
      16: ldc           #31                 // String !
      18: invokevirtual #29                 // Method java/lang/StringBuilder.append:(Ljava/lang/Object;)Ljava/lang/StringBuilder;
      21: invokevirtual #35                 // Method java/lang/StringBuilder.toString:()Ljava/lang/String;
The problem is that the Kotlin compiler generated calls to the append overload that accepts an Object argument, even though the argument is known to be a String. This should be an easy fix, so I registered a bug in Kotlin's bug tracker.
Update: I played with it a little and found where the code is generated in the compiler. With my patch, the compiler generated line by line the same code as the Java compiler, and therefore it performed the same.

The truth is, even javac generates suboptimal code here; it could be beaten. Next time I will give that a try.
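The cost of the Object overload can be illustrated with a small, rough Java microbenchmark. Treat it as a sketch only; absolute numbers depend heavily on the JIT and the machine, but the two variants mirror the Kotlin-emitted and javac-emitted bytecode shown above:

```java
public class AppendBench {

    // Mirrors what the Kotlin compiler emitted: append(Object) routes the
    // String through String.valueOf(Object) instead of the String fast path.
    static String viaObject(String name) {
        StringBuilder sb = new StringBuilder();
        sb.append((Object) "Hello ").append((Object) name).append((Object) "!");
        return sb.toString();
    }

    // What javac emits for "Hello " + name + "!".
    static String viaString(String name) {
        StringBuilder sb = new StringBuilder();
        sb.append("Hello ").append(name).append("!");
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String round : new String[] {"warmup", "measured"}) {
            for (String variant : new String[] {"Object", "String"}) {
                long start = System.nanoTime();
                for (int i = 0; i < 1_000_000; i++) {
                    if (variant.equals("Object")) viaObject("world");
                    else viaString("world");
                }
                System.out.printf("%s append(%s): %d ms%n", round, variant,
                        (System.nanoTime() - start) / 1_000_000);
            }
        }
    }
}
```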

Test environment:
  • AMD E2-1800
  • Oracle java 1.7.0_45
  • Fedora linux
Test code on github.

Wednesday, 19 March 2014

CloudStack UI speed

I do some CloudStack development in my free time, and the user interface is my favourite part of CloudStack. It looks nice and practical in use and it is very clean on the inside, something to learn from. It is a single-page web application, so you do not have to wait for full page reloads.

There is a drawback though. When you load the user interface, all the templates and JavaScript files are loaded. At the moment (cloudstack-4.5.0-SNAPSHOT) this is 5.4 MB, quite a lot for a login page...

Solution 1: Dynamic compression

John Kinsella wrote an excellent blog post on how to make page loading faster by configuring Apache httpd to compress responses on the fly. The drawback of this solution is that it compresses all 5.4 MB of data every time you download it.

Solution 2: Static compression

Well, dynamic compression is usually good, but why would you want to compress that big pile of JavaScript and CSS again and again? You can compress those files at packaging time and serve the compressed version if the browser accepts it. This is a win-win situation: no CPU time is wasted, while network bandwidth is still saved. This is what my patch does.
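The packaging-time side of the idea is plain java.util.zip. A minimal sketch (not the actual patch) of pre-compressing a resource once, so the server only has to check Accept-Encoding and stream the ready-made .gz file:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class StaticGzip {

    /** Compresses a resource once, at packaging time rather than per request. */
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(plain);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Highly repetitive content, like concatenated JS/CSS, compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append("function f").append(i).append("() { return 42; }\n");
        }
        byte[] plain = sb.toString().getBytes("UTF-8");
        byte[] packed = gzip(plain);
        System.out.printf("%d bytes -> %d bytes%n", plain.length, packed.length);
        // At request time, a browser sending "Accept-Encoding: gzip" gets the
        // pre-built bytes; no CPU is spent compressing on the hot path.
    }
}
```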

I wanted to show you some results...
Without compression: 5.4 MB

With static gzip compression: 1.2 MB

4.2 MB saved each time you go to your cloud, and more importantly a few seconds of your users' lives. I hope you like it :-) But I have to admit there is an obvious drawback: static compression does not compress dynamic content. This is true not only for the AJAX interactions, but also for the index HTML page dynamically generated by some JSP files; it weighs about 200 KB.

Right solution: combine dynamic and static

I think the combination of static and dynamic compression is the perfect solution for the size problem. No more wasting CPU-time on static compression, no more wasting network on uncompressed dynamic content. Now you have both.

Thursday, 8 August 2013

Wikipedia loads

A bit off-topic for today. I wrote an app in Kotlin and MongoDB to follow Wikipedia loads. It was quite an interesting experience and I learned a lot from the experiment; it would be enough for a post of its own. But what I really want to show you is interesting from another perspective: it is a visualization of some facts about internet users, languages and cultures.

On each chart, the black line represents traffic in requests/hour, and the red line is rendered from the average traffic of that hour across several days.
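The red line is just a per-hour mean across days. A minimal Java sketch with made-up numbers (the real app stored the counts in MongoDB):

```java
import java.util.Arrays;

public class HourlyAverage {

    /**
     * counts[day][hour] holds the requests in that hour of that day;
     * the "red line" is the mean of each hour across all days.
     */
    static double[] hourlyMean(long[][] counts) {
        int hours = counts[0].length;
        double[] mean = new double[hours];
        for (long[] day : counts) {
            for (int h = 0; h < hours; h++) {
                mean[h] += day[h] / (double) counts.length;
            }
        }
        return mean;
    }

    public static void main(String[] args) {
        // Two toy "days" of three hourly samples each.
        long[][] counts = { {2, 4, 6}, {4, 6, 8} };
        System.out.println(Arrays.toString(hourlyMean(counts))); // [3.0, 5.0, 7.0]
    }
}
```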

English: around the clock

Usage graph of the english wikipedia

The English Wikipedia is a wiki of over 6 million articles maintained by a very big and very active community. What is interesting to me about the English language is that the sun never sets on it. English is the official language of the United States with more than 300 million native English speakers, Canada with 20 million, Australia with 21 million and the United Kingdom with 60 million. It is also an official language in India, in smaller Asian countries and in several African countries.
This gives the curve its interesting shape with several smaller peaks:
  • the big peak is at 18:00 UTC with roughly 14 million requests/hour
  • the second peak is at 2:00 UTC with 12 million requests/hour
  • the load never seems to go below 8 million requests/hour (even that is huge), reached at around 7:00 UTC
  • the top load that I have seen is about 18 million requests/hour

German: day use


German Wikipedia usage

Let's look at my favorite industrial nation. Unlike English, German is spoken almost exclusively in Europe. This may be the reason why we see bigger ups and downs in the curve. The top of the average load is 2.1 million requests/hour, but it also changes day by day; the top activity you see is 4 million requests/hour. That is huge activity from the 120 million native German speakers.

Hebrew: Sunday wiki

Hebrew Wikipedia usage

From the small languages I chose Hebrew. While it is spoken by very small minorities in many countries, it is the official and majority language only in Israel. These folks have an unusual schedule: Friday is not a working day, but they work on Sunday. Saturday is the most sacred day for religious Jewish people, and they do not work then.
Actually, the low-traffic day that you see on the chart is not a Saturday. Saturdays are totally average days on the Hebrew Wikipedia; the top day is Sunday, which is always above average.

Hungarian: The Two Towers

Hungarian Wikipedia usage

The other small language I chose is Hungarian, my native language. (Did you notice my grammar mistakes?) The interesting thing in this curve is the two peaks, at lunchtime and at dinner (19:00 GMT). I can't explain it. Do most people spend a little time checking mail, googling some stuff and reading Wikipedia at dinner? Anyway, usage after dinner falls dramatically.

Russian: Siberia


Russian Wikipedia usage graph
The last example is Russian; I wanted to see a language spoken in 10 different timezones all across Europe and Asia. This does not show in the curve, very likely because of the population distribution of Russia: most Russians live in the European part, while Siberia is almost uninhabited. Nice rivers, forests, mountains.

That's it for today, thanks for reading! I took the very last picture over beautiful Siberia; I hope one day I will have a chance to see it up close. I mean, without having to build a railway :-)

Thursday, 4 July 2013

String.split versus Pattern.split

I believe most people in Java use String.split() to split a String into pieces. That method has been there for ages (since Java 1.4), everyone knows it and it just works.
The alternative is to use a Pattern instance. Pattern is immutable, so you need only a single instance, created once, and it can serve your application forever. The people who started using it were, in my opinion, smarter: they knew that String.split actually needs to create a Pattern object, which is a relatively heavy operation, and they could save that cost.

However, this is not the end of the story. It would be too short and would not make a blog post :-)

String.split() is smart though, and it has a special case when the pattern is only one or two characters long and contains no special regexp characters. In this case it does not create a Pattern object, but simply processes the whole thing in place. That special piece of code is not there by accident.

Let's see the speed comparison.
As you see, String.split performs better when the character you split on meets the above requirements. When you need to split on several different characters (which I believe is the less frequent case), you'd be much better off using a Pattern constant.
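A minimal sketch of the two approaches side by side:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {

    // Compiled once; Pattern is immutable and thread-safe, so a single
    // constant can serve the whole application forever.
    private static final Pattern SEP = Pattern.compile("[,;]");

    public static void main(String[] args) {
        // Single, non-special character: String.split takes its fast path
        // and never builds a Pattern at all.
        System.out.println(Arrays.toString("a,b,c".split(","))); // [a, b, c]

        // Several possible separators: reuse the precompiled constant instead
        // of letting String.split compile "[,;]" on every call.
        System.out.println(Arrays.toString(SEP.split("a,b;c"))); // [a, b, c]
    }
}
```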

Sunday, 31 March 2013

compression again - system vs java implementation

Last time I mentioned that using the operating system's compression utility (gzip on Linux) performs better than the Java counterpart, even if you invoke it from Java (the good old Runtime.exec()). Of course, it is not quite that simple :-) So in this test I compare the time needed to compress a random input by the system and the Java implementations. The size of the input grows over the test, and so does the time needed to compress it, but there is something interesting.

So, as you see, the system gzip is faster, but it has a communication and process-creation overhead. The Java implementation runs without this overhead and therefore performs better on small inputs. The lines meet at about 512 KB; if the input is bigger, piping through the system command performs better.
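A sketch of the two contenders, assuming gzip is on the PATH for the system variant (the feeder thread is needed so that writing stdin and reading stdout do not deadlock):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class GzipRace {

    /** In-process compression: no process creation, no pipe traffic. */
    static byte[] javaGzip(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        }
        return bos.toByteArray();
    }

    /** Pipes the input through the system gzip; pays exec + pipe overhead. */
    static byte[] systemGzip(byte[] input) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("gzip", "-c").start();
        Thread feeder = new Thread(() -> {
            try (OutputStream stdin = p.getOutputStream()) {
                stdin.write(input);
            } catch (IOException ignored) {
            }
        });
        feeder.start();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        InputStream stdout = p.getInputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = stdout.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        feeder.join();
        p.waitFor();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] input = new byte[1024 * 1024]; // 1 MB of random input
        new Random(42).nextBytes(input);
        long t0 = System.nanoTime();
        javaGzip(input);
        long t1 = System.nanoTime();
        systemGzip(input);
        long t2 = System.nanoTime();
        System.out.printf("java: %d ms, system: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```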

This test was performed on Fedora 18 (the disastrous OS), x64 architecture; other architectures and operating systems may give different results.