Let's have a list of Strings with some duplication, like (Bob, Sue, Joe, Bob, John, etc). Typically when you read the input from a database or from JSON on a REST API, all strings are a separate object, even if they are the same - like 'Bob' in the example.
Now the question is: how does it change the performance of an algorithm if we de-duplicate this list, therefore each string with the same value should actually be the same String object.
Why would it be any different? Because when the algorithm needs to access the data, it is less likely that it will miss the cache built into the CPU and therefore has to wait.
My benchmark does the same algorithm (sort) first on the non-deduplicated data, and then on the de-duplicated data, on this benchmark we can see the rate of the two results. Of course 1 means they are the same. The different colors show how many different string values there were in the list.
So when does it make sense to deduplicate?
- Will we elliminate any duplicates?
- What amount of data is there there be?
- What is the cost of the deduplication? - e.g. Strings have a constant cost with String.intern
- For how long / how many times can we enjoy not missing the cache? - typically if you just get the data through JSON and insert it into the DB, then you do not save much time. If you keep the data in memory and process it very frequently, then it with each processing we save a little time
- Also depends on what type of processing you do, in case of sorting you have to access the data quickly many times after each other. In case of e.g. JSON serialization with jackson, you need to access each data only once, and you can expect smaller difference in performance.
Benchmark source code published on github. Feel free to try, but I have to warn you that it takes 28 hours to run this benchmark.