2020-12-13 - A Tale of Two Integers

In this post, we delve into the performance implications of memory by looking into humble int arrays.

There are quite a few numerical types in Java. Most of them comes in pairs: short/Short, int/Integer, long/Long. In this post we'll look at the int/Integer pair.

On ints and Integers

So, what is the difference between an int and an Integer, apart from the latter being able to brag about having a capital 'I'? You probably know that int is a primitive, and that Integer is an Object. It thus follows that Integer references can be null, and that Integers can be used as generics in e.g. collections such as List.

Integers are much more convenient to use. So, why does Java even have ints? After all, Ruby gets away with doing just that. The answer is performance. Java can, contrary to what angry c++ programmers might tell you, be used to write fairly fast programmes.

Benchmarking

To look into this a bit more, let's forget all about maths in this example conceived to make a point. We have a pair of programs that creates an array of integers and sums it up.

Initialize and sum using int[]:

int count = 10_000_000;
int[] ints = new int[count];
for (int i = 0; i < count; i++) {
  ints[i] = i;
}
long sum = 0;
for (int i : ints) {
  sum += i;
}

Initialize and sum using Integer[]:

int count = 10_000_000;
Integer[] integers = new Integer[count];
for (int i = 0; i < count; i++) {
    integers[i] = new Integer(i);
}
long sum = 0;
for (Integer i : integers) {
    sum += i;
}

On my laptop, the int[] version averages 17.91 ms per round. The Integer[] version averages 364.47 ms per round. Thus, the Integer[] version is about 20x slower.

So, why is that? Since the Integer[] allocates 1 000 001 objects per round, the Garbage Collector has to work very very hard. In this case I used the G1 collector, which according to the GC logs spent 170% of the wallclock time in pauses. Allocating and garbage collecting objects is expensive. The int[] version allocated 1 object, the int[] array.

So, what happens if we only sum the integers? In this example, the arrays are already created. This should us benchmark something else than the Garbage Collector.

Sum using int[]:

// ints is a int[10_000_000]
long sum = 0;
for (int i : ints) {
  sum += i;
}

Sum using Integer[]:

// integers is a Integer[10_000_000]
long sum = 0;
for (Integer i : integers) {
    sum += i;
}

The Integer[] version now averages 23.86 ms per round. Great improvement! But the int[] version now averages 5.19 ms per round. About 4x faster.

Memory sizes

So, why is the int[] version faster now? After all, adding two integers is a single CPU instruction. The answer is memory usage. We can use JOL to see how much memory the different versions use:

// import org.openjdk.jol.info.GraphLayout;

int count = 10_000_000;
Integer[] integers = new Integer[count];
for (int i = 0; i < count; i++) {
    integers[i] = new Integer(i);
}
int[] ints = new int[count];
for (int i = 0; i < count; i++) {
    ints[i] = i;
}
System.out.printf(
        "Ints size: %d, Integer[] size: %d%n",
        GraphLayout.parseInstance(ints).totalSize(),
        GraphLayout.parseInstance(integers).totalSize()
);
// output: Ints size: 40000016, Integer[] size: 200000016

The int[] uses 40 000 016 bytes, while the Integer[] uses 200 000 016. For both arrays, the array object header uses 16 bytes. In the int[] case, we then use 4 bytes per value. For the Integer[], we use 4 bytes per Integer reference/pointer in the array. This could be higher depending on whether or not compressed OOPs can be used. Finally, 16 bytes are used per Integer object.

The Integer[] case thus needs to read ~5 times as much memory. This is important, since reading from memory is much, much slower than executing an add instruction. In fact, most of the time spent in our second benchmark is spent reading from memory. The 5x memory reading increase is very close to the 4x speed decrease.

CPU Caches tries to make reading memory faster. However, since both the int[] and Integer[] examples allocated more memory than the 4 MB L3 cache of my laptop processor, caching likely had limited effects in this example.

Interestingly, most profilers can't tell the difference between the CPU executing instructions and the the CPU reading from memory. Thus, even when tooling reports 100% CPU usage, your bottleneck can be memory usage.

Is this real?

Needless to say, these examples are contrived. But the costs are real. Object allocation and memory access affects the performance of real-world software. When I work with performance, I tend to focus on memory usage and allocations first.

In the game industry data oriented design is used to improve performance. The main idea is to optimize data structures to better utilize CPU caches and reduce the amount of memory read. I.e, addressing the issues we saw in our benchmarks. There is nothing limiting the techniques of data oriented design to game development, it can be used in any application.

Data oriented design does not have to be an all-or-nothing thing. Reducing memory allocations and optimizing your data layout to improve the cache hit rate can be done gradually, and should be applied in the parts of the application where it is needed the most.

How to reduce memory usage

The first step to reduce memory usage is to profile allocations - if you don't know which part of your programme is allocating memory you do not know what parts to improve. Pre-maturely optimizing the wrong thing is.. bad. Blunders is a profiler, but there are many others. Use one that can tell you about memory allocations.

Once you know which parts is slow, try changing the algorithm to use less memory. Sometimes that's as simple as removing a log statement, sometimes it requires rewriting significant parts of the software.

If you have a lot of collections with primitives, look into using fastutil or similar to reduce memory usage.

Summary

In this post, we have used a toy example to show just how much effect object allocation and memory size can have on performance.