Elasticsearch Benchmarking, Part 4: Inlining

Published 2024-03-01. Edited 2024-03-01 for structural clarity.

This post is the fourth in a series where we try to understand the details of a benchmark comparing Elasticsearch and OpenSearch, originally posted on the Elastic blog. In the last post we discovered that range queries and numerical sorting were ~80% faster in Elasticsearch when running the benchmark locally. This post will try to understand why.

tl;dr:

When benchmarking locally, range queries and numerical sorting were ~80% faster in Elasticsearch than in OpenSearch. We can eliminate the difference by understanding JVM inlining and properly warming up the JVM. We use this knowledge to make a Lucene change which seems to make some range queries and numerical sorting operations almost twice as fast in the benchmark setup.

Benchmarking Setup

The benchmark setup is the same as in the previous posts. I'm trying to replicate the benchmark posted by Elastic locally using Elasticsearch 8.11.0 (bundled with openjdk 21.0.1) and OpenSearch 2.11.0 (bundled with openjdk 17.0.8). Both setups are run (not concurrently) on the same hardware, a desktop with an AMD Ryzen 5 5600X CPU, 32 GB of RAM and a 2 TB NVMe SSD. The data used for the benchmark is fake logs - 1,740,763,348 documents in total. For more details on the setup, see the previous posts in this series.

What We Already Know

In the previous post, we saw that range queries and numerical sorting were reported as ~80% faster in Elasticsearch than in OpenSearch, in the big5 benchmark. Below are the relevant numbers from the benchmark:

Operation    Elasticsearch    OpenSearch    Diff %
range 1964.79 ms 3537.2 ms +80.03%
keyword-in-range 1970.58 ms 3552.7 ms +80.29%
sort_numeric_desc 729.779 ms 1308.88 ms +79.35%
sort_numeric_asc 773.676 ms 825.3 ms +6.67%

Range queries and numerical sorting from the big5 benchmark.

Lower is better

What did not make sense to me is why range queries and numerical sorting would be faster in Elasticsearch. The Lucene search library, which both Elasticsearch and OpenSearch use, should be doing most of the computation. Let's try to understand where the speedup comes from.

The Range Query

The range query used for the range operation is a relatively straightforward query on the @timestamp date field:

{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-01-01T00:00:00",
        "lt": "2023-01-03T00:00:00"
      }
    }
  }
}

Out of the 1,766,257,917 documents in the dataset, the range query matches 486,989,570 documents. It is worth noting that the data stream setup used by the benchmark does not use date-based indexing, so the query has to be executed against all indices in the cluster.

We will now be switching to throughput metrics instead of 90th percentile latency. These are the throughput metrics for the range query reported by Rally in the big5 benchmark:

Operation    Elasticsearch    OpenSearch    Diff %
range 0.512 queries/s 0.286 queries/s -44.14%

Throughput for the range query from the big5 benchmark.

Higher is better

Narrowing Down

In Rally, a benchmark is defined by a set of "operations" in a "schedule", forming a "challenge". I decided to try reproducing the behaviour with a range-only challenge:

{
  "name": "range-only",
  "default": false,
  "schedule": [
    {
      "operation": "range",
      "warmup-iterations": 10,
      "iterations": 100,
      "target-throughput": 2
    }
  ]
}

The only difference in the range schedule item from the original big5 benchmark is that I lowered the number of warmup iterations. The results?

Operation    Elasticsearch    OpenSearch    Diff %
range 0.69 queries/s 0.7 queries/s +1.45%

Throughput for the range query from the range-only challenge.

Higher is better

The range query is not faster in Elasticsearch when run in isolation. I investigated if this could be a tooling problem - e.g. Rally slowing down over time - but that was not the case.

Minimal Reproduction

The big5 challenge took around 2 hours to run, and was a bit hard to debug. I wanted to see how small a Rally challenge I could create that still slowed down the range query.

I started with the big5 challenge. I first removed the warmups, and then reduced the iteration count for all operations to 10. I could still see the range query slowdown.

I then started removing operations from the schedule, until the range query was no longer slow. The shortest schedule that I found that still slowed down the range query is this one, which I named big5rangedebug:

{
  "name": "big5rangedebug",
  "default": false,
  "schedule": [
    {
      "operation": "desc_sort_timestamp",
      "iterations": 10
    },
    {
      "operation": "composite-date_histogram-daily",
      "iterations": 10
    },
    {
      "operation": "range",
      "iterations": 10
    }
  ]
}

These are the results from running this schedule (with freshly started Elasticsearch/OpenSearch):

Operation    Elasticsearch    OpenSearch    Diff %
desc_sort_timestamp 0.96 queries/s 0.94 queries/s -2.08%
composite-date_histogram-daily 0.04 queries/s 0.04 queries/s 0.00%
range 0.42 queries/s 0.33 queries/s -21.43%

Throughput values for the big5rangedebug challenge.

Higher is better

So, we can reproduce the slowdown of the original big5 schedule using the big5rangedebug schedule. Removing either the desc_sort_timestamp or the composite-date_histogram-daily operation from the schedule makes the range query run faster again. The operations in the schedule are in the same order as in the original big5 schedule (although in the original schedule, there are operations between them).

The desc_sort_timestamp operation is a search query that sorts the documents by the @timestamp field in descending order. The composite-date_histogram-daily operation is an aggregation that groups the documents by day. Both of their exact definitions can be found in the benchmarking repository provided by Elastic.
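
For illustration, the desc_sort_timestamp search is roughly of this shape - this is only a sketch of mine, not the exact definition from the track, but the point is that it sorts on the same @timestamp field that the range query filters on:

{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}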

A side note here is that the composite-date_histogram-daily operation is heavily cached after the first run. This won't really matter for us - what matters is that it is run at least once.

Running the Range Query After the big5rangedebug Schedule

Let's go back to the single range query, in a variant with fewer iterations:

{
  "name": "range-only-short",
  "default": false,
  "schedule": [
    {
      "operation": "range",
      "iterations": 10
    }
  ]
}

If we run this schedule against the Elasticsearch/OpenSearch node right after the big5rangedebug schedule, without restarting the node, we get similar results to the range query within that schedule:

Operation    Elasticsearch    OpenSearch    Diff %
Range - In big5rangedebug 0.42 queries/s 0.33 queries/s -21.43%
Range - After big5rangedebug 0.45 queries/s 0.33 queries/s -26.67%
Range - Fresh Node 0.65 queries/s 0.68 queries/s +4.62%

Throughput for the range query after the big5rangedebug schedule compared to a fresh node.

Higher is better

So, we can conclude that running the range query after the other operations (desc_sort_timestamp and composite-date_histogram-daily) is slower on both Elasticsearch and OpenSearch - but the difference is much larger in OpenSearch.

How can the Operations Interact?

At this point, the obvious question is how first running the desc_sort_timestamp and composite-date_histogram-daily can slow down the range operation. I investigated whether this could be due to Garbage Collection pressure or caching issues, but could find nothing pointing in that direction.

There is, however, at least one other way that they can interact. All these queries make heavy use of Point Values in Lucene - they call a lot of similar methods. This could matter - due to the Java JIT compiler.

Recap - Benchmark Behaviour

Before diving into the JIT, let's recap what we have learnt from our benchmarking experiments.

  • Elasticsearch was faster at range queries when running the full big5 schedule
  • Elasticsearch was not faster at range queries when they were run in isolation
  • Elasticsearch was faster at range queries when run after the big5rangedebug schedule. Having this minimal reproduction will help us understand what is happening.

JIT and Inlining

Java uses Just-In-Time compilation - it first runs bytecode in an interpreted mode, and gradually compiles the bytecode to machine code as it is used. Part of this process is inlining - where the JIT compiler replaces a method call with the code from that method.

This is one of the main reasons why benchmarks on the JVM usually include a warmup phase - to give the JIT compiler enough input to make good inlining decisions.
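
As a standalone illustration (this example has nothing to do with Lucene), you can watch the JIT kick in on a small, hot method by compiling and running a class like this with javac Warmup.java && java -XX:+PrintCompilation Warmup:

// Warmup.java - minimal, self-contained sketch of JIT warmup.
// After enough iterations, square() gets compiled and inlined into the loop.
public class Warmup {

  private static long square(long x) {
    return x * x; // tiny, hot method - a prime candidate for inlining
  }

  public static void main(String[] args) {
    long sum = 0;
    for (int i = 0; i < 1_000_000; i++) {
      sum += square(i);
    }
    System.out.println(sum);
  }
}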

Running Without Inlining

Could the way the JIT compiler inlines code affect the benchmark results? I tried disabling inlining by setting the JVM flag -XX:-Inline and running the big5rangedebug challenge again:

Operation    Elasticsearch    OpenSearch    Diff %
desc_sort_timestamp 0.55 queries/s 0.56 queries/s +1.82%
composite-date_histogram-daily 0.02 queries/s 0.02 queries/s 0.00%
range 0.26 queries/s 0.27 queries/s +3.85%

Throughput values for the big5rangedebug challenge with inlining disabled.

Higher is better

Both Elasticsearch and OpenSearch run slower with inlining disabled - but the really interesting thing is what happens when running only the range queries on fresh (restarted) nodes. Let's compare our three different timings of range queries:

Operation    Elasticsearch    OpenSearch    Diff %
Range - After big5rangedebug 0.45 queries/s 0.33 queries/s -26.67%
Range - Fresh Node 0.65 queries/s 0.68 queries/s +4.62%
Range - Fresh Node, no inlining 0.28 queries/s 0.29 queries/s +3.57%

Throughput for the range query comparing running it after big5rangedebug, with a fresh node and a fresh node without inlining.

Higher is better

Without inlining, the range queries run at roughly the same speed in Elasticsearch and OpenSearch. This is roughly the same speed as OpenSearch when the query is run as part of the big5rangedebug schedule. Inlining seems to be the key to Elasticsearch being faster after the big5rangedebug schedule.

This is not caused by the different JDK versions used by Elasticsearch (21.0.1) and OpenSearch (17.0.8) - I tried running OpenSearch with the same JDK as Elasticsearch, and that did not change anything.

Understanding Inlining

Let's try to understand what inlining is actually happening. For these experiments I decided to use the same JDK (openjdk-21.0.1) for both ES and OS, to minimize noise.

I ran the big5rangedebug schedule against Elasticsearch and OpenSearch with the JVM flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining and collected the output. It was about 130 thousand lines for both ES and OS. We need to narrow that down.
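
For reference, this is roughly how the flags can be passed - the file name is just an example, and I'm assuming the jvm.options.d mechanism that both distributions ship with:

# config/jvm.options.d/print-inlining.options (hypothetical file name)
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintInlining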

How is the Range Query Executed?

In order to understand which method inlinings might be of interest to us, we can have a look at how the range query is executed. I collected a CPU profile while benchmarking the range query - let's have a look at it. The Elasticsearch and OpenSearch profiles are almost identical - the Lucene-level code is the interesting part. Therefore, I'll only show Elasticsearch:

[CPU profile of the range query in Elasticsearch]

Most of the CPU time is spent within the PointValues.intersect and BKDPointTree.addAll call trees. From what I can tell from the Lucene source code, the code recursively walks down an on-disk tree, which acts as an index for numeric values. Once it gets to a leaf, it adds all the values to a DocIdSet (usually a BitSet of document ids).
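
To make the structure concrete, here is a heavily simplified paraphrase of what that recursion looks like - this is my own sketch, not the actual Lucene code, and readLeafDocIdCount/readLeafDocId are made-up helpers:

// Simplified paraphrase of BKDPointTree.addAll - not the actual Lucene implementation.
void addAll(IntersectVisitor visitor) throws IOException {
  if (isLeafNode()) {
    // At a leaf: read the doc ids stored on disk and hand each one to the visitor,
    // which typically adds it to a DocIdSet/BitSet.
    int count = readLeafDocIdCount();   // made-up helper
    for (int i = 0; i < count; i++) {
      visitor.visit(readLeafDocId(i));  // made-up helper
    }
  } else {
    // Inner node: recurse into both children.
    pushLeft();
    addAll(visitor);
    pop();
    pushRight();
    addAll(visitor);
    pop();
  }
}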

Finding the Relevant Inlining Output

By cross-referencing the CPU profile and the inlining output, I believe that I have found the inlining that we are looking for. The last inlining done of the BKDPointTree.addAll method correlates with the range query executing. This is the full output of that inlining in Elasticsearch. You don't have to read it all, but it's here for reference:

@ 76   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::addAll (98 bytes)   inline (hot)
  @ 5   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::size (163 bytes)   too big
  @ 20   org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$1::grow (13 bytes)   low call site frequency
  @ 20   org.apache.lucene.search.PointRangeQuery$1$1::grow (13 bytes)   low call site frequency
   \-> TypeProfile (5/19 counts) = org/apache/lucene/search/PointRangeQuery$1$1
   \-> TypeProfile (14/19 counts) = org/apache/lucene/search/comparators/NumericComparator$NumericLeafComparator$1
  @ 28   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::isLeafNode (17 bytes)   inline (hot)
  @ 39   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::getLeafBlockFP (40 bytes)   inline (hot)
  @ 42   org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl::seek (32 bytes)   inline (hot)
   \-> TypeProfile (29355/29355 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
    @ 1   org.apache.lucene.store.MemorySegmentIndexInput::ensureOpen (14 bytes)   inline (hot)
    @ 12   java.util.Objects::checkIndex (7 bytes)   force inline by annotation
      @ 3   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
  @ 49   org.apache.lucene.store.DataInput::readVInt (114 bytes)   inline (hot)
   \-> TypeProfile (29332/29332 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
    @ 1   org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes)   inline (hot)
      @ 11   java.lang.foreign.MemorySegment::get (13 bytes)   force inline by annotation
       \-> TypeProfile (23678/23678 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
        @ 4   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::accessHandle (26 bytes)   force inline by annotation
        @ 9   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force inline by annotation
          @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annotation
          @ 46   java.lang.invoke.VarForm::getMemberName (31 bytes)   force inline by annotation
          @ 49   java.lang.invoke.VarHandleSegmentAsBytes::get (47 bytes)   force inline by annotation
            @ 14   java.lang.invoke.VarHandleSegmentAsBytes::checkAddress (21 bytes)   force inline by annotation
              @ 1   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
              @ 15   jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes)   force inline by annotation
                @ 26   jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes)   force inline by annotation
                  @ 16   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
            @ 24   jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes)   accessor
            @ 29   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes)   inline (hot)
            @ 40   java.lang.invoke.VarHandleSegmentAsBytes::offsetNoVMAlignCheck (39 bytes)   force inline by annotation
              @ 1   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes)   accessor
              @ 13   jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes)   inline (hot)
            @ 43   jdk.internal.misc.ScopedMemoryAccess::getByte (16 bytes)   force inline by annotation
              @ 4   jdk.internal.misc.ScopedMemoryAccess::getByteInternal (34 bytes)   force inline by annotation
                @ 5   jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes)   force inline by annotation
                @ 13   jdk.internal.misc.Unsafe::getByte (0 bytes)   (intrinsic)
                @ 19   java.lang.ref.Reference::reachabilityFence (1 bytes)   force inline by annotation
    @ 17   org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes)   inline (hot)
      @ 11   java.lang.foreign.MemorySegment::get (13 bytes)   force inline by annotation
       \-> TypeProfile (23678/23678 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
        @ 4   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::accessHandle (26 bytes)   force inline by annotation
        @ 9   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force inline by annotation
          @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annotation
          @ 46   java.lang.invoke.VarForm::getMemberName (31 bytes)   force inline by annotation
          @ 49   java.lang.invoke.VarHandleSegmentAsBytes::get (47 bytes)   force inline by annotation
            @ 14   java.lang.invoke.VarHandleSegmentAsBytes::checkAddress (21 bytes)   force inline by annotation
              @ 1   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
              @ 15   jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes)   force inline by annotation
                @ 26   jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes)   force inline by annotation
                  @ 16   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
            @ 24   jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes)   accessor
            @ 29   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes)   inline (hot)
            @ 40   java.lang.invoke.VarHandleSegmentAsBytes::offsetNoVMAlignCheck (39 bytes)   force inline by annotation
              @ 1   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes)   accessor
              @ 13   jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes)   inline (hot)
            @ 43   jdk.internal.misc.ScopedMemoryAccess::getByte (16 bytes)   force inline by annotation
              @ 4   jdk.internal.misc.ScopedMemoryAccess::getByteInternal (34 bytes)   force inline by annotation
                @ 5   jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes)   force inline by annotation
                @ 13   jdk.internal.misc.Unsafe::getByte (0 bytes)   (intrinsic)
                @ 19   java.lang.ref.Reference::reachabilityFence (1 bytes)   force inline by annotation
    @ 38   org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes)   too big
    @ 59   org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes)   too big
  @ 63   org.apache.lucene.util.bkd.DocIdsWriter::readInts (140 bytes)   inline (hot)
    @ 1   org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes)   inline (hot)
     \-> TypeProfile (15569/15569 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
      @ 11   java.lang.foreign.MemorySegment::get (13 bytes)   force inline by annotation
       \-> TypeProfile (23678/23678 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
        @ 4   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::accessHandle (26 bytes)   force inline by annotation
        @ 9   java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes)   force inline by annotation
          @ 3   java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes)   force inline by annotation
          @ 46   java.lang.invoke.VarForm::getMemberName (31 bytes)   force inline by annotation
          @ 49   java.lang.invoke.VarHandleSegmentAsBytes::get (47 bytes)   force inline by annotation
            @ 14   java.lang.invoke.VarHandleSegmentAsBytes::checkAddress (21 bytes)   force inline by annotation
              @ 1   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
              @ 15   jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes)   force inline by annotation
                @ 26   jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes)   force inline by annotation
                  @ 16   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
            @ 24   jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes)   accessor
            @ 29   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes)   inline (hot)
            @ 40   java.lang.invoke.VarHandleSegmentAsBytes::offsetNoVMAlignCheck (39 bytes)   force inline by annotation
              @ 1   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes)   accessor
              @ 13   jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes)   inline (hot)
            @ 43   jdk.internal.misc.ScopedMemoryAccess::getByte (16 bytes)   force inline by annotation
              @ 4   jdk.internal.misc.ScopedMemoryAccess::getByteInternal (34 bytes)   force inline by annotation
                @ 5   jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes)   force inline by annotation
                @ 13   jdk.internal.misc.Unsafe::getByte (0 bytes)   (intrinsic)
                @ 19   java.lang.ref.Reference::reachabilityFence (1 bytes)   force inline by annotation
    @ 109   org.apache.lucene.util.bkd.DocIdsWriter::readInts32 (39 bytes)   inline (hot)
      @ 7   org.apache.lucene.store.MemorySegmentIndexInput::readInts (57 bytes)   inline (hot)
        @ 14   java.lang.foreign.MemorySegment::copy (29 bytes)   force inline by annotation
          @ 1   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
          @ 7   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
          @ 12   java.util.Objects::requireNonNull (14 bytes)   force inline by annotation
          @ 25   jdk.internal.foreign.AbstractMemorySegmentImpl::copy (238 bytes)   force inline by annotation
            @ 2   java.lang.Object::getClass (0 bytes)   (intrinsic)
            @ 5   jdk.internal.foreign.AbstractMemorySegmentImpl::getBaseAndScale (189 bytes)   inline (hot)
              @ 3   java.lang.Object::equals (11 bytes)   inline (hot)
              @ 25   java.lang.Object::equals (11 bytes)   inline (hot)
              @ 47   java.lang.Object::equals (11 bytes)   inline (hot)
              @ 70   java.lang.Object::equals (11 bytes)   inline (hot)
            @ 12   java.lang.Object::getClass (0 bytes)   (intrinsic)
            @ 15   java.lang.Class::componentType (16 bytes)   inline (hot)
              @ 1   java.lang.Class::isArray (0 bytes)   (intrinsic)
            @ 19   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::carrier (5 bytes)   accessor
            @ 79   jdk.internal.foreign.Utils::checkElementAlignment (17 bytes)   force inline by annotation
              @ 1   jdk.internal.foreign.Utils::isElementAligned (48 bytes)   force inline by annotation
                @ 27   jdk.internal.foreign.layout.AbstractLayout::byteAlignment (5 bytes)   accessor
                @ 33   jdk.internal.foreign.layout.AbstractLayout::byteSize (5 bytes)   accessor
            @ 86   jdk.internal.foreign.AbstractMemorySegmentImpl::isAlignedForElement (12 bytes)   force inline by annotation
              @ 3   jdk.internal.foreign.layout.AbstractLayout::byteAlignment (5 bytes)   accessor
              @ 8   jdk.internal.foreign.AbstractMemorySegmentImpl::isAlignedForElement (26 bytes)   force inline by annotation
                @ 1   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes)   accessor
                @ 7   jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes)   inline (hot)
            @ 113   jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes)   force inline by annotation
              @ 26   jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes)   force inline by annotation
                @ 16   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
            @ 122   java.lang.reflect.Array::getLength (0 bytes)   (intrinsic)
            @ 125   java.util.Objects::checkFromIndexSize (8 bytes)   inline (hot)
              @ 4   jdk.internal.util.Preconditions::checkFromIndexSize (25 bytes)   inline (hot)
            @ 137   jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::order (5 bytes)   accessor
            @ 142   java.nio.ByteOrder::nativeOrder (4 bytes)   inline (hot)
            @ 148   jdk.internal.misc.ScopedMemoryAccess::getScopedMemoryAccess (4 bytes)   inline (hot)
            @ 153   jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes)   accessor
            @ 159   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes)   inline (hot)
            @ 164   jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes)   accessor
            @ 187   jdk.internal.misc.ScopedMemoryAccess::copyMemory (27 bytes)   force inline by annotation
              @ 12   jdk.internal.misc.ScopedMemoryAccess::copyMemoryInternal (56 bytes)   force inline by annotation
                @ 5   jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes)   force inline by annotation
                @ 28   jdk.internal.misc.Unsafe::copyMemory (33 bytes)   inline (hot)
                  @ 9   jdk.internal.misc.Unsafe::copyMemoryChecks (21 bytes)   inline (hot)
                    @ 3   jdk.internal.misc.Unsafe::checkSize (32 bytes)   inline (hot)
                    @ 9   jdk.internal.misc.Unsafe::checkPrimitivePointer (19 bytes)   inline (hot)
                      @ 3   jdk.internal.misc.Unsafe::checkPointer (19 bytes)   inline (hot)
                        @ 6   jdk.internal.misc.Unsafe::checkNativeAddress (28 bytes)   inline (hot)
                    @ 17   jdk.internal.misc.Unsafe::checkPrimitivePointer (19 bytes)   inline (hot)
                      @ 3   jdk.internal.misc.Unsafe::checkPointer (19 bytes)   inline (hot)
                        @ 15   jdk.internal.misc.Unsafe::checkOffset (32 bytes)   inline (hot)
                      @ 12   java.lang.Object::getClass (0 bytes)   (intrinsic)
                      @ 15   jdk.internal.misc.Unsafe::checkPrimitiveArray (22 bytes)   inline (hot)
                        @ 1   java.lang.Class::getComponentType (14 bytes)   inline (hot)
                          @ 1   java.lang.Class::isArray (0 bytes)   (intrinsic)
                        @ 10   java.lang.Class::isPrimitive (0 bytes)   (intrinsic)
                  @ 29   jdk.internal.misc.Unsafe::copyMemory0 (0 bytes)   (intrinsic)
                @ 32   java.lang.ref.Reference::reachabilityFence (1 bytes)   force inline by annotation
                @ 36   java.lang.ref.Reference::reachabilityFence (1 bytes)   force inline by annotation
      @ 27   org.apache.lucene.search.PointRangeQuery$1$1::visit (9 bytes)   inline (hot)
      @ 27   org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$1::visit (21 bytes)   inline (hot)
       \-> TypeProfile (1182090/2633788 counts) = org/apache/lucene/search/comparators/NumericComparator$NumericLeafComparator$1
       \-> TypeProfile (1451698/2633788 counts) = org/apache/lucene/search/PointRangeQuery$1$1
        @ 17   org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes)   inline (hot)
        @ 17   org.apache.lucene.util.DocIdSetBuilder$BufferAdder::add (24 bytes)   inline (hot)
         \-> TypeProfile (51012/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$BufferAdder
         \-> TypeProfile (87110/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
          @ 5   org.apache.lucene.util.FixedBitSet::set (56 bytes)   inline (hot)
        @ 5   org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes)   inline (hot)
         \-> TypeProfile (1347015/1347015 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
          @ 5   org.apache.lucene.util.FixedBitSet::set (56 bytes)   inline (hot)
  @ 70   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pushLeft (26 bytes)   inline (hot)
    @ 22   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::readNodeData (465 bytes)   hot method too big
  @ 76   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::addAll (98 bytes)   recursive inlining is too deep
  @ 80   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pop (21 bytes)   inline (hot)
  @ 84   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pushRight (87 bytes)   inline (hot)
    @ 56   org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl::seek (32 bytes)   inline (hot)
     \-> TypeProfile (8447/8447 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
      @ 1   org.apache.lucene.store.MemorySegmentIndexInput::ensureOpen (14 bytes)   inline (hot)
      @ 12   java.util.Objects::checkIndex (7 bytes)   force inline by annotation
        @ 3   jdk.internal.util.Preconditions::checkIndex (22 bytes)   (intrinsic)
    @ 83   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::readNodeData (465 bytes)   hot method too big
  @ 90   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::addAll (98 bytes)   recursive inlining is too deep
  @ 94   org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pop (21 bytes)   inline (hot)

For OpenSearch, the output is very similar. Let's focus on where they differ. In Elasticsearch, towards the bottom of the output within the DocIdsWriter::readInts32 method, we see this:

@ 27   org.apache.lucene.search.PointRangeQuery$1$1::visit (9 bytes)   inline (hot)
@ 27   org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$1::visit (21 bytes)   inline (hot)
 \-> TypeProfile (1182090/2633788 counts) = org/apache/lucene/search/comparators/NumericComparator$NumericLeafComparator$1
 \-> TypeProfile (1451698/2633788 counts) = org/apache/lucene/search/PointRangeQuery$1$1
  @ 17   org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes)   inline (hot)
  @ 17   org.apache.lucene.util.DocIdSetBuilder$BufferAdder::add (24 bytes)   inline (hot)
   \-> TypeProfile (51012/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$BufferAdder
   \-> TypeProfile (87110/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
    @ 5   org.apache.lucene.util.FixedBitSet::set (56 bytes)   inline (hot)
  @ 5   org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes)   inline (hot)
   \-> TypeProfile (1347015/1347015 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
    @ 5   org.apache.lucene.util.FixedBitSet::set (56 bytes)   inline (hot)

However, in OpenSearch we instead see this:

@ 27   org.apache.lucene.index.PointValues$IntersectVisitor::visit (0 bytes)   virtual call

It seems like Elasticsearch inlines the PointValues$IntersectVisitor::visit calls, while OpenSearch instead performs a virtual call. What does this mean?

Polymorphism in BKDPointTree

So, what's happening here? Higher up in the call stack, the BKDPointTree.addAll method takes a PointValues.IntersectVisitor as input:

public void addAll(
  PointValues.IntersectVisitor visitor,
  boolean grown
) throws IOException {...

The full source of this method is available in the Lucene GitHub repository.

PointValues.IntersectVisitor is an interface. The visitor instance is then passed all the way down to the DocIdsWriter::readInts32 method:

private void readInts32(
  IndexInput in,
  int count,
  IntersectVisitor visitor
) throws IOException {
  in.readInts(scratch, 0, count);
  for (int i = 0; i < count; i++) {
    visitor.visit(scratch[i]);
  }
}

This is where the visit method is called. For the range query, I believe this happens a bit more than once per matched document (across all indices, we matched 486,989,570 documents).

In OpenSearch, the visit method gets called as a virtual call, while in Elasticsearch it is inlined as a direct call to either PointRangeQuery$1$1.visit or NumericComparator$NumericLeafComparator$1.visit. The reason for this has to do with how the JVM figures out which actual implementation of the visit method to call.

Polymorphic Inlining in the JVM

The JVM uses a few rules to determine how to inline polymorphic method calls - calls to methods where more than one implementation exists.

  • If the method has only been called with one implementation, it will inline that implementation. This is known as monomorphic inlining.
  • If the method has been called with two implementations, it can inline both implementations - essentially treating it as an if/else statement. This is known as bimorphic inlining.
  • If the method has been called with three or more implementations, it will generally not be inlined - the call site is megamorphic, and a plain virtual call is made (illustrated in the sketch below).
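
To make this concrete, here is a minimal sketch of the three cases, using made-up Shape types (nothing to do with Lucene):

// PolymorphicCalls.java - minimal sketch with hypothetical Shape implementations.
interface Shape { double area(); }
record Square(double side) implements Shape { public double area() { return side * side; } }
record Circle(double r) implements Shape { public double area() { return Math.PI * r * r; } }
record Triangle(double b, double h) implements Shape { public double area() { return b * h / 2; } }

class PolymorphicCalls {
  static double totalArea(Shape[] shapes) {
    double total = 0;
    for (Shape shape : shapes) {
      // If only Square instances ever reach this call site: monomorphic, inlined.
      // If Square and Circle reach it: bimorphic, both can be inlined behind a type check.
      // Once a third type (Triangle) shows up: megamorphic, a plain virtual call is made.
      total += shape.area();
    }
    return total;
  }
}

The key point is that this is a property of the call site at runtime, not of the source code: the same totalArea method ends up fast or slow depending on which types have flowed through it - exactly what we are seeing with IntersectVisitor.visit.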

For a more thorough discussion of this, I recommend this blog post written by Aleksey Shipilëv - an OpenJDK engineer. Specifically, the section "Three types and beyond" gives some really good insights. The performance numbers from his micro-benchmarks also seem to match what we have seen in our query benchmarks.

Why Different Inlining?

So, why are different inlinings done in Elasticsearch than in OpenSearch?

In Elasticsearch, the IntersectVisitor.visit method is called on two implementations during the benchmark - NumericComparator$NumericLeafComparator$1 and PointRangeQuery$1$1 - meaning that it can be (bimorphically) inlined. In OpenSearch, the visit method is called on three implementations - NumericComparator$NumericLeafComparator$1, PointRangeQuery$1$1 and PointRangeQuery$1$2 (note the "2" at the end).

NumericComparator$NumericLeafComparator$1 is used when sorting documents by the timestamp field, in the desc_sort_timestamp operation. PointRangeQuery$1$1 and PointRangeQuery$1$2 are both used for range queries. Lucene uses different implementations depending on whether it thinks it is cheaper to find the matching documents, or to find the non-matching documents and then invert the result.

The range query in the range operation ends up using the inclusive implementation (PointRangeQuery$1$1) in both Elasticsearch and OpenSearch. However, the range query in composite-date_histogram-daily matches more documents, and thus ends up using the exclusive implementation (PointRangeQuery$1$2) in OpenSearch.

But why does Elasticsearch not end up using the exclusive implementation in our benchmark? The reason is that Elasticsearch is slightly smarter about optimising requests with size 0 where the range query and the histogram both use the same numeric field. It ends up not executing the query at all for the composite-date_histogram (it instead applies the range restriction as part of the aggregation).
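
Conceptually, the inclusive and exclusive flavours differ only in what visit(docID) does - roughly like this sketch, which uses a made-up SimpleVisitor interface rather than Lucene's real IntersectVisitor:

import java.util.BitSet;

// Conceptual sketch only - not the actual Lucene classes.
interface SimpleVisitor { void visit(int docID); }

class RangeVisitors {
  // "PointRangeQuery$1$1" style: start from an empty set and mark the docs
  // that fall inside the range.
  static SimpleVisitor collectMatches(BitSet result) {
    return result::set;
  }

  // "PointRangeQuery$1$2" style: start from a set with every doc marked and
  // unmark the docs that fall outside the range; the remaining bits are the matches.
  static SimpleVisitor removeNonMatches(BitSet result) {
    return result::clear;
  }
}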

Forcing Elasticsearch to use Virtual Calls

Can we make Elasticsearch use virtual calls to IntersectVisitor.visit when running the range query? Yes - by forcing the execution of the range query in the composite-date_histogram-daily operation in Elasticsearch by changing the request to ask for 1 (or more) documents to be returned:

{
  "size": 1,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2022-12-30T00:00:00",
        "lt": "2023-01-07T12:00:00"
      }
    }
  },
  "aggs": {
    "logs": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "day"
              }
            }
          }
        ]
      }
    }
  }
}

The only difference from before is that the size parameter is 1 instead of 0. Let's create a copy of the big5rangedebug schedule, named big5rangedebug-histo-size-1, using this new version of the composite-date_histogram-daily operation:

Operation    Elasticsearch    OpenSearch    Diff %
desc_sort_timestamp 0.91 queries/s 0.87 queries/s -4.40%
composite-date_histogram-daily-size-1 0.04 queries/s 0.04 queries/s 0.00%
range 0.29 queries/s 0.33 queries/s +13.79%

Throughput values for the big5rangedebug-histo-size-1 schedule.

Higher is better

As we suspected, setting the size parameter to 1 in the composite-date_histogram-daily operation resulted in the range query running slower in Elasticsearch, matching the behaviour in OpenSearch. This strengthens our hypothesis that inlining is the root cause of the difference in range query performance in our local benchmarks.

Recap - JIT Inlining

Inlining in the JVM is not easy to understand. Let's look at what we have learnt in this section.

  • The JIT will often inline method calls into the calling method - but not if three or more implementations of the called method have been invoked from the call site in the calling method.
  • This matters to us because Elasticsearch happens to use two implementations of IntersectVisitor.visit while OpenSearch uses three. In real-world usage, at least three implementations would likely be used by both ES and OS, making the OpenSearch numbers more realistic.
  • By forcing Elasticsearch to use three implementations, Elasticsearch stops being faster at the range query.

Returning to the big5 Benchmark

So far we have done small experiments to understand the range query performance. We have, however, used very low iteration counts - I would not use these as a basis for any conclusions. Instead we will return to the original big5 benchmark.

However, this time, we will first run the big5rangedebug-histo-size-1 schedule - in order to trigger the virtual call behaviour before the main benchmark run. From what we have learnt, we should expect that the range query will run at about the same speed in Elasticsearch and OpenSearch. These are the relevant results from the big5 schedule (again in 90th percentile service times):

Operation    Elasticsearch    OpenSearch    Diff %
range 3274.09 ms 3502.62 ms +6.98%
keyword-in-range 3292.79 ms 3521.31 ms +6.94%
sort_numeric_desc 1075.16 ms 1016.13 ms -5.49%
sort_numeric_asc 841.639 ms 819.042 ms -2.68%

big5 schedule results after first running the big5rangedebug-histo-size-1 schedule once.

Lower is better

It seems like our predictions were correct. By first forcing the visit call site to compile with a virtual call, OpenSearch now ran about as fast as Elasticsearch. I also tried a similar experiment with the full big5 schedule as warmup, and saw similar results.

Elasticsearch using one fewer implementation of the visit method than OpenSearch in the big5 benchmark is the only reason why Elasticsearch was faster at the range, keyword-in-range and sort_numeric_desc operations in this benchmark. This would likely not happen in production.

The sort_numeric_asc operation does not seem as obviously affected, though. It might be interesting to dig into why, but this post is already long enough.

Recap - Rerunning big5

After running the big5 schedule with our warmups, we have learnt some things:

  • By forcing Elasticsearch to use three implementations of IntersectVisitor.visit, Elasticsearch no longer outperforms OpenSearch in the big5 schedule.
  • This has less to do with Elasticsearch vs. OpenSearch and more to do with how many IntersectVisitor implementations had been used before the main benchmark.

Improving Things

So far in this post, we have been comparing how Elasticsearch and OpenSearch perform. Let's switch gears and see how we can use what we have learnt to improve Lucene - benefiting both Elasticsearch and OpenSearch.

Changing the readInts32 method

Remember the readInts32 method in the DocIdsWriter class? Here it is, again:

private void readInts32(
  IndexInput in,
  int count,
  IntersectVisitor visitor
) throws IOException {
  in.readInts(scratch, 0, count);
  for (int i = 0; i < count; i++) {
    visitor.visit(scratch[i]);
  }
}

I made a small change to this method (and to the very similar readDelta16 method). The idea is to make fewer virtual calls by instead calling another version of the visit method, one which takes a DocIdSetIterator as input. This is the conceptual change:

private void readInts32(
  IndexInput in,
  int count,
  IntersectVisitor visitor
) throws IOException {
  in.readInts(scratch, 0, count);
  DocIdSetIterator disi = docIdSetIterator(scratch, count);
  visitor.visit(disi);
}

Lucene uses 512 as the default maximum number of point values to store in each leaf of the BKD tree.

// BKDConfig.java
public static final int DEFAULT_MAX_POINTS_IN_LEAF_NODE = 512;

This means that the count parameter will be at most 512. I debugged our benchmark to verify this, and it is indeed almost always 512. So we should be making roughly 512 times fewer virtual calls in this method.
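
The docIdSetIterator helper used in the conceptual change above is not shown. Roughly, it needs to expose the scratch buffer as a DocIdSetIterator - something along these lines (my own sketch under assumptions, not the actual committed code):

// Sketch only - not the actual Lucene patch.
// Note: assumes the caller is fine with doc ids in buffer order.
import org.apache.lucene.search.DocIdSetIterator;

final class ScratchDocIdSetIterator extends DocIdSetIterator {
  private final int[] docIds;
  private final int count;
  private int index = -1;

  ScratchDocIdSetIterator(int[] docIds, int count) {
    this.docIds = docIds;
    this.count = count;
  }

  @Override
  public int docID() {
    if (index < 0) return -1;
    return index < count ? docIds[index] : NO_MORE_DOCS;
  }

  @Override
  public int nextDoc() {
    index++;
    return docID();
  }

  @Override
  public int advance(int target) {
    // Simple linear advance - fine here, since the visitor only iterates forward.
    int doc = nextDoc();
    while (doc < target) {
      doc = nextDoc();
    }
    return doc;
  }

  @Override
  public long cost() {
    return count;
  }
}

With something like this, the loop of up to 512 virtual visit(int) calls per leaf is replaced by a single virtual visit(DocIdSetIterator) call, and the tight loop over the buffer can live inside the (possibly inlined) visitor implementation instead.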

The full change required a bit more boilerplate; I forked Lucene and committed it to GitHub. The commit is on top of the main branch, but also applies cleanly to the 9.8.0 release, which Elasticsearch 8.11.0 uses.

Running Benchmarks with the Change

I built Lucene 9.8.0 with my change applied and used that to once again run the big5 schedule. Again, I ran it after the big5rangedebug-histo-size-1 schedule to first trigger the virtual call behaviour.

Now, we are comparing results between Elasticsearch and Elasticsearch with the Lucene change applied (I expect OpenSearch would see similar results). Some operations saw no significant change; these are the ones that did. As usual, in 90th percentile service times:

Operation    Elasticsearch    Elasticsearch with patch    Diff %
desc_sort_timestamp 1707.76 ms 1013.44 ms -40.66%
asc_sort_timestamp 1814.87 ms 1057.4 ms -41.74%
desc_sort_with_after_timestamp 1493.22 ms 889.041 ms -40.46%
asc_sort_with_after_timestamp 1775.64 ms 1036.73 ms -41.61%
desc_sort_timestamp_can_match_shortcut 245.554 ms 153.338 ms -37.55%
desc_sort_timestamp_no_can_match_shortcut 245.001 ms 155.004 ms -36.73%
asc_sort_timestamp_can_match_shortcut 282.434 ms 170.331 ms -39.69%
asc_sort_timestamp_no_can_match_shortcut 286.081 ms 182.341 ms -36.26%
range 3274.09 ms 1858.64 ms -43.23%
keyword-in-range 3292.79 ms 1789.29 ms -45.66%
query-string-on-message-filtered 1280.72 ms 1013.25 ms -20.88%
query-string-on-message-filtered-sorted-num 459.897 ms 328.508 ms -28.57%
sort_numeric_desc 1075.16 ms 797.731 ms -25.80%
sort_numeric_asc 841.639 ms 823.912 ms -2.11%

Comparing big5 schedule results with and without our change, after first running the big5rangedebug-histo-size-1 schedule once.

Lower is better

For many operations, we see a response time reduction of around 40% - in other words, almost twice as fast. This is crazy. I did not quite trust it, so I re-ran the benchmark twice, also adding an extra big5 schedule as additional warmup. The results were consistent. The same also holds true for the 50th percentile.

It should be noted that the patch seems to bring the response times down to the same level as when running without the warmup - i.e. it seems to undo the performance degradation caused by the virtual calls (as opposed to recreating the bimorphic inlining that made Elasticsearch faster than OpenSearch). Of course, in real-life scenarios, it is likely that the Elasticsearch (or OpenSearch) JVM has been warmed up enough for the virtual call to happen.

Now, this is only one benchmark - it might be that the patch has a larger impact on our data and setup than it would elsewhere. On the other hand, it is very likely not a performance degradation in general. I will create a Pull Request to Lucene with it, and see what the maintainers think.

Edit 2024-03-024: Pull Request created.

Recap - Making Lucene Faster

We made a small change to Lucene - what has that taught us?

  • In Lucene, reducing the number of virtual calls to the IntersectVisitor made range queries almost twice as fast when running in a JVM where three IntersectVisitor.visit implementations had been used.
  • This is due to the virtual calls happening so often. Without the patch, there would be at least ~486,989,570 virtual calls per search (the number of matched documents). With the patch, this number is divided by ~512 (the maximum number of points in a BKD leaf).
  • We verified this to be true in Elasticsearch, but expect the same to hold for OpenSearch.

Conclusions

We have dug ourselves quite deep into a rabbit hole of JVM inlining - but we have learnt some important insights.

Impact of previous operations

The range query execution time in the big5 benchmark was heavily affected by which operations had run before it since the last restart. This ended up favouring Elasticsearch in the big5 benchmark for very non-obvious reasons.

The original benchmark done by Elastic was run 50 times, with no mention of restarting the nodes between each run. I thus believe that their results are minimally (if at all) affected by this favouring issue - and I should be clear in saying that I don't think this was malicious behaviour by Elastic.

Running benchmarks on fresh nodes is better for reproducibility. But long-running nodes are probably a more realistic production scenario; in real-world production clusters of Elasticsearch and OpenSearch, I find it quite likely that BKDPointTree::addAll would have been called with at least 3 implementations of IntersectVisitor. This should be part of the warm-up before running range or numerical sorting benchmarks, as sketched below.
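
For example, a warm-up along these lines (a sketch of mine that reuses the operations from earlier in this post, not a schedule from the official track) would exercise at least three visit implementations before the measured range operation:

{
  "name": "range-with-realistic-warmup",
  "default": false,
  "schedule": [
    {
      "operation": "desc_sort_timestamp",
      "iterations": 10
    },
    {
      "operation": "composite-date_histogram-daily-size-1",
      "iterations": 10
    },
    {
      "operation": "range",
      "warmup-iterations": 10,
      "iterations": 100,
      "target-throughput": 2
    }
  ]
}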

The range query is probably as fast in OpenSearch as in Elasticsearch for realistic use. This is to be expected, since Lucene does the heavy lifting.

General Benchmark Learnings

"Warming up" a JVM for a benchmark is generally seen as a way to speed up subsequent operations, without regard for the fact that it might also slow down other operations.

The main purpose of benchmarks is to predict behaviour in production environments. Thus, rather than seeing the JVM warm-up as a way to speed up operations, it should be seen as a way to get the JVM into a state reasonably similar to what it would be in a production environment.

If you are building libraries, such as Lucene, this is hard to get right. If you are building applications, you can warm up the JVM by running representative workloads against the application that you are building. One way to bypass this issue is to instrument how your code behaves in production. Blunders.io can help with that.

Small Changes can have a Big Impact

We also saw that a very small change to the Lucene codebase could make some queries almost twice as fast. I find this phenomenon quite common - if you have a good hypothesis of what is taking time, tiny changes can often have as large of an impact as fancy algorithms.

I hope that you have enjoyed this post. Blunders.io can help you bypass the inlining issues, or find what small changes can have a big performance impact for you, by profiling in production. If you want to learn more, or have any questions about this post, please reach out to me at anton@blunders.io.