Published 2024-03-01. Edited 2024-03-01 for structural clarity.
This post is the fourth of a series where we try to understand the details of a benchmark comparing Elasticsearch and OpenSearch, originally posted on the Elastic blog. In the last post we discovered that range queries and numerical sorting were ~80% faster in Elasticsearch when running the benchmark locally. This post will try to understand why.
When benchmarking locally, range queries and numerical sorting were ~80% faster in Elasticsearch than in OpenSearch. We can eliminate the difference by understanding JVM inlining and properly warming up the JVM. We use this knowledge to make a Lucene change which seems to make some range queries and numerical sorting operations almost twice as fast in the benchmark setup.
The benchmark setup is the same as in the previous posts. I'm trying to replicate the benchmark posted by Elastic locally using Elasticsearch 8.11.0 (bundled with openjdk 21.0.1) and OpenSearch 2.11.0 (bundled with openjdk 17.0.8). Both setups are run (not concurrently) on the same hardware, a desktop with an AMD Ryzen 5 5600X CPU, 32 GB of RAM and a 2 TB NVMe SSD. The data used for the benchmark is fake logs - 1,740,763,348 documents in total. For more details on the setup, see the previous posts in this series.
In the previous post, we saw that range queries and numerical sorting were reported as ~80% faster in Elasticsearch than in OpenSearch, in the big5 benchmark. Below are the relevant numbers from the benchmark:
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
range | 1964.79 ms | 3537.2 ms | +80.03% |
keyword-in-range | 1970.58 ms | 3552.7 ms | +80.29% |
sort_numeric_desc | 729.779 ms | 1308.88 ms | +79.35% |
sort_numeric_asc | 773.676 ms | 825.3 ms | +6.67% |
Range queries and numerical sorting from the big5 benchmark.
Lower is better
What did not make sense to me was why range queries and numerical sorting would be faster in Elasticsearch. The Lucene search library, which both Elasticsearch and OpenSearch use, should be doing most of the computation. Let's try to understand where the speedup comes from.
The range query used for the range operation is a relatively straightforward query on the @timestamp date field:
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2023-01-01T00:00:00",
        "lt": "2023-01-03T00:00:00"
      }
    }
  }
}
Out of the 1,766,257,917 documents in the dataset, the range query matches 486,989,570 documents. It is worth noting that the data stream setup used by the benchmark does not use date-based indexing, so the query has to be executed against all indices in the cluster.
We will now be switching to throughput metrics instead of 90th percentile latency. These are the throughput metrics for the range query reported by Rally in the big5 benchmark:
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
range | 0.512 queries/s | 0.286 queries/s | -44.14% |
Throughput for the range query from the big5 benchmark.
Higher is better
In Rally, a benchmark is defined by a set of "operations" in a "schedule", forming a "challenge". I decided to try reproducing the behaviour with a range-only challenge:
{
  "name": "range-only",
  "default": false,
  "schedule": [
    {
      "operation": "range",
      "warmup-iterations": 10,
      "iterations": 100,
      "target-throughput": 2
    }
  ]
}
The only difference in the range schedule item from the original big5 benchmark is that I lowered the number of warmup iterations. The results?
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
range | 0.69 queries/s | 0.7 queries/s | +1.45% |
Throughput for the range query from the range-only challenge.
Higher is better
The range query is not faster in Elasticsearch when run in isolation. I investigated if this could be a tooling problem - e.g. Rally slowing down over time - but that was not the case.
The big5 challenge took around 2 hours to run, and was a bit hard to debug. I wanted to see how small a Rally challenge I could create that still slowed down the range query.
I started with the big5 challenge, removed the warmup iterations, and reduced the iteration count for all operations to 10. I could still see the range query slowdown.
I then started removing operations from the schedule, until the range query was no longer slow. The shortest schedule that I found that still slowed down the range query is this schedule, which I named big5rangedebug:
{
  "name": "big5rangedebug",
  "default": false,
  "schedule": [
    {
      "operation": "desc_sort_timestamp",
      "iterations": 10
    },
    {
      "operation": "composite-date_histogram-daily",
      "iterations": 10
    },
    {
      "operation": "range",
      "iterations": 10
    }
  ]
}
These are the results from running this schedule (with freshly started Elasticsearch/OpenSearch):
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
desc_sort_timestamp | 0.96 queries/s | 0.94 queries/s | -2.08% |
composite-date_histogram-daily | 0.04 queries/s | 0.04 queries/s | 0 |
range | 0.42 queries/s | 0.33 queries/s | -21.43% |
Throughput values for the big5rangedebug challenge.
Higher is better
So, we can reproduce the slowdown of the original big5 schedule using the big5rangedebug schedule. Removing either the desc_sort_timestamp or the composite-date_histogram-daily operation from the schedule makes the range query run faster again. The operations in the schedule are in the same order as in the original big5 schedule (although in the original schedule, there are operations between them).
The desc_sort_timestamp operation is a search query that sorts the documents by the @timestamp field in descending order. The composite-date_histogram-daily operation is an aggregation that groups the documents by day. Both of their exact definitions can be found in the benchmarking repository provided by Elastic.
A side note here is that the composite-date_histogram-daily operation is heavily cached after the first run. This won't really matter for us - what matters is that it is run at least once.
Let's go back to the single range query, in a variant with fewer iterations:
{
  "name": "range-only-short",
  "default": false,
  "schedule": [
    {
      "operation": "range",
      "iterations": 10
    }
  ]
}
If we run this schedule against the Elasticsearch/OpenSearch node without restarting it after the big5rangedebug schedule, we get similar results as the range query within the schedule:
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
Range - In big5rangedebug | 0.42 queries/s | 0.33 queries/s | -21.43% |
Range - After big5rangedebug | 0.45 queries/s | 0.33 queries/s | -26.67% |
Range - Fresh Node | 0.65 queries/s | 0.68 queries/s | +4.62% |
Throughput for the range query after the big5rangedebug schedule compared to a fresh node.
Higher is better
So, we can conclude that running the range query after the other operations (desc_sort_timestamp and composite-date_histogram-daily) is slower on both Elasticsearch and OpenSearch - but the difference is much larger in OpenSearch.
At this point, the obvious question is how first running the desc_sort_timestamp and composite-date_histogram-daily operations can slow down the range operation. I investigated whether this could be due to Garbage Collection pressure or caching issues, but could find nothing pointing in that direction.
There is, however, at least one other way that they can interact. All these queries make heavy use of Point Values in Lucene - they call a lot of similar methods. This could matter - due to the Java JIT compiler.
Before diving into the JIT, let's quickly recap what our benchmarking experiments have shown so far: the range query only slows down when it runs after the desc_sort_timestamp and composite-date_histogram-daily operations, the slowdown affects both engines, and it is much larger in OpenSearch.
Java uses Just-In-Time compilation - it first runs bytecode in an interpreted mode, and gradually compiles the bytecode to machine code as it is used. Part of this process is inlining - where the JIT compiler replaces a method call with the code from that method.
This is one of the main reasons why benchmarks on the JVM usually include a warmup phase - to give the JIT compiler input enough to make good inlining decisions.
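As a minimal illustration of this effect, here is a toy example of my own (not part of the benchmark). The first few timed batches typically run noticeably slower, before the JIT has compiled and inlined the hot code:
// Toy example: early batches run partly interpreted code, later batches run
// JIT-compiled (and inlined) code and are usually much faster.
public class WarmupDemo {
    static volatile long blackhole; // keeps the JIT from removing the work entirely

    static long work(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += Long.bitCount(v * 31 + 17);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = new long[1_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i * 2654435761L;
        }
        for (int batch = 0; batch < 10; batch++) {
            long start = System.nanoTime();
            blackhole += work(values);
            System.out.printf("batch %d: %.2f ms%n", batch, (System.nanoTime() - start) / 1e6);
        }
    }
}
A proper micro-benchmark would use a harness like JMH, which handles warmup and dead-code elimination for you - this is just to show the shape of the effect.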
Might it be that the way the JIT compiler inlines code could affect the benchmark results? I tried disabling inlining by setting the JVM flag -XX:-Inline and running the big5rangedebug challenge again:
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
desc_sort_timestamp | 0.55 queries/s | 0.56 queries/s | +1.82% |
composite-date_histogram-daily | 0.02 queries/s | 0.02 queries/s | 0 |
range | 0.26 queries/s | 0.27 queries/s | +3.85% |
Throughput values for the big5rangedebug challenge with inlining disabled.
Higher is better
Both Elasticsearch and OpenSearch run slower with inlining disabled - but the really interesting thing is what happens when only the range queries are run on fresh (restarted) nodes. Let's compare our three different timings of range queries:
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
Range - After big5rangedebug | 0.45 queries/s | 0.33 queries/s | -26.67% |
Range - Fresh Node | 0.65 queries/s | 0.68 queries/s | +4.62% |
Range - Fresh Node, no inlining | 0.28 queries/s | 0.29 queries/s | +3.57% |
Throughput for the range query comparing running it after big5rangedebug, with a fresh node and a fresh node without inlining.
Higher is better
Without inlining, the range queries run at roughly the same speed in Elasticsearch and OpenSearch. This is roughly the same speed as OpenSearch when run as part of the big5rangedebug schedule. Inlining seems to be the key to Elasticsearch being faster after the big5rangedebug schedule.
This is not caused by the different JDK versions used by Elasticsearch (21.0.1) and OpenSearch (17.0.8) - I tried running OpenSearch with the same JDK as Elasticsearch, and that did not change anything.
Let's try to understand what inlining is happening. For these experiments I decided to use the same JDK (openjdk-21.0.1) for both Elasticsearch and OpenSearch, to minimize any noise.
I ran the big5rangedebug schedule against Elasticsearch and OpenSearch with the JVM flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining and collected the output. It was about 130 thousand lines for both ES and OS. We need to narrow that down.
To understand which of these method inlinings might be of interest to us, we can have a look at how the range query is executed. I collected a CPU profile while benchmarking the range query. The Elasticsearch and OpenSearch profiles are almost identical, and the Lucene-level code is the interesting part, so I'll only describe the Elasticsearch profile.
Most of the CPU time is spent within the PointValues.intersect and BKDPointTree.addAll call trees. From what I can tell from the Lucene source code, the code recursively walks down an on-disk tree which acts as an index for numeric values. Once it gets to a leaf, it adds all of the leaf's document ids to a DocIdSet (usually a bit set of document ids).
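To make that call tree a bit more concrete, here is a heavily simplified, in-memory sketch of what the traversal conceptually does. This is my own illustration - the real code lives in org.apache.lucene.util.bkd.BKDReader, reads the tree from disk and handles many more cases:
import java.util.function.IntConsumer;

// A toy, in-memory stand-in for Lucene's BKD tree, just to illustrate the shape
// of the PointValues.intersect / addAll traversal. Not the real Lucene code.
final class ToyBkd {
    // Either an inner node (left/right set) or a leaf (docIds/values set).
    record Node(long min, long max, Node left, Node right, int[] docIds, long[] values) {
        boolean isLeaf() { return left == null; }
    }

    // Walk the tree for the range [lower, upper], pushing matching doc ids to `collect`.
    static void intersect(Node node, long lower, long upper, IntConsumer collect) {
        if (node.max() < lower || node.min() > upper) {
            return;                                   // subtree entirely outside the range
        }
        if (node.min() >= lower && node.max() <= upper) {
            addAll(node, collect);                    // subtree entirely inside: the addAll path
            return;
        }
        if (node.isLeaf()) {                          // range crosses the leaf: check each value
            for (int i = 0; i < node.docIds().length; i++) {
                long v = node.values()[i];
                if (v >= lower && v <= upper) {
                    collect.accept(node.docIds()[i]);
                }
            }
        } else {
            intersect(node.left(), lower, upper, collect);
            intersect(node.right(), lower, upper, collect);
        }
    }

    // Recursively collect every doc id below `node` - no value comparisons needed.
    static void addAll(Node node, IntConsumer collect) {
        if (node.isLeaf()) {
            for (int docId : node.docIds()) {
                collect.accept(docId);                // one "visit" per document
            }
        } else {
            addAll(node.left(), collect);
            addAll(node.right(), collect);
        }
    }
}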
By cross-referencing the CPU profile and the inlining output, I believe that I have found the inlining that we are looking for. The last inlining done of the BKDPointTree.addAll method correlates with the range query executing. This is the full output of that inlining in Elasticsearch. You don't have to read it all, but it's here for reference:
@ 76 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::addAll (98 bytes) inline (hot)
@ 5 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::size (163 bytes) too big
@ 20 org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$1::grow (13 bytes) low call site frequency
@ 20 org.apache.lucene.search.PointRangeQuery$1$1::grow (13 bytes) low call site frequency
\-> TypeProfile (5/19 counts) = org/apache/lucene/search/PointRangeQuery$1$1
\-> TypeProfile (14/19 counts) = org/apache/lucene/search/comparators/NumericComparator$NumericLeafComparator$1
@ 28 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::isLeafNode (17 bytes) inline (hot)
@ 39 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::getLeafBlockFP (40 bytes) inline (hot)
@ 42 org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl::seek (32 bytes) inline (hot)
\-> TypeProfile (29355/29355 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
@ 1 org.apache.lucene.store.MemorySegmentIndexInput::ensureOpen (14 bytes) inline (hot)
@ 12 java.util.Objects::checkIndex (7 bytes) force inline by annotation
@ 3 jdk.internal.util.Preconditions::checkIndex (22 bytes) (intrinsic)
@ 49 org.apache.lucene.store.DataInput::readVInt (114 bytes) inline (hot)
\-> TypeProfile (29332/29332 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
@ 1 org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes) inline (hot)
@ 11 java.lang.foreign.MemorySegment::get (13 bytes) force inline by annotation
\-> TypeProfile (23678/23678 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
@ 4 jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::accessHandle (26 bytes) force inline by annotation
@ 9 java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes) force inline by annotation
@ 3 java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes) force inline by annotation
@ 46 java.lang.invoke.VarForm::getMemberName (31 bytes) force inline by annotation
@ 49 java.lang.invoke.VarHandleSegmentAsBytes::get (47 bytes) force inline by annotation
@ 14 java.lang.invoke.VarHandleSegmentAsBytes::checkAddress (21 bytes) force inline by annotation
@ 1 java.util.Objects::requireNonNull (14 bytes) force inline by annotation
@ 15 jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes) force inline by annotation
@ 26 jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes) force inline by annotation
@ 16 jdk.internal.util.Preconditions::checkIndex (22 bytes) (intrinsic)
@ 24 jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes) accessor
@ 29 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes) inline (hot)
@ 40 java.lang.invoke.VarHandleSegmentAsBytes::offsetNoVMAlignCheck (39 bytes) force inline by annotation
@ 1 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes) accessor
@ 13 jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes) inline (hot)
@ 43 jdk.internal.misc.ScopedMemoryAccess::getByte (16 bytes) force inline by annotation
@ 4 jdk.internal.misc.ScopedMemoryAccess::getByteInternal (34 bytes) force inline by annotation
@ 5 jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes) force inline by annotation
@ 13 jdk.internal.misc.Unsafe::getByte (0 bytes) (intrinsic)
@ 19 java.lang.ref.Reference::reachabilityFence (1 bytes) force inline by annotation
@ 17 org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes) inline (hot)
@ 11 java.lang.foreign.MemorySegment::get (13 bytes) force inline by annotation
\-> TypeProfile (23678/23678 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
@ 4 jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::accessHandle (26 bytes) force inline by annotation
@ 9 java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes) force inline by annotation
@ 3 java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes) force inline by annotation
@ 46 java.lang.invoke.VarForm::getMemberName (31 bytes) force inline by annotation
@ 49 java.lang.invoke.VarHandleSegmentAsBytes::get (47 bytes) force inline by annotation
@ 14 java.lang.invoke.VarHandleSegmentAsBytes::checkAddress (21 bytes) force inline by annotation
@ 1 java.util.Objects::requireNonNull (14 bytes) force inline by annotation
@ 15 jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes) force inline by annotation
@ 26 jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes) force inline by annotation
@ 16 jdk.internal.util.Preconditions::checkIndex (22 bytes) (intrinsic)
@ 24 jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes) accessor
@ 29 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes) inline (hot)
@ 40 java.lang.invoke.VarHandleSegmentAsBytes::offsetNoVMAlignCheck (39 bytes) force inline by annotation
@ 1 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes) accessor
@ 13 jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes) inline (hot)
@ 43 jdk.internal.misc.ScopedMemoryAccess::getByte (16 bytes) force inline by annotation
@ 4 jdk.internal.misc.ScopedMemoryAccess::getByteInternal (34 bytes) force inline by annotation
@ 5 jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes) force inline by annotation
@ 13 jdk.internal.misc.Unsafe::getByte (0 bytes) (intrinsic)
@ 19 java.lang.ref.Reference::reachabilityFence (1 bytes) force inline by annotation
@ 38 org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes) too big
@ 59 org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes) too big
@ 63 org.apache.lucene.util.bkd.DocIdsWriter::readInts (140 bytes) inline (hot)
@ 1 org.apache.lucene.store.MemorySegmentIndexInput::readByte (134 bytes) inline (hot)
\-> TypeProfile (15569/15569 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
@ 11 java.lang.foreign.MemorySegment::get (13 bytes) force inline by annotation
\-> TypeProfile (23678/23678 counts) = jdk/internal/foreign/MappedMemorySegmentImpl
@ 4 jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::accessHandle (26 bytes) force inline by annotation
@ 9 java.lang.invoke.VarHandleGuards::guard_LJ_I (84 bytes) force inline by annotation
@ 3 java.lang.invoke.VarHandle::checkAccessModeThenIsDirect (29 bytes) force inline by annotation
@ 46 java.lang.invoke.VarForm::getMemberName (31 bytes) force inline by annotation
@ 49 java.lang.invoke.VarHandleSegmentAsBytes::get (47 bytes) force inline by annotation
@ 14 java.lang.invoke.VarHandleSegmentAsBytes::checkAddress (21 bytes) force inline by annotation
@ 1 java.util.Objects::requireNonNull (14 bytes) force inline by annotation
@ 15 jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes) force inline by annotation
@ 26 jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes) force inline by annotation
@ 16 jdk.internal.util.Preconditions::checkIndex (22 bytes) (intrinsic)
@ 24 jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes) accessor
@ 29 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes) inline (hot)
@ 40 java.lang.invoke.VarHandleSegmentAsBytes::offsetNoVMAlignCheck (39 bytes) force inline by annotation
@ 1 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes) accessor
@ 13 jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes) inline (hot)
@ 43 jdk.internal.misc.ScopedMemoryAccess::getByte (16 bytes) force inline by annotation
@ 4 jdk.internal.misc.ScopedMemoryAccess::getByteInternal (34 bytes) force inline by annotation
@ 5 jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes) force inline by annotation
@ 13 jdk.internal.misc.Unsafe::getByte (0 bytes) (intrinsic)
@ 19 java.lang.ref.Reference::reachabilityFence (1 bytes) force inline by annotation
@ 109 org.apache.lucene.util.bkd.DocIdsWriter::readInts32 (39 bytes) inline (hot)
@ 7 org.apache.lucene.store.MemorySegmentIndexInput::readInts (57 bytes) inline (hot)
@ 14 java.lang.foreign.MemorySegment::copy (29 bytes) force inline by annotation
@ 1 java.util.Objects::requireNonNull (14 bytes) force inline by annotation
@ 7 java.util.Objects::requireNonNull (14 bytes) force inline by annotation
@ 12 java.util.Objects::requireNonNull (14 bytes) force inline by annotation
@ 25 jdk.internal.foreign.AbstractMemorySegmentImpl::copy (238 bytes) force inline by annotation
@ 2 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 5 jdk.internal.foreign.AbstractMemorySegmentImpl::getBaseAndScale (189 bytes) inline (hot)
@ 3 java.lang.Object::equals (11 bytes) inline (hot)
@ 25 java.lang.Object::equals (11 bytes) inline (hot)
@ 47 java.lang.Object::equals (11 bytes) inline (hot)
@ 70 java.lang.Object::equals (11 bytes) inline (hot)
@ 12 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 15 java.lang.Class::componentType (16 bytes) inline (hot)
@ 1 java.lang.Class::isArray (0 bytes) (intrinsic)
@ 19 jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::carrier (5 bytes) accessor
@ 79 jdk.internal.foreign.Utils::checkElementAlignment (17 bytes) force inline by annotation
@ 1 jdk.internal.foreign.Utils::isElementAligned (48 bytes) force inline by annotation
@ 27 jdk.internal.foreign.layout.AbstractLayout::byteAlignment (5 bytes) accessor
@ 33 jdk.internal.foreign.layout.AbstractLayout::byteSize (5 bytes) accessor
@ 86 jdk.internal.foreign.AbstractMemorySegmentImpl::isAlignedForElement (12 bytes) force inline by annotation
@ 3 jdk.internal.foreign.layout.AbstractLayout::byteAlignment (5 bytes) accessor
@ 8 jdk.internal.foreign.AbstractMemorySegmentImpl::isAlignedForElement (26 bytes) force inline by annotation
@ 1 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes) accessor
@ 7 jdk.internal.foreign.NativeMemorySegmentImpl::maxAlignMask (2 bytes) inline (hot)
@ 113 jdk.internal.foreign.AbstractMemorySegmentImpl::checkAccess (30 bytes) force inline by annotation
@ 26 jdk.internal.foreign.AbstractMemorySegmentImpl::checkBounds (54 bytes) force inline by annotation
@ 16 jdk.internal.util.Preconditions::checkIndex (22 bytes) (intrinsic)
@ 122 java.lang.reflect.Array::getLength (0 bytes) (intrinsic)
@ 125 java.util.Objects::checkFromIndexSize (8 bytes) inline (hot)
@ 4 jdk.internal.util.Preconditions::checkFromIndexSize (25 bytes) inline (hot)
@ 137 jdk.internal.foreign.layout.ValueLayouts$AbstractValueLayout::order (5 bytes) accessor
@ 142 java.nio.ByteOrder::nativeOrder (4 bytes) inline (hot)
@ 148 jdk.internal.misc.ScopedMemoryAccess::getScopedMemoryAccess (4 bytes) inline (hot)
@ 153 jdk.internal.foreign.AbstractMemorySegmentImpl::sessionImpl (5 bytes) accessor
@ 159 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetBase (2 bytes) inline (hot)
@ 164 jdk.internal.foreign.NativeMemorySegmentImpl::unsafeGetOffset (5 bytes) accessor
@ 187 jdk.internal.misc.ScopedMemoryAccess::copyMemory (27 bytes) force inline by annotation
@ 12 jdk.internal.misc.ScopedMemoryAccess::copyMemoryInternal (56 bytes) force inline by annotation
@ 5 jdk.internal.foreign.MemorySessionImpl::checkValidStateRaw (33 bytes) force inline by annotation
@ 28 jdk.internal.misc.Unsafe::copyMemory (33 bytes) inline (hot)
@ 9 jdk.internal.misc.Unsafe::copyMemoryChecks (21 bytes) inline (hot)
@ 3 jdk.internal.misc.Unsafe::checkSize (32 bytes) inline (hot)
@ 9 jdk.internal.misc.Unsafe::checkPrimitivePointer (19 bytes) inline (hot)
@ 3 jdk.internal.misc.Unsafe::checkPointer (19 bytes) inline (hot)
@ 6 jdk.internal.misc.Unsafe::checkNativeAddress (28 bytes) inline (hot)
@ 17 jdk.internal.misc.Unsafe::checkPrimitivePointer (19 bytes) inline (hot)
@ 3 jdk.internal.misc.Unsafe::checkPointer (19 bytes) inline (hot)
@ 15 jdk.internal.misc.Unsafe::checkOffset (32 bytes) inline (hot)
@ 12 java.lang.Object::getClass (0 bytes) (intrinsic)
@ 15 jdk.internal.misc.Unsafe::checkPrimitiveArray (22 bytes) inline (hot)
@ 1 java.lang.Class::getComponentType (14 bytes) inline (hot)
@ 1 java.lang.Class::isArray (0 bytes) (intrinsic)
@ 10 java.lang.Class::isPrimitive (0 bytes) (intrinsic)
@ 29 jdk.internal.misc.Unsafe::copyMemory0 (0 bytes) (intrinsic)
@ 32 java.lang.ref.Reference::reachabilityFence (1 bytes) force inline by annotation
@ 36 java.lang.ref.Reference::reachabilityFence (1 bytes) force inline by annotation
@ 27 org.apache.lucene.search.PointRangeQuery$1$1::visit (9 bytes) inline (hot)
@ 27 org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$1::visit (21 bytes) inline (hot)
\-> TypeProfile (1182090/2633788 counts) = org/apache/lucene/search/comparators/NumericComparator$NumericLeafComparator$1
\-> TypeProfile (1451698/2633788 counts) = org/apache/lucene/search/PointRangeQuery$1$1
@ 17 org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes) inline (hot)
@ 17 org.apache.lucene.util.DocIdSetBuilder$BufferAdder::add (24 bytes) inline (hot)
\-> TypeProfile (51012/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$BufferAdder
\-> TypeProfile (87110/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
@ 5 org.apache.lucene.util.FixedBitSet::set (56 bytes) inline (hot)
@ 5 org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes) inline (hot)
\-> TypeProfile (1347015/1347015 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
@ 5 org.apache.lucene.util.FixedBitSet::set (56 bytes) inline (hot)
@ 70 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pushLeft (26 bytes) inline (hot)
@ 22 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::readNodeData (465 bytes) hot method too big
@ 76 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::addAll (98 bytes) recursive inlining is too deep
@ 80 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pop (21 bytes) inline (hot)
@ 84 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pushRight (87 bytes) inline (hot)
@ 56 org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl::seek (32 bytes) inline (hot)
\-> TypeProfile (8447/8447 counts) = org/apache/lucene/store/MemorySegmentIndexInput$SingleSegmentImpl
@ 1 org.apache.lucene.store.MemorySegmentIndexInput::ensureOpen (14 bytes) inline (hot)
@ 12 java.util.Objects::checkIndex (7 bytes) force inline by annotation
@ 3 jdk.internal.util.Preconditions::checkIndex (22 bytes) (intrinsic)
@ 83 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::readNodeData (465 bytes) hot method too big
@ 90 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::addAll (98 bytes) recursive inlining is too deep
@ 94 org.apache.lucene.util.bkd.BKDReader$BKDPointTree::pop (21 bytes) inline (hot)
For OpenSearch, the output is very similar. Let's focus on where they differ. In Elasticsearch, towards the bottom of the output within the DocIdsWriter::readInts32 method, we see this:
@ 27 org.apache.lucene.search.PointRangeQuery$1$1::visit (9 bytes) inline (hot)
@ 27 org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$1::visit (21 bytes) inline (hot)
\-> TypeProfile (1182090/2633788 counts) = org/apache/lucene/search/comparators/NumericComparator$NumericLeafComparator$1
\-> TypeProfile (1451698/2633788 counts) = org/apache/lucene/search/PointRangeQuery$1$1
@ 17 org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes) inline (hot)
@ 17 org.apache.lucene.util.DocIdSetBuilder$BufferAdder::add (24 bytes) inline (hot)
\-> TypeProfile (51012/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$BufferAdder
\-> TypeProfile (87110/138122 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
@ 5 org.apache.lucene.util.FixedBitSet::set (56 bytes) inline (hot)
@ 5 org.apache.lucene.util.DocIdSetBuilder$FixedBitSetAdder::add (9 bytes) inline (hot)
\-> TypeProfile (1347015/1347015 counts) = org/apache/lucene/util/DocIdSetBuilder$FixedBitSetAdder
@ 5 org.apache.lucene.util.FixedBitSet::set (56 bytes) inline (hot)
However, in OpenSearch we instead see this:
@ 27 org.apache.lucene.index.PointValues$IntersectVisitor::visit (0 bytes) virtual call
It seems like Elasticsearch inlines the PointValues$IntersectVisitor::visit calls, while OpenSearch instead performs a virtual call. What does this mean?
Higher up in the call stack, the BKDPointTree.addAll method takes a PointValues.IntersectVisitor as input:
public void addAll(
    PointValues.IntersectVisitor visitor,
    boolean grown
) throws IOException { ...
The full source of this method is available in the Lucene GitHub repository.
PointValues.IntersectVisitor is an interface. The visitor instance is then passed all the way down to the DocIdsWriter::readInts32 method:
private void readInts32(
    IndexInput in,
    int count,
    IntersectVisitor visitor
) throws IOException {
  in.readInts(scratch, 0, count);
  for (int i = 0; i < count; i++) {
    visitor.visit(scratch[i]);
  }
}
This is where the visit method is called. For the range query, I believe this happens a bit more than once per matched document (over all indices, we matched 486,989,570 documents).
In OpenSearch, the visit method gets called as a virtual call, while in Elasticsearch it is inlined as a direct call to either PointRangeQuery$1$1.visit or NumericComparator$NumericLeafComparator$1.visit. The reason for this has to do with how the JVM figures out which actual implementation of the visit method to call.
The JVM uses a few rules to determine how to inline polymorphic method calls - calls where more than one implementation of the method exists. Roughly: if the JIT has only ever observed one implementation at a call site (monomorphic), it can inline it directly; with two observed implementations (bimorphic), it can still inline both behind a type check; with three or more (megamorphic), HotSpot typically gives up on inlining and emits a virtual call.
For a more thorough discussion of this, I recommend this blog post written by Aleksey Shipilëv - an OpenJDK engineer. Specifically the section "Three types and beyond" gives some really good insights. The performance numbers from his micro-benchmarks also do seem to match what we have seen in our query benchmarks.
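To get a feel for these rules, here is a small, self-contained example of my own (not taken from the benchmark). With only the first two implementations exercised, the visit call site is at most bimorphic and the JIT can inline it; exercising the third implementation makes the call site megamorphic, and it typically falls back to a virtual call - roughly the situation we are seeing in Elasticsearch versus OpenSearch:
// Sketch of mono/bi/megamorphic call sites. Run with
// -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining (the same flags as above)
// to see how the visitor.visit(...) call site gets compiled.
interface Visitor {
    long visit(int docId);
}

final class A implements Visitor { public long visit(int d) { return d; } }
final class B implements Visitor { public long visit(int d) { return d + 1; } }
final class C implements Visitor { public long visit(int d) { return d + 2; } }

public class InliningDemo {
    static long drive(Visitor visitor, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += visitor.visit(i);    // the polymorphic call site of interest
        }
        return sum;
    }

    public static void main(String[] args) {
        long sum = 0;
        // With only A and B, the call site is at most bimorphic: the JIT can
        // inline both implementations behind a type check.
        for (int i = 0; i < 10_000; i++) {
            sum += drive(new A(), 1_000);
            sum += drive(new B(), 1_000);
        }
        // Uncommenting the loop below adds a third observed type, making the
        // call site megamorphic - it is then typically compiled as a virtual call.
        // for (int i = 0; i < 10_000; i++) {
        //     sum += drive(new C(), 1_000);
        // }
        System.out.println(sum);
    }
}
For careful measurements of the effect, use a harness like JMH, as Shipilëv's post does.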
So, why are different inlinings done in Elasticsearch than in OpenSearch?
In Elasticsearch, the IntersectVisitor.visit method is called on two implementations during the benchmark - NumericComparator$NumericLeafComparator$1 and PointRangeQuery$1$1 - so the call site stays bimorphic and can be inlined. In OpenSearch, the visit method is called on three implementations - NumericComparator$NumericLeafComparator$1, PointRangeQuery$1$1 and PointRangeQuery$1$2 (note the "2" at the end) - so the call site goes megamorphic.
NumericComparator$NumericLeafComparator$1 is used when sorting documents by the timestamp field in the desc_sort_timestamp operation. PointRangeQuery$1$1 and PointRangeQuery$1$2 are both used for range queries; Lucene uses different implementations depending on whether it thinks it is cheaper to find the matching documents, or to find the non-matching documents and then invert the result.
The range query in the range operation ends up using the inclusive implementation (PointRangeQuery$1$1) in both Elasticsearch and OpenSearch. However, the range query in composite-date_histogram-daily matches more documents, and thus ends up using the exclusive implementation (PointRangeQuery$1$2) in OpenSearch.
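Conceptually, the choice between the two looks something like the sketch below. This is my paraphrase of the idea described above, not Lucene's actual code - the real heuristic lives in PointRangeQuery and is more involved:
// Illustrative only - a paraphrase of the idea, not Lucene's actual heuristic.
public class VisitorChoice {
    // If most documents in a segment match the range, it can be cheaper to collect
    // the documents that do NOT match and invert the resulting bit set at the end.
    static boolean collectNonMatchingAndInvert(long estimatedMatches, int maxDocInSegment) {
        return estimatedMatches >= maxDocInSegment / 2L;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a narrow range -> the normal, "inclusive" visitor...
        System.out.println(collectNonMatchingAndInvert(280_000, 1_000_000)); // false
        // ...and a wide range -> the "exclusive", inverting visitor.
        System.out.println(collectNonMatchingAndInvert(700_000, 1_000_000)); // true
    }
}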
But why does Elasticsearch not end up using the exclusive implementation in our benchmark? The reason is that Elasticsearch is slightly smarter about optimising requests with size 0 where the range query and the histogram both use the same numeric field. It ends up not executing the query at all for the composite-date_histogram-daily operation (it instead applies the range restriction as part of the aggregation).
Can we make Elasticsearch use virtual calls to IntersectVisitor.visit when running the range query? Yes - by forcing the execution of the range query in the composite-date_histogram-daily operation in Elasticsearch by changing the request to ask for 1 (or more) documents to be returned:
{
  "size": 1,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2022-12-30T00:00:00",
        "lt": "2023-01-07T12:00:00"
      }
    }
  },
  "aggs": {
    "logs": {
      "composite": {
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "day"
              }
            }
          }
        ]
      }
    }
  }
}
The only difference from before is that the size parameter is 1 instead of 0. Let's create a copy of the big5rangedebug schedule, named big5rangedebug-histo-size-1, that uses this new version of the composite-date_histogram-daily operation:
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
desc_sort_timestamp | 0.91 queries/s | 0.87 queries/s | -4.40% |
composite-date_histogram-daily-size-1 | 0.04 queries/s | 0.04 queries/s | 0 |
range | 0.29 queries/s | 0.33 queries/s | +13.79% |
Throughput values for the big5rangedebug-histo-size-1 schedule.
Higher is better
As we suspected, adding the size parameter to the composite-date_histogram-daily operation resulted in the range query running slower in Elasticsearch, matching the behaviour of OpenSearch. This strengthens our hypothesis that the inlining is the root cause of the difference in range query performance in our local benchmarks.
Inlining in the JVM is not easy to understand. To recap what we have learnt in this section: during the big5 benchmark the JIT sees only two implementations of IntersectVisitor.visit in Elasticsearch and can inline the call, while OpenSearch exercises a third implementation and ends up with a virtual call - and forcing Elasticsearch onto the three-implementation path makes its range query just as slow.
So far we have done small experiments to understand the range query performance. We have however used very low iteration counts - I would not use this as a basis for any conclusions. Instead we will return to the original big5 benchmark.
However, this time we will first run the big5rangedebug-histo-size-1 schedule, in order to trigger the virtual call at the visit call site before the main benchmark run. From what we have learnt, we should expect the range query to run at about the same speed in Elasticsearch and OpenSearch. These are the relevant results from the big5 schedule (again in 90th percentile service times):
Operation | Elasticsearch | OpenSearch | Diff % |
---|---|---|---|
range | 3274.09 ms | 3502.62 ms | +6.98% |
keyword-in-range | 3292.79 ms | 3521.31 ms | +6.94% |
sort_numeric_desc | 1075.16 ms | 1016.13 ms | -5.49% |
sort_numeric_asc | 841.639 ms | 819.042 ms | -2.68% |
big5 schedule results after first running the big5rangedebug-histo-size-1 schedule once.
Lower is better
It seems like our predictions were correct. By first making the JIT compile the visit call site as a virtual call, the range query now ran at about the same speed in Elasticsearch and OpenSearch. I also tried a similar experiment with the full big5 schedule as warmup, and saw similar results.
Elasticsearch using one fewer implementation of the visit method than OpenSearch in the big5 benchmark is the only reason why Elasticsearch was faster at the range, keyword-in-range and sort_numeric_desc operations in this benchmark. This would likely not happen in production.
The sort_numeric_asc operation does not seem as obviously affected, though. It might be interesting to dig into why, but this post is already long enough.
After running the big5 schedule with our warmup, we have learnt that the large differences for range queries and numerical sorting disappear once both JVMs have seen the same set of visit implementations before the measurements start.
So far in this post, we have been comparing Elasticsearch and OpenSearch. Let's switch gears and instead see how we can use what we have learnt to improve Lucene - benefiting both Elasticsearch and OpenSearch.
Remember the readInts32 method in the DocIdsWriter class? Here it is, again:
private void readInts32(
    IndexInput in,
    int count,
    IntersectVisitor visitor
) throws IOException {
  in.readInts(scratch, 0, count);
  for (int i = 0; i < count; i++) {
    visitor.visit(scratch[i]);
  }
}
I made a small change to this method (and to the very similar readDelta16 method). The idea is to make fewer virtual calls by instead calling another version of the visit method, one which takes a DocIdSetIterator as input. This is the conceptual change:
private void readInts32(
    IndexInput in,
    int count,
    IntersectVisitor visitor
) throws IOException {
  in.readInts(scratch, 0, count);
  // One visit(DocIdSetIterator) call per leaf instead of `count` visit(int) calls.
  DocIdSetIterator disi = docIdSetIterator(scratch, count);
  visitor.visit(disi);
}
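In the conceptual change above, docIdSetIterator stands for a small helper that wraps the scratch array in a DocIdSetIterator. A minimal sketch of what such a wrapper could look like (illustrative only - the names are mine, and the actual change has a bit more boilerplate):
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Illustrative sketch: exposes the first `count` entries of an int[] as a DocIdSetIterator.
final class IntArrayDocIdSetIterator extends DocIdSetIterator {
    private final int[] docIds;
    private final int count;
    private int index = -1;
    private int doc = -1;

    IntArrayDocIdSetIterator(int[] docIds, int count) {
        this.docIds = docIds;
        this.count = count;
    }

    @Override
    public int docID() {
        return doc;
    }

    @Override
    public int nextDoc() throws IOException {
        index++;
        doc = index < count ? docIds[index] : NO_MORE_DOCS;
        return doc;
    }

    @Override
    public int advance(int target) throws IOException {
        // Fine for this use case: the visitor consumes the iterator sequentially.
        return slowAdvance(target);
    }

    @Override
    public long cost() {
        return count;
    }
}
The point is that the virtual dispatch now happens once per leaf instead of once per document id.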
Lucene uses 512 as the default maximum number of points to store in each leaf of the BKD tree:
// BKDConfig.java
public static final int DEFAULT_MAX_POINTS_IN_LEAF_NODE = 512;
This means that the count parameter will be at most 512. I debugged our benchmark to verify this, and it is indeed almost always 512. So we should be making roughly 512 times fewer virtual calls in this method - for the ~487 million matched documents, that is on the order of a million visit(DocIdSetIterator) calls instead of ~487 million visit(int) calls.
The full change required a bit more boilerplate; I forked Lucene and committed it to GitHub. The commit is on top of the main branch, but also applies cleanly to the 9.8.0 release, which Elasticsearch 8.11.0 uses.
I built Lucene 9.8.0 with my change applied and used that to once again run the big5 schedule. Again, I ran it after the big5rangedebug-histo-size-1 schedule, to trigger the virtual call at the visit call site.
Now we are comparing results between Elasticsearch and Elasticsearch with the Lucene change applied (I expect OpenSearch would see similar results). Some operations saw no significant change; these are the ones that did. As usual, in 90th percentile service times:
Operation | Elasticsearch | Elasticsearch with patch | Diff % |
---|---|---|---|
desc_sort_timestamp | 1707.76 ms | 1013.44 ms | -40.66% |
asc_sort_timestamp | 1814.87 ms | 1057.4 ms | -41.74% |
desc_sort_with_after_timestamp | 1493.22 ms | 889.041 ms | -40.46% |
asc_sort_with_after_timestamp | 1775.64 ms | 1036.73 ms | -41.61% |
desc_sort_timestamp_can_match_shortcut | 245.554 ms | 153.338 ms | -37.55% |
desc_sort_timestamp_no_can_match_shortcut | 245.001 ms | 155.004 ms | -36.73% |
asc_sort_timestamp_can_match_shortcut | 282.434 ms | 170.331 ms | -39.69% |
asc_sort_timestamp_no_can_match_shortcut | 286.081 ms | 182.341 ms | -36.26% |
range | 3274.09 ms | 1858.64 ms | -43.23% |
keyword-in-range | 3292.79 ms | 1789.29 ms | -45.66% |
query-string-on-message-filtered | 1280.72 ms | 1013.25 ms | -20.88% |
query-string-on-message-filtered-sorted-num | 459.897 ms | 328.508 ms | -28.57% |
sort_numeric_desc | 1075.16 ms | 797.731 ms | -25.80% |
sort_numeric_asc | 841.639 ms | 823.912 ms | -2.11% |
Comparing big5 schedule results with and without our change, after first running the big5rangedebug-histo-size-1 schedule once.
Lower is better
For many operations we see a response time reduction of around 40% - in other words, almost twice as fast. This is crazy. I did not quite trust it, so I re-ran the benchmark twice, also adding an extra big5 schedule as additional warmup. The results were consistent. The same also holds true for the 50th percentile.
It should be noted that the patch seems to bring the response times down to the same level as when running without the warmup - i.e. it seems to undo the performance degradation caused by the virtual calls (as opposed to the inlining that made Elasticsearch faster than OpenSearch). Of course, in real-life scenarios it is likely that the Elasticsearch (or OpenSearch) JVM has been warmed up enough for the virtual call to happen.
Now, this is only one benchmark - it might be that the patch has a larger impact on our data setup than it would elsewhere. On the other hand, it is very unlikely to be a performance degradation in general. I will create a Pull Request to Lucene with it and see what the maintainers think.
Edit 2024-03-024: Pull Request created.
We made a small change to Lucene - what has that taught us?
We have dug ourselves quite deep into a rabbit hole of JVM inlining - but we have gained some important insights along the way.
The range query execution time in the big5 benchmark was heavily affected by which operations had run before it since the last restart. This ended up favouring Elasticsearch in the big5 benchmark for very non-obvious reasons.
The original benchmark done by Elastic was run 50 times, with no mention of restarting the nodes between each run. I thus believe that their results are minimally (if at all) affected by this favouring issue - and I should be clear in saying that I don't think this was malicious behaviour by Elastic.
Running benchmarks on fresh nodes is better for reproducibility. But long-running nodes are probably a more realistic production scenario; in real-world production clusters of Elasticsearch and OpenSearch, I find it quite likely that BKDPointTree::addAll would have been called with at least three visitor implementations. Getting into that state should be part of the warm-up before running range or numerical sorting benchmarks.
The range query is probably as fast in OpenSearch as in Elasticsearch for realistic use. This is to be expected, since Lucene does the heavy lifting.
"Warming up" a JVM for a benchmark is generally seen as a way to speed up subsequent operations, without regard for that it might also slow down other operations.
The main purpose of benchmarks is to predict behaviour in production environments. Thus, rather than seeing the JVM warm-up as a way to speed up operations, it should be seen as a way to get the JVM into a state reasonably similar to what it would be in a production environment.
If you are building libraries, such as Lucene, this is hard to get right. If you are building applications, you can warm up the JVM by running representative workloads on the application that you are building. One way to bypass this issue is to instrument how your code behaves in production. Blunders.io can help with that.
We also saw that a very small change to the Lucene codebase could make some queries almost twice as fast. I find this phenomenon quite common - if you have a good hypothesis of what is taking time, tiny changes can often have as large an impact as fancy algorithms.
I hope that you have enjoyed this post. Blunders.io can help you bypass the inlining issues, or find what small changes can have a big performance impact for you, by profiling in production. If you want to learn more, or have any questions about this post, please reach out to me at anton@blunders.io.