Elasticsearch Benchmarking, Part 2: Outliers

Published 2023-12-04

This is part two in a series of posts where we look into the details of the search performance benchmarks from the blog post Elasticsearch vs. OpenSearch: Unraveling the performance gap. In this post we explore the disk usage of the indices and dig into one query that was much faster in Elasticsearch than in OpenSearch.

Elasticsearch is a search engine built on top of Lucene. OpenSearch was forked from Elasticsearch in 2021. I would like to thank Elastic for making so much of the benchmark setup available for reproduction.

tl;dr

In the benchmark setup used by Elastic to benchmark Elasticsearch vs. OpenSearch, Elasticsearch uses less disk because it is set up to not index positions or frequencies. This has the unexpected effect of making some scored queries about 490 times faster in Elasticsearch, since they are not scored. It also has the expected effect of making phrase queries much slower - about 1875x slower. We also see that the slightly newer version of Lucene (9.8.0) in Elasticsearch is responsible for significant speedups.

The Benchmark Setup

Elastic ran their tests in Kubernetes, but I'm running them on my desktop. I have both Elasticsearch version 8.11.0 and OpenSearch 2.11.0 set up on the same desktop. The desktop has an AMD Ryzen 5 5600X CPU, 32 GB of RAM and a 2 TB NVMe SSD. For all tests, I ran either Elasticsearch or OpenSearch - they never ran at the same time, and could thus not impact the performance of the other.

Indexing data

I started by indexing 1,740,763,348 documents/logs into both Elasticsearch and OpenSearch. This is less data than in the original benchmark - I was a bit constrained on disk space. It is, however, about the same number of documents per node - the original benchmark would have had about 1.8 billion documents per node (including replicas). For more information on the data and mappings, see the previous post in this series.

Number of documents: 1,740,763,348
Uncompressed size: ~1,500 GB
Compressed as chunks of 10 GB: ~189 GB
Elasticsearch disk usage: ~352 GB
OpenSearch disk usage: ~381 GB

For JSON data, compression makes a huge difference (as seen in the table above). Since both Elasticsearch and OpenSearch store the complete input JSON document, they clearly must be compressing it. This is in fact a feature built into Lucene - stored-field compression has been part of Lucene since version 4 - for more details I recommend this blog post.

The index configuration used for the benchmark sets the codec to best_compression, which corresponds to the HIGH_COMPRESSION mode in Lucene - this is equivalent to using gzip with the default compression level (6).
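
For reference, this is roughly what the setting looks like when creating an index by hand (a sketch - the benchmark applies it through its index template, and the index name here is illustrative):

# "logs-example" is a made-up index name
curl -XPUT 'localhost:9200/logs-example' \
 -H 'content-type: application/json' \
 -d '{"settings": {"index": {"codec": "best_compression"}}}'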

Disk Usage and File Types

OpenSearch uses more disk space than Elasticsearch. Why is that? It would be helpful to look at disk usage statistics for the different fields. Elasticsearch has an API for doing this, but OpenSearch does not. The functionality is not available in Lucene itself, possibly because it is a very slow operation (Lucene does not store fields separately - calculating field sizes essentially requires scanning all the stored data).
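
The Elasticsearch API in question is the analyze index disk usage API. Calling it against one of the benchmark indices looks roughly like this (run_expensive_tasks has to be set explicitly, since the analysis is slow):

curl -XPOST 'localhost:9200/.ds-logs-benchmark-dev-2023.11.30-000001/_disk_usage?run_expensive_tasks=true&pretty'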

File Types

Lucene does, however, store different kinds of data in different files. Using a small awk script I found on Stack Overflow, I summed up the size per file extension. The table below includes all file types found in any index for both Elasticsearch and OpenSearch, excluding files below 1 MB.
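
The script was roughly along these lines (a sketch of the idea rather than the exact script, run from the node's data directory):

# sum file sizes per extension and print the totals in GB, largest first
find . -type f -name '*.*' -printf '%s %f\n' \
 | awk '{ ext = $2; sub(/.*\./, "", ext); sum[ext] += $1 }
        END { for (e in sum) printf "%.2f GB\t.%s\n", sum[e] / 1024 / 1024 / 1024, e }' \
 | sort -rn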

Extension ES Disk Usage OS Disk Usage File purpose
.fdt 219.35 GB 219.25 GB Stored fields - in our case mainly the _source field
.doc 55.87 GB 63.94 GB Which documents contain which term
.dvd 37.42 GB 37.44 GB Per-document values (also known as doc_values)
.kdd 27.61 GB 27.62 GB Tree structure for numeric indexing
.pos - 20.16 GB The position of terms within documents
.tim 10.42 GB 10.05 GB Stores terms with pointers to other files (e.g. .doc)

The full documentation of these file extensions is available in the Lucene Javadoc.

Position & Frequency data

OpenSearch uses ~29 GB more disk space. This can be explained by its .pos files (which Elasticsearch does not have), taking up ~20 GB, plus ~8 GB more data in the .doc files. Why is this?

It turns out that the test data only has one text field ("message"). In the Elasticsearch setup for the benchmark, its type is set to match_only_text, while it is set to text in the OpenSearch setup. The match_only_text type is a newer Elasticsearch feature which disables storing term positions and frequencies. Positions are needed for queries where the position of a term within the document matters - e.g. the phrase query "New York" requires "York" to appear right after "New". Frequencies keep track of the number of times a term occurs in a document, for scoring purposes.
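
For reference, the relevant part of the two mappings looks roughly like this (a sketch showing only the message field):

Elasticsearch:

{ "properties": { "message": { "type": "match_only_text" } } }

OpenSearch:

{ "properties": { "message": { "type": "text" } } }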

Therefore, our Elasticsearch setup does not need the .pos files and can have smaller .doc files (which would contain frequencies).

One could argue that using a different field type in Elasticsearch than in OpenSearch makes the comparison unfair. I started out thinking it was fair - using a new feature, specifically designed for log data, to reduce disk space. However, after discovering the performance impact, I'm not so sure. Read on for details!

Search Benchmarks

We now have enough data to benchmark search performance. Since we are running both Elasticsearch and OpenSearch on the exact same hardware, comparing the results is meaningful.

With all the data indexed, I ran the es-rally "race" from the elasticsearch-opensearch-benchmark repository (ES Rally is the benchmarking tool used by Elastic). I first ran it against a freshly started Elasticsearch node, then shut that node down and ran the same race against OpenSearch.
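
Running such a race against an already-running cluster looks roughly as follows (a sketch - the track definition and exact parameters come from the benchmark repository, and the track path here is made up):

esrally race --pipeline=benchmark-only \
 --target-hosts=localhost:9200 \
 --track-path=./track  # illustrative path to the track from the benchmark repository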

The full results are interesting, but you will have to wait for the next post in this series to see them. You see, I found an outlier query that was 198 times slower in OpenSearch than in Elasticsearch. This is intriguing - let's find out why.

The Outlier

This is the /_search request body for which the response time was far better in Elasticsearch than in OpenSearch:

{
  "query": {
    "query_string": {
      "query": "message: monkey jackal bear"
    }
  }
}

The rally benchmarks reported the 90th percentile of the response time to be 6.49ms for Elasticsearch and 1288.53ms for OpenSearch.

Here, I switched tools to ApacheBench (ab) - a simpler and more widely used tool than Rally. I used it to measure throughput for this query:

ab -l -t 120 -n 20000000 -c 8 \
 -T 'application/json' \
 -p monkey_jackal_bear_or.json \
 localhost:9200/lucene-benchmark-dev/_search

The monkey_jackal_bear_or.json file contains the request body from above. We tell ab to run for 120 seconds (-t 120) with 8 concurrent requests (-c 8). We also need to tell it to allow up to 20 million requests (-n 20000000) - otherwise it would stop after 50,000 requests. We will use this same benchmark command (with different input data) throughout this post. All but one of the benchmarks pushed the CPU close to 100% - i.e. we are CPU-bound.

The total throughput for Elasticsearch was 1453.97 requests per second, while OpenSearch managed about 1.2 requests per second. That is, Elasticsearch was about 1200 times faster.

A Query Bug

However, there is a bug in the query. The query is named query-string-on-message in the benchmark, indicating that it is meant to search the message field. In Elasticsearch it does - we can verify this using the validate API with explain=true:

> curl -s -XGET 'localhost:9200/.ds-logs-benchmark-dev-2023.11.30-000001/_validate/query?explain=true' \
       -d '{"query":{"query_string":{"query":"message: monkey jackal bear"}}}'\
   -H 'content-type: application/json' | jq .
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": ".ds-logs-benchmark-dev-2023.11.30-000001",
      "valid": true,
      "explanation": "ConstantScore(message:monkey) ConstantScore(message:jackal) ConstantScore(message:bear)"
    }
  ]
}

This looks reasonable: we search for the three different terms in the message field. The reason Elasticsearch wraps the term queries in ConstantScore queries is that the field does not have frequencies (because of the match_only_text type discussed above).

In OpenSearch, things look a bit different:

curl -s -XGET 'localhost:9200/.ds-logs-benchmark-dev-000001/_validate/query?explain=true'\
 -d '{"query":{"query_string":{"query":"message: monkey jackal bear"}}}'\
 -H 'content-type: application/json' |  jq  .
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": ".ds-logs-benchmark-dev-000001",
      "valid": true,
      "explanation": "message:monkey (MatchNoDocsQuery(\"failed [metrics.tmax] query, caused by number_format_exception:[For input string: \"jackal bear\"]\") | MatchNoDocsQuery(\"failed [metrics.tmax] query, caused by number_format_exception:[For input string: \"jackal bear\"]\") | MatchNoDocsQuery(\"failed [metrics.tmax] query, caused by number_format_exception:[For input string: \"jackal bear\"]\") | MatchNoDocsQuery(\"failed [metrics.tmax] query, caused by number_format_exception:[For input string: \"jackal bear\"]\") | MatchNoDocsQuery(\"failed [metrics.tmax] query, caused by number_format_exception:[For input string: \"jackal bear\"]\") | aws.cloudwatch.ingestion_time:jackal bear | (data_stream.type:jackal data_stream.type:bear) | ecs.version:jackal bear | event.dataset:jackal bear | (data_stream.dataset:jackal data_stream.dataset:bear) | data_stream.type.keyword:jackal bear | tags:jackal bear | data_stream.dataset.keyword:jackal bear | (message:jackal message:bear) | process.name:jackal bear | host.name.keyword:jackal bear | cloud.region:jackal bear | agent.version:jackal bear | agent.id:jackal bear | log.file.path:jackal bear | input.type:jackal bear | agent.ephemeral_id:jackal bear | agent.type:jackal bear | event.id:jackal bear | data_stream.namespace.keyword:jackal bear | meta.file:jackal bear | aws.cloudwatch.log_group:jackal bear | (host.name:jackal host.name:bear) | agent.name:jackal bear | (data_stream.namespace:jackal data_stream.namespace:bear) | aws.cloudwatch.log_stream:jackal bear)"
    }
  ]
}

What's happening here is that OpenSearch is searching in all fields. The query string message: monkey jackal bear is parsed as (message: monkey) OR (jackal bear), since the default operator is OR and the remaining terms go to the default field - which in OpenSearch is *, meaning any field. Thus it searches for "monkey" in the message field and for jackal bear in all fields. For the message field, this is split into "jackal" OR "bear", but for other fields it tries to search for "jackal bear" as a single term (and sometimes succeeds).

The reason that Elasticsearch does not do this is that Elasticsearch supports setting the default field in the index settings (index.query.default_field), and the benchmark setup sets it to ["message"].
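
index.query.default_field is an ordinary dynamic index setting, so it can also be applied directly, roughly like this (a sketch - the benchmark sets it through its index template):

curl -XPUT 'localhost:9200/.ds-logs-benchmark-dev-2023.11.30-000001/_settings' \
 -H 'content-type: application/json' \
 -d '{"index": {"query": {"default_field": ["message"]}}}'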

Fixing the Bug

We can modify the query to search only in the message field using our friends, the parentheses:

{
  "query": {
    "query_string": {
      "query": "message: (monkey jackal bear)"
    }
  }
}

When running this query through the same benchmark as above, Elasticsearch stays fast at 1545.03 requests per second. OpenSearch, however, sped up to 3.15 requests per second. Still about 490 times slower than Elasticsearch - but 2.6 times faster than before the fix. This is notable since, as far as I can tell, there are no matches for jackal bear outside of the message field.

Understanding why Elasticsearch is faster

So, why is it so much slower in OpenSearch? First, let's have a look at profiling data from Elasticsearch and OpenSearch. Both samples are from running the query above in the throughput benchmark.

First up, Elasticsearch:

 

Next up, OpenSearch (you might need to scroll a bit further in this one):

 

The first thing to note is that both samples spend almost all of their time inside methods named score. In Lucene, the scoring process does both document matching and scoring - e.g. for our OR query it is responsible for combining the matching documents from each sub-query.

The second thing to note is that almost all of the CPU time is spent in Lucene-level code. This is good - Lucene is supposed to be the bottleneck, since it does all the computationally hard work. But why is it so much slower in OpenSearch?

I attached debuggers to both Elasticsearch and OpenSearch while running queries (outside of the benchmarking) to understand more. After a lot of poking around, I found that the BooleanWeight class (which lives in the Lucene codebase) behaved differently: the version used by Elasticsearch returned a bulk scorer, while the one used by OpenSearch did not (bulk scoring was introduced in Lucene 5 and speeds up scoring by processing documents in bulk).

Lucene 9.8 improvements

Elasticsearch 8.11 uses Lucene 9.8.0, while OpenSearch 2.11 uses Lucene 9.7.0 - so the bulk scorer changes must have happened between those two releases. I did a bunch of git blame:ing (it should really be called git praise) before I remembered that release notes are a thing. And indeed, the 9.8.0 release notes say:

Faster computation of top-k hits on boolean queries. Lucene's nightly benchmarks report a 20%-30% speedup for disjunctive queries and a 11%-13% speedup for conjunctive queries since Lucene 9.7

This is great!

The Unreasonable Impact of Not Having Frequency data

But those release notes talk about a 20-30% improvement. We saw OpenSearch being about 490 times slower (after fixing the field bug). That is not the same as 30%. What is going on?

It turns out that the real performance gain is related to finding the "top-k hits". By default, Elasticsearch, OpenSearch and Lucene return hits ordered by their score - how well they matched the query. This is a basic premise of information retrieval - the most relevant documents should be returned first - and it is especially important for our monkey jackal bear query, since it matches 149,210,239 documents in our example dataset.
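
The match count is easy to verify with the _count API, reusing the (fixed) query body:

curl -s -XGET 'localhost:9200/lucene-benchmark-dev/_count' \
 -H 'content-type: application/json' \
 -d '{"query":{"query_string":{"query":"message: (monkey jackal bear)"}}}'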

Remember that ConstantScore part of the Elasticsearch query above? That disables scoring. Maybe this affects things? Let's try to force scoring in Elasticsearch by using a bool query instead of the query string query:

{
  "query":{
    "bool":{
      "should": [
        {"term": {"message":{"value":"monkey"}}},
        {"term": {"message":{"value":"jackal"}}},
        {"term": {"message":{"value":"bear"}}}
      ]
    }
  }
}

Elasticsearch managed 1616 requests per second, while OpenSearch did roughly 3.16. Structuring the query this way did not make any real difference. This is actually expected - since there are no term frequencies in the Elasticsearch index, each term gets a score of 1.0 anyway. In other words, it behaves as if it were a constant_score query.

Making OpenSearch Not Use Frequencies

So, we can't make Elasticsearch score the documents - it does not have the data. But we can make OpenSearch skip scoring the terms, by wrapping each term in a constant_score query:

{
  "query":{
    "bool":{
      "should": [
        {"constant_score": {"filter":{"term": {"message":{"value":"monkey"}}}}},
        {"constant_score": {"filter":{"term": {"message":{"value":"jackal"}}}}},
        {"constant_score": {"filter":{"term": {"message":{"value":"bear"}}}}}
      ]
    }
  }
}

Now, that made OpenSearch 281x faster! OpenSearch now runs at 890 requests per second, while Elasticsearch got up to 1618.

Explaining the 2x Difference

But why is Elasticsearch still 2x faster? It turns out that both Lucene 9.7.0 and 9.8.0 have mechanisms for skipping the scoring of documents that can't make it into the top hits. In 9.7.0 this happens on a per-document basis. It seems like most of the improvement in Lucene 9.8 comes from moving this functionality to a windowed approach. From what I understand, each window can report a maximum score - if that score is so low that the window can't have any impact on the top hits returned, the whole window can be skipped.

This is smart - if you want to dig deeper, the code is in MaxScoreBulkScorer.java in Lucene. I do believe that these improvements, between Lucene 9.7 and 9.8, are the reason for the 2x speedup. That is higher than the 20-30% mentioned in the release notes - but the effect is likely larger when many documents share the exact same score. I attached a debugger and noted that in one segment the scorer seemingly did not have to look at any windows after it had examined 129,100 documents (out of 197,500,201).

One interesting thing I learned was that the CancelableBulkScorer in Elasticsearch probably limits the effectiveness of the MaxScoreBulkScorer, since it divides scoring into quite small batches - sometimes the size of a single inner window of the MaxScoreBulkScorer. It would probably be faster with larger batches.

So, it is likely that the 2x speedup is due to the improvements in Lucene 9.8.0. OpenSearch will presumably get them in the next OpenSearch release, since their main branch uses Lucene 9.8.0.

What are we actually benchmarking?

Now, I would like to take a step back - what use case are we actually benchmarking? We are searching for documents matching any of three terms ("monkey", "jackal" or "bear"). We then sort them by how well they match, but since we don't have real scores, that more or less translates to "the first documents that contained all three terms".

Remember that we are supposed to be searching logs. With logs you don't generally sort by best match - you are probably sorting by timestamp. Let's go back to the query string query (with the field fix) and sort by the @timestamp field:

{
  "query": {
    "query_string": {
      "query": "message: (monkey jackal bear)"
    }
  },
  "sort": [
    {
      "@timestamp": "desc"
    }
  ]
}

Here, Elasticsearch gets to about 12.91 requests per second, while OpenSearch manages about 12.49. This is a much smaller difference. I did have a look for performance improvements that could affect this, and found PR #12475 in Lucene (which made it into 9.8.0).

Here I'll stop digging into this particular query. We have gone from a quite unexpected 160000% difference to a difference of less than 5%, explained by the index setup and by Lucene improvements that will be available in the next OpenSearch release.

Phrase Queries Become 1875x Slower

The large difference that we started with is explained by seemingly unintended behaviour stemming from using the match_only_text field type (as well as a bug in the query). But what about intended effects?

Let's have a look at this little query:

{
  "query": {
    "query_string": {
      "query": "message: \"monkey jackal\""
    }
  }
}

Here we search for the phrase "monkey jackal", i.e. the word "monkey" followed by the word "jackal". Our OpenSearch setup managed 7.50 requests per second.

For Elasticsearch, the benchmark timed out, as ab has a default timeout of 30 seconds. I gradually increased the timeout and adjusted the number of concurrent requests, until I finally got one request to finish in 247 seconds. That is just above 4 minutes, or about 0.004 requests per second.
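
For reference, a single-request measurement with a longer timeout looks roughly like this (a sketch - the -s flag raises ab's per-request timeout from the default 30 seconds, and the file name is made up):

# monkey_jackal_phrase.json would contain the phrase query body above
ab -l -s 600 -n 1 -c 1 \
 -T 'application/json' \
 -p monkey_jackal_phrase.json \
 localhost:9200/lucene-benchmark-dev/_search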

Since this query led to 100% utilization of one of the machine's resources (disk reads), we can compare this throughput to that of the OpenSearch queries. The OpenSearch setup was 1875 times faster.

Explaining the Slowdown

Now, to be fair, the Elasticsearch documentation does say that a match_only_text field

performs slower on queries that need positions.

The reason is that Elasticsearch has to read, decompress and parse all documents containing "monkey" and "jackal", and then look at the source field to check whether the words actually appear next to each other. This is what the profiling data for Elasticsearch looks like when executing this query:

 

Even though we are not CPU-bound, we can still use the CPU breakdown to see what the code is doing. It is mostly reading compressed JSON data from the source field.

Elasticsearch does pre-filter phrase queries by only looking at documents that contain all of the terms. For example, if we instead search for the phrase "monkey jackal bear", the Elasticsearch setup manages 0.09 requests per second (OpenSearch did 7.36).

Is match_only_text Good for Logs?

The documentation also says, regarding match_only_text :

It is best suited for indexing log messages.

Which is presumably why the benchmark uses it. I, however, do not agree with this statement at all.

In my experience, when searching logs it is very common to search for phrases - e.g. "user logged in" or "request completed". It is also common to search logs through Kibana (or OpenSearch Dashboards), where the query parser allows users to enter phrase queries however they want. I would not be surprised if this sometimes causes Elasticsearch clusters to become unresponsive.

I think that match_only_text fields should only be used if you really know what you are doing, and probably only if you control exactly which queries are being run.

What Does the Original Post From Elastic Say?

The original blog post has a section called "Text querying" which contains charts for a "Simple Query" - which, according to the charts.py file in the Elastic benchmarking repository, is based on the original message: monkey jackal bear query, with some filtered variants. They do not really try to understand what makes it faster in Elasticsearch.

In the paragraph below, they say:

Text querying is foundational and crucial for full-text search, which is the primary feature of Elasticsearch. Text field queries allow users to search for specific phrases, individual words, or even parts of words in the text data.

This mentions phrase queries as a capability. But, as we have seen in this post, phrase queries in the benchmarked Elasticsearch setup are very slow. They later say:

Elasticsearch was still 13% more space efficient

Without mentioning that this is entirely due to the lack of positional data, which makes (fairly common) phrase queries slow. I think it would be fair to highlight that the 13% comes at the cost of slow phrase queries, e.g. by including phrase queries in the benchmark.

Takeaways

There are a few takeaways from this post:

  • Most of the CPU usage at search time is spent in Lucene. This is true for both Elasticsearch and OpenSearch.
  • Exactly how you index your data severely impacts the performance of your queries. Small changes can have a large impact. For the queries we looked at in this post, index setup mattered a lot more than whether you used Elasticsearch or OpenSearch.
  • Most of the differences not explained by the index setup were due to improvements in Lucene 9.8.0, which OpenSearch will get soon.
  • The match_only_text field type should be used with great care. It can save you disk space, but also severely limits what you can safely query. I would not recommend it for log clusters where many people (e.g. developers) can write their own phrase queries.
  • Using a profiler can help you understand what your cluster is doing. Using this knowledge, you can improve query performance by changing mappings or query structure. With blunders.io, you can do this in production without any noticeable performance impact.

This became quite a long post. I hope you have found it interesting. If you have any questions, feel free to reach out to me at anton@blunders.io.