This is the first post in a series where I intend to explore some benchmarks comparing Elasticsearch and OpenSearch.
In this first post, we don't even make it to the actual benchmarking - we get stuck at data ingestion.
tl;dr
It turns out that, when Elasticsearch (and probably OpenSearch) indexes (generated but realistic) log data, ~10% of CPU time and ~30% of memory allocations go to parsing date strings. This is for the data input and index template used in benchmarks published by Elasticsearch.
I believe that this applies to real-world data as well, especially logs.
The cause is that the date parsers in java.time are slow. We show that the date parsing can be done at least ~50x faster, which improves indexing speed by ~10% (assuming CPU-bound indexing). Read on for details.
The data used for the benchmark can be generated by following the instructions in the repository above. A corpus generator tool is used to generate data, using templates from the benchmarking repository. The data looks a lot like log data
from some cloud provider, although with some lorem ipsum in the message field. I believe that
this is a good benchmark - logs are a very common use case for both Elasticsearch and OpenSearch. Here is a
sample:
{"@timestamp":"2023-01-01T05:09:40.000Z","aws.cloudwatch":{"log_stream":"tulipcoyote","ingestion_time":"2023-11-04T21:57:58.605Z","log_group":"/var/log/messages"},"cloud":{"region":"us-east-2"},"log.file.path":"/var/log/messages/tulipcoyote","input":{"type":"aws-cloudwatch"},"data_stream":{"namespace":"default","type":"logs","dataset":"generic"},"process":{"name":"systemd"},"message":"2023-11-04T21:57:58.605Z Nov 04 21:57:58 ip-88-190-95-155 systemd: snake biter fang spur moose ox warlock crusher zebra heron bat face glass rover quilldutchess","event":{"id":"bevelchest","ingested":"2023-11-04T21:38:47.605716183Z","dataset":"generic"},"host":{"name":"longlegs"},"metrics":{"size":3091,"tmin":1},"agent":{"id":"baac7358-a449-4c36-bf0f-befb211f1d38","name":"longlegs","type":"filebeat","version":"8.8.0","ephemeral_id":"baac7358-a449-4c36-bf0f-befb211f1d38"},"tags":["preserve_original_event"]}
I generated about 50GB of this data, which I used to test indexing.
Hardware & Software setup
I ran the benchmark on my desktop computer. It sports an AMD Ryzen 5 5600X CPU, 32 GB of RAM and a 2 TB NVMe
SSD,
which was used for testing. The benchmark is thus not fully representative of a production setup; it is only meant to compare relative performance on the same machine. In my experience of running Elasticsearch, results from local testing usually carry over to production setups, assuming a large enough local dataset (usually a few tens of GB) and similar CPU/storage characteristics (SSDs will differ from HDDs).
I ran Elasticsearch 8.11.0, which is the latest version published at the time of writing. I gave it 8 GB of
memory (-Xmx8G), which would be low for a production setup, but turned out to be plenty. It is important to note
that, since Elasticsearch uses the G1 Garbage Collector, having larger heaps would make GC pauses longer.
I applied the index templates and data stream config from the benchmarking repository. I then started indexing using a small Go programme based on the bulk indexing example in the Go Elasticsearch client. This turned out to be a joy to work with - the bulk clients have gotten a
lot better with time.
Indexing & Profiling
While indexing, the Elasticsearch Java process used about 70% of the available CPU, while the other indexing
processes used ~20% for a total utilisation of ~90%. This tells us that the indexing, with this dataset, this
index template and this hardware, is CPU bound.
I discovered this by running the Blunders.io profiler on the Elasticsearch process. The profiler is based on
Java Flight Recorder, which has a very small overhead. The profile data also lets us see that we have plenty of
heap to spare, but that the memory allocation rate is quite high at ~3 GB/s. The Garbage Collector (G1) does a
good job, but has to work hard. The vertical lines in the chart below are Garbage Collections.
But what is Elasticsearch using this CPU and allocated memory for? This is where things get
interesting.
Drilling Down
The following Flame Chart (or Icicle Chart, if you prefer) shows how CPU time was spent in
different methods and their dependents.
[Flame chart: CPU time by method during indexing]
The most interesting part is in IndexShard.applyIndexOperation and methods called by it. Roughly 50% of the total CPU
time is used
by Lucene to build its inverted index - this is in the IndexWriter.addDocument tree. What surprised me is that ~35% of the CPU time - almost as much as indexing - was spent on parsing JSON.
What surprised me more is that roughly 10% of the total CPU time was spent parsing date
strings.
There are two different code paths for this, since there are two date fields in the input data with different
levels of nesting. Both of them end up in DateTimeFormatter$ClassicFormat.parseObject, which is a JVM standard library
method for parsing dates.
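For reference, DateTimeFormatter$ClassicFormat is the java.text.Format adapter returned by DateTimeFormatter.toFormat(). A minimal example of hitting that same entry point directly - using a stock ISO formatter rather than Elasticsearch's own strict_date_optional_time definition - could look like this:

import java.text.Format;
import java.text.ParseException;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

public class ClassicFormatExample {
    public static void main(String[] args) throws ParseException {
        // toFormat() wraps the formatter in DateTimeFormatter$ClassicFormat,
        // whose parseObject() is the method showing up in the profile.
        Format format = DateTimeFormatter.ISO_DATE_TIME.toFormat();
        TemporalAccessor parsed = (TemporalAccessor) format.parseObject("2023-01-01T23:38:34.000Z");
        System.out.println(parsed);
    }
}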
Memory allocation
Let's have a look at the memory allocations done by the same code. Here we are looking at the amount of memory
allocated, not the number of allocations.
[Flame chart: memory allocated by method during indexing]
Here we can see the same split between Lucene and JSON parsing. However, Lucene is only responsible for ~8% of
the allocated memory. This speaks to the large amount of effort that has been put into making Lucene memory efficient (especially at indexing).
Instead, ~70% of all memory is allocated when parsing JSON. ~32% of the total allocated memory is allocated to
parse date strings.
What is the date parser doing?
Let's dig in and understand a bit more why date parsing takes so much time and memory.
Elasticsearch has a date field type, which supports multiple date formats. The index template used in the benchmark does not specify a format, which means that Elasticsearch falls back to the default of strict_date_optional_time||epoch_millis. The || in the middle means that Elasticsearch will first try parsing the date with the first format and, if that fails, try the next.
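As a rough illustration - not Elasticsearch's actual code - the fallback amounts to something like the sketch below, with Instant.parse standing in for the much more lenient strict_date_optional_time parser:

import java.time.Instant;
import java.time.format.DateTimeParseException;

// Illustrative sketch of a "strict_date_optional_time||epoch_millis"-style
// fallback: try the ISO-style parser first and, if that fails, treat the
// string as milliseconds since the epoch. Instant.parse is only a stand-in;
// the real strict_date_optional_time parser accepts many more variants.
public class FormatFallbackSketch {
    static Instant parse(String value) {
        try {
            return Instant.parse(value);
        } catch (DateTimeParseException e) {
            return Instant.ofEpochMilli(Long.parseLong(value));
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2023-01-01T23:38:34.000Z")); // first format matches
        System.out.println(parse("1699135078605"));            // falls back to epoch_millis
    }
}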
The dates in the benchmark data are ISO formatted (well formed logs, nice!), which means that
the parser being used is strict_date_optional_time, defined in DateFormatters.java.
Isolating the code
I extracted this code into a separate repository for demonstration and benchmarking purposes. The code is
available at antonha/es-date-parse-bench-nov-2023.
This is what the CPU Flame Graph looks like if we only parse dates in a loop, using the same date parser:
[Flame graph: CPU time when parsing dates in a loop]
This looks the same as the profiling data above, just more zoomed in. We can see that a lot of the time is spent in DateTimeFormatterBuilder$CompositePrinterParser, copying the Parsed object. This is done when the CompositePrinterParser parses optional parts of the text - presumably so that it can revert if the optional part is not there. The strict_date_optional_time format has 8 optional parts, with a max optional "depth" of 4 (which corresponds to the depth in the Flame Graph).
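To see where those optional parts come from, here is a minimal, illustrative DateTimeFormatterBuilder example - not the actual strict_date_optional_time definition from DateFormatters.java. Each optionalStart()/optionalEnd() pair becomes a nested CompositePrinterParser, and entering an optional section copies the intermediate Parsed state so that it can be discarded if the section does not match:

import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

public class OptionalSectionsSketch {
    // Year, optionally followed by a month, optionally followed by a day.
    static final DateTimeFormatter YEAR_OPTIONAL_MONTH_DAY = new DateTimeFormatterBuilder()
            .appendValue(ChronoField.YEAR, 4)
            .optionalStart()                      // optional month...
            .appendLiteral('-')
            .appendValue(ChronoField.MONTH_OF_YEAR, 2)
            .optionalStart()                      // ...with a nested optional day
            .appendLiteral('-')
            .appendValue(ChronoField.DAY_OF_MONTH, 2)
            .optionalEnd()
            .optionalEnd()
            .toFormatter();

    public static void main(String[] args) {
        // All three parse successfully; each optional section that is present
        // adds another round of Parsed copying during parsing.
        System.out.println(YEAR_OPTIONAL_MONTH_DAY.parse("2023"));
        System.out.println(YEAR_OPTIONAL_MONTH_DAY.parse("2023-01"));
        System.out.println(YEAR_OPTIONAL_MONTH_DAY.parse("2023-01-01"));
    }
}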
Can we do better?
It seemed to me like this could be done faster. In order to benchmark some alternatives, I set up a JMH
benchmark in the previously mentioned antonha/es-date-parse-bench-nov-2023. JMH is a benchmarking tool provided by the OpenJDK project.
I benchmarked parsing a few different ISO 8601-formatted date-time strings, all with UTC time zone
(e.g. 2023-01-01T23:38:34.000Z).
This is probably the most common format for logs, and worth optimising for.
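The benchmark skeleton looks roughly like the sketch below (illustrative - the actual Benchmark.java in the repository covers all the parser variants, not just Instant.parse):

import java.time.Instant;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class DateParseBenchSketch {

    // The same three date strings as in the results below.
    @Param({"2023-01-01T23:38:34.000Z", "1970-01-01T00:16:12.675Z", "5050-01-01T12:02:01.123Z"})
    public String dateString;

    @Benchmark
    public Instant benchInstantParse() {
        // Returning the result prevents the JIT from eliminating the work.
        return Instant.parse(dateString);
    }
}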
The strict_date_optional_time parser ran at about 1100ns per parsed date.
I tried using the strict_date_time parser, which has none of the optional
stuff and thus might be faster. It was faster, but only about 2x at ~500ns per parsed date. This is just
slightly slower than Instant.parse, a standard library method that should do
the same thing.
I wanted to see if a regex could be faster, so I implemented a very basic variant. It clocked in at about 260ns per parsed date.
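The idea behind the regex variant is simple; here is a minimal sketch of the approach (not the exact code from the repository, and it only handles the UTC 'Z' form):

import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDateParserSketch {
    // Capture groups: year, month, day, hour, minute, second, optional fraction.
    private static final Pattern ISO_UTC = Pattern.compile(
            "(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2})(?:\\.(\\d{1,9}))?Z");

    static Instant parse(String text) {
        Matcher m = ISO_UTC.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported date format: " + text);
        }
        // Right-pad the fraction to 9 digits to turn it into nanoseconds.
        int nanos = m.group(7) == null
                ? 0
                : Integer.parseInt((m.group(7) + "00000000").substring(0, 9));
        return OffsetDateTime.of(
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)), Integer.parseInt(m.group(6)),
                nanos, ZoneOffset.UTC).toInstant();
    }
}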
I then decided to write a straightforward parser, essentially just a bunch of String.charAt calls with if-statements. The parser became gradually less simple as I handled more error states; it ended up at roughly 200 lines of code. And it runs at about 22ns per parsed date, or about 50x faster than the strict_date_optional_time parser.
After implementing said parser, I thought of Googling for "fast java date parser". I found the ethlo/itu repository, which provides a date parser much faster than the standard library. It was slightly slower than the parser I implemented, at about 35ns per parsed date. However, as far as I can tell, it cannot be as lenient in parsing as strict_date_optional_time (which might be a good thing). I decided to stick with my own parser for further testing.
Here are the full benchmark results:
Benchmark                     (dateString)              Mode  Cnt     Score    Error  Units
Benchmark.benchCharParser     2023-01-01T23:38:34.000Z  avgt    5    22.197 ±  0.229  ns/op
Benchmark.benchCharParser     1970-01-01T00:16:12.675Z  avgt    5    22.129 ±  0.170  ns/op
Benchmark.benchCharParser     5050-01-01T12:02:01.123Z  avgt    5    22.235 ±  0.288  ns/op
Benchmark.benchESParse        2023-01-01T23:38:34.000Z  avgt    5  1108.601 ± 16.202  ns/op
Benchmark.benchESParse        1970-01-01T00:16:12.675Z  avgt    5  1115.164 ± 40.150  ns/op
Benchmark.benchESParse        5050-01-01T12:02:01.123Z  avgt    5  1107.708 ± 14.095  ns/op
Benchmark.benchESStrictParse  2023-01-01T23:38:34.000Z  avgt    5   500.326 ±  3.116  ns/op
Benchmark.benchESStrictParse  1970-01-01T00:16:12.675Z  avgt    5   496.774 ±  4.574  ns/op
Benchmark.benchESStrictParse  5050-01-01T12:02:01.123Z  avgt    5   498.732 ±  4.134  ns/op
Benchmark.benchITUParser      2023-01-01T23:38:34.000Z  avgt    5    35.242 ±  1.965  ns/op
Benchmark.benchITUParser      1970-01-01T00:16:12.675Z  avgt    5    35.028 ±  0.434  ns/op
Benchmark.benchITUParser      5050-01-01T12:02:01.123Z  avgt    5    34.107 ±  0.367  ns/op
Benchmark.benchInstantParse   2023-01-01T23:38:34.000Z  avgt    5   470.640 ±  1.939  ns/op
Benchmark.benchInstantParse   1970-01-01T00:16:12.675Z  avgt    5   484.131 ±  4.044  ns/op
Benchmark.benchInstantParse   5050-01-01T12:02:01.123Z  avgt    5   482.272 ±  2.527  ns/op
Benchmark.benchRegex          2023-01-01T23:38:34.000Z  avgt    5   283.709 ±  4.295  ns/op
Benchmark.benchRegex          1970-01-01T00:16:12.675Z  avgt    5   271.189 ±  0.452  ns/op
Benchmark.benchRegex          5050-01-01T12:02:01.123Z  avgt    5   274.563 ±  7.238  ns/op
The source code for the benchmark is available in Benchmark.java of the repository above. The benchmark tests each solution with 3 different date strings.
Show me the code
The full code for the CharDateParser is in the repository above. I do not claim that this code is pretty. I also don't claim that it is the fastest possible implementation - I'm 100% certain that it is not. But it is much faster than the standard date parser, and it is not that much code.
This code parses dates the same way as the strict_date_optional_time parser, as far as I can tell. It likely misses some edge cases - I would happily accept PRs fixing those. I did set up unit tests, which verify that it behaves the same as the strict_date_optional_time parser for some 230,000 different randomly generated date strings with different supported patterns and time zones, so it can't be too far off.
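To give a flavour of the approach, here is a condensed, illustrative sketch that only handles the fully specified UTC form - the real CharDateParser in the repository handles the optional parts, other time zone offsets and many more error states:

import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class CharDateParserSketch {

    // Parses timestamps such as "2023-01-01T23:38:34.000Z" character by character.
    static Instant parse(String s) {
        int year = digits(s, 0, 4);
        expect(s, 4, '-');
        int month = digits(s, 5, 2);
        expect(s, 7, '-');
        int day = digits(s, 8, 2);
        expect(s, 10, 'T');
        int hour = digits(s, 11, 2);
        expect(s, 13, ':');
        int minute = digits(s, 14, 2);
        expect(s, 16, ':');
        int second = digits(s, 17, 2);
        int nanos = 0;
        int pos = 19;
        if (pos < s.length() && s.charAt(pos) == '.') {
            pos++;
            int fractionDigits = 0;
            while (pos < s.length() && s.charAt(pos) >= '0' && s.charAt(pos) <= '9') {
                nanos = nanos * 10 + (s.charAt(pos) - '0');
                fractionDigits++;
                pos++;
            }
            // Scale the fraction up to nanoseconds, e.g. ".123" -> 123_000_000.
            for (int i = fractionDigits; i < 9; i++) {
                nanos *= 10;
            }
        }
        expect(s, pos, 'Z');
        if (pos + 1 != s.length()) {
            throw new IllegalArgumentException("Trailing characters after 'Z' in " + s);
        }
        return OffsetDateTime.of(year, month, day, hour, minute, second, nanos, ZoneOffset.UTC).toInstant();
    }

    private static int digits(String s, int from, int count) {
        int value = 0;
        for (int i = from; i < from + count; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("Expected digit at position " + i + " in " + s);
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }

    private static void expect(String s, int pos, char expected) {
        if (pos >= s.length() || s.charAt(pos) != expected) {
            throw new IllegalArgumentException("Expected '" + expected + "' at position " + pos + " in " + s);
        }
    }
}

The trick is simply that the expected format is known up front, so the parser never has to backtrack or copy intermediate state the way the optional-section machinery in java.time does.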
Using the custom parser
I also tried to build Elasticsearch using my custom parser - it was fairly easy to get it working once I told
the Elasticsearch build setup that I didn't care about its sanity checks.
First, I tried to index 10,000,000 documents with each parser, clearing the data between runs.
With stock Elasticsearch, this took 1m25s (116,353 docs/s). With the custom parser, it took 1m12s
(138,308 docs/s). This is a ~18% improvement.
A minute and a half is a bit short to get a good sense of the long-term performance. Lucene spends more time merging as the index grows, and thus a smaller proportion of CPU would be spent on parsing new documents.
I tested indexing 200,000,000 documents - again clearing the data between runs. With stock Elasticsearch, this
took 29m5s (114,601 docs/s) and with the custom parser it took 26m36s (125,250 docs/s). This is a ~9%
improvement.
It is always comforting when benchmarks show that you can improve as much as the profiling predicted.
The integration I made was very hacky - not good enough for a PR to Elasticsearch. I will create an issue for Elasticsearch. I will also happily contribute to a PR, as long as that does not prevent me from also contributing to OpenSearch on this. I do believe that OpenSearch has the same performance issue, since most of the date parsing code is much older than the fork.
Edit, also 2023-11-13: I also opened an issue for OpenSearch, but it turns out that another person had already identified the same problem in #8361. They already had some POC code - my code and theirs have a lot of similarities - and it does look like theirs is also way faster than the standard library parsers.
Potential impact
Using a faster date parser can decrease indexing time for data with date fields quite a bit. How much depends on how much of your data consists of date strings. If each document has ten date fields and no significant text, date parsing will matter much more than if it has very long text fields and a single timestamp. The impact will also depend on whether indexing is CPU-bound or IO-bound - with slower storage or network, parsing speed might not matter.
Often, searches are made against nodes at the same time as they are indexing. This means that the CPU usage and memory pressure caused by indexing affect search performance. A faster date parser would thus also improve search performance - but it is really hard to predict the impact of that. With low indexing rates, the impact would probably be negligible.
Takeaways
I was honestly quite surprised by what I found when investigating for this blog post, in a few different ways.
I'm surprised that the standard java.time parsers performed so poorly. Java is clearly capable of parsing dates faster. To be fair, the standard library is way more flexible than my custom parser. But if it were more optimised, a lot of compute could be saved, since it is so widely used.
Given its slow performance, it surprises me that Elasticsearch (and probably OpenSearch) uses the standard library for date parsing. It is such low-hanging fruit for improving indexing performance. For some workloads this would barely matter, but for others it would be a huge improvement.
Lastly, I'm surprised that I didn't know about the slowness of the date parsers - even though I have worked with Java for 10+ years. I could easily have made the same mistake in my own code. This is why I think always-on profilers in production are such a great tool - I learn about performance problems like this quickly.
Parsing date strings is quite a common operation - there are probably millions of JVM
processes doing this across the world. Each of those processes uses energy - I can't help
but wonder how much electricity is used for parsing dates. I hope that it is not very much.
If you wonder whether you are spending your computing power on something this easily fixed, I do run a company that can help. Blunders.io is a profiler that you can run in production. I'm happy to help you understand your performance issues, regardless of whether you become a user or not. Reach out to me at anton@blunders.io if you are interested.