Elasticsearch Benchmarking, Part 1: Date parsing

Published 2023-11-13

This is the first post in a series where I intend to explore some benchmarks between Elasticsearch and OpenSearch. In this first post, we find we don't make it to the actual benchmarking - we get stuck at data ingestion.

tl;dr

It turns out that, when Elasticsearch (and probably OpenSearch) indexes (generated but realistic) log data, ~10% of CPU time and ~30% of memory allocations go to parsing date strings. This is with the input data and index template used in the benchmarks published by Elastic. I believe this applies to real-world data as well, especially logs.

This is due to date parsers in java.time being slow. We show that the date parsing can be done at least ~50x faster and thus improve indexing speeds by ~10% (assuming CPU-bound indexing). Read on for details.

The Benchmark Setup

Elasticsearch is a widely used search engine, built on top of Apache Lucene. A few months back, Elastic posted a blog post where they benchmarked Elasticsearch against OpenSearch (OpenSearch was forked from Elasticsearch in 2021).

I would like to give kudos to Elastic for publishing their method, very well documented in the elastic/elasticsearch-opensearch-benchmark GitHub repository.

Data

The data used for the benchmark can be generated by following the instructions in the repository above: a corpus generator tool produces the documents from templates in the benchmarking repository. The data looks a lot like log data from some cloud provider, although with some lorem ipsum in the message field. I believe this makes for a good benchmark - logs are a very common use case for both Elasticsearch and OpenSearch. Here is a sample:

{
  "@timestamp": "2023-01-01T05:09:40.000Z",
  "aws.cloudwatch": {
    "log_stream": "tulipcoyote",
    "ingestion_time": "2023-11-04T21:57:58.605Z",
    "log_group": "/var/log/messages"
  },
  "cloud": {
    "region": "us-east-2"
  },
  "log.file.path": "/var/log/messages/tulipcoyote",
  "input": {
    "type": "aws-cloudwatch"
  },
  "data_stream": {
    "namespace": "default",
    "type": "logs",
    "dataset": "generic"
  },
  "process": {
    "name": "systemd"
  },
  "message": "2023-11-04T21:57:58.605Z Nov 04 21:57:58 ip-88-190-95-155 systemd: snake biter fang spur moose ox warlock crusher zebra heron bat face glass rover quilldutchess",
  "event": {
    "id": "bevelchest",
    "ingested": "2023-11-04T21:38:47.605716183Z",
    "dataset": "generic"
  },
  "host": {
    "name": "longlegs"
  },
  "metrics": {
    "size": 3091,
    "tmin": 1
  },
  "agent": {
    "id": "baac7358-a449-4c36-bf0f-befb211f1d38",
    "name": "longlegs",
    "type": "filebeat",
    "version": "8.8.0",
    "ephemeral_id": "baac7358-a449-4c36-bf0f-befb211f1d38"
  },
  "tags": [
    "preserve_original_event"
  ]
}

I generated about 50GB of this data, which I used to test indexing.

Hardware & Software setup

I ran the benchmark on my desktop computer. It sports an AMD Ryzen 5 5600X CPU, 32 GB of RAM and a 2 TB NVMe SSD, which was used for testing. The benchmark is thus not fully representative of a production setup; it is only used to compare relative performance on the same machine. In my experience of running Elasticsearch, results from local testing usually carry over to production setups, assuming a large enough local dataset (usually a few tens of GB) and similar CPU/storage characteristics (SSDs will differ from HDDs).

I ran Elasticsearch 8.11.0, the latest version available at the time of writing. I gave it 8 GB of memory (-Xmx8G), which would be low for a production setup, but turned out to be plenty. It is worth noting that, since Elasticsearch uses the G1 Garbage Collector, a larger heap could also make GC pauses longer.

I applied the index templates and datastream config from the benchmarking repository. I then started indexing using a small Go program based on the bulk indexing example in the Go Elasticsearch Client. This turned out to be a joy to work with - the bulk clients have gotten a lot better with time.

Indexing & Profiling

While indexing, the Elasticsearch Java process used about 70% of the available CPU, while the indexing client process used ~20%, for a total utilisation of ~90%. This tells us that indexing, with this dataset, this index template and this hardware, is CPU-bound.

I discovered this by running the Blunders.io profiler on the Elasticsearch process. The profiler is based on Java Flight Recorder, which has very small overhead. The profiling data also shows that we have plenty of heap to spare, but that the memory allocation rate is quite high, at ~3 GB/s. The Garbage Collector (G1) does a good job, but has to work hard - the vertical lines in the allocation chart from the profile are Garbage Collections.

But what is Elasticsearch using this CPU and allocated memory for? This is where things get interesting.

Drilling Down

The following Flame Chart (or Icicle Chart, if you prefer) shows how CPU time was spent in different methods and the methods they call.

[Image: CPU flame chart of the Elasticsearch process during indexing]

The most interesting part is in IndexShard.applyIndexOperation and methods called by it. Roughly 50% of the total CPU time is used by Lucene to build its inverted index - this is in the IndexWriter.addDocument tree. What surprised me is that ~35% of the CPU time - almost as much as indexing - was spent on parsing JSON.

What surprised me more is that roughly 10% of the total CPU time was spent parsing date strings.

There are two different code paths for this, since there are two date fields in the input data with different levels of nesting. Both of them end up in DateTimeFormatter$ClassicFormat.parseObject, a Java standard library method for parsing dates.

Memory allocation

Let's have a look at the memory allocations done by the same code. Here we are looking at the amount of memory allocated, not the number of allocations.

[Image: memory allocation flame chart of the Elasticsearch process during indexing]

Here we can see the same split between Lucene and JSON parsing. However, Lucene is only responsible for ~8% of the allocated memory. This speaks to the large amount of effort that has been put into making Lucene memory-efficient (especially for indexing).

Instead, ~70% of all memory is allocated while parsing JSON, and ~32% of the total allocated memory goes to parsing date strings.

What is the date parser doing?

Let's dig in and understand a bit more about why date parsing takes so much time and memory. Elasticsearch has a date field type, which supports multiple date formats. The index template used in the benchmark does not specify a format, so the field falls back to the default of strict_date_optional_time||epoch_millis. The || means that Elasticsearch will first try to parse the date with the first format and, if that fails, try the next.
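As a rough illustration of that fallback behaviour, here is a minimal sketch - not Elasticsearch's actual implementation - using the standard ISO_DATE_TIME formatter as a stand-in for strict_date_optional_time (the two are close but not identical):

import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;

//Minimal sketch of the "format1||format2" fallback idea, not Elasticsearch's code.
public class FallbackDateParser {

    public static TemporalAccessor parse(String value) {
        try {
            //First attempt: an ISO-style date-time parser.
            return DateTimeFormatter.ISO_DATE_TIME.parse(value);
        } catch (DateTimeParseException e) {
            //Second attempt: treat the string as epoch_millis.
            return Instant.ofEpochMilli(Long.parseLong(value));
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("2023-01-01T05:09:40.000Z")); //parsed by the ISO-style format
        System.out.println(parse("1672549780000"));            //falls back to epoch_millis
    }
}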

The dates in the benchmark data are ISO formatted (well-formed logs, nice!), which means that the parser being used is strict_date_optional_time, defined in DateFormatters.java.

Isolating the code

I extracted this code into a separate repository for demonstration and benchmarking purposes. The code is available at antonha/es-date-parse-bench-nov-2023. This is what the CPU Flame Graph looks like if we only parse dates in a loop, using the same date parser:

[Image: CPU flame graph of the isolated date parsing loop]

This looks the same as the profiling data above, just more zoomed in. We can see that a lot of the time is spent in DateTimeFormatterBuilder$CompositePrinterParser, copying the Parsed object. This happens when the CompositePrinterParser parses optional parts of the text - presumably so that it can revert if the optional part turns out not to be there. strict_date_optional_time has 8 optional parts, with a max optional "depth" of 4 (which corresponds to the depth in the Flame Graph).
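To make those optional parts concrete, here is a sketch of how such a format is expressed with DateTimeFormatterBuilder. This is similar in spirit to, but not the actual definition of, strict_date_optional_time (see DateFormatters.java for that); every optionalStart()/optionalEnd() pair becomes a nested CompositePrinterParser that copies the intermediate Parsed state on entry.

import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

//Sketch of nested optional sections, similar in spirit to strict_date_optional_time
//but not the real Elasticsearch definition.
public class OptionalSectionsDemo {

    static final DateTimeFormatter FORMAT = new DateTimeFormatterBuilder()
            .appendValue(ChronoField.YEAR, 4)
            .optionalStart()                                   // -MM
            .appendLiteral('-')
            .appendValue(ChronoField.MONTH_OF_YEAR, 2)
            .optionalStart()                                   // -dd
            .appendLiteral('-')
            .appendValue(ChronoField.DAY_OF_MONTH, 2)
            .optionalStart()                                   // THH:mm
            .appendLiteral('T')
            .appendValue(ChronoField.HOUR_OF_DAY, 2)
            .appendLiteral(':')
            .appendValue(ChronoField.MINUTE_OF_HOUR, 2)
            .optionalStart()                                   // :ss
            .appendLiteral(':')
            .appendValue(ChronoField.SECOND_OF_MINUTE, 2)
            .optionalStart()                                   // fractional seconds
            .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, true)
            .optionalEnd()                                     // fractional seconds
            .optionalEnd()                                     // :ss
            .optionalStart()                                   // zone or offset, e.g. Z
            .appendZoneOrOffsetId()
            .optionalEnd()                                     // zone or offset
            .optionalEnd()                                     // THH:mm
            .optionalEnd()                                     // -dd
            .optionalEnd()                                     // -MM
            .toFormatter();

    public static void main(String[] args) {
        System.out.println(FORMAT.parse("2023-01-01T23:38:34.000Z"));
        System.out.println(FORMAT.parse("2023-01-01")); //optional parts simply absent
    }
}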

Can we do better?

It seemed to me like this could be done faster. In order to benchmark some alternatives, I set up a JMH benchmark in the previously mentioned antonha/es-date-parse-bench-nov-2023. JMH is a benchmarking tool provided by OpenJDK.

I benchmarked parsing a few different ISO 8601-formatted date-time strings, all with UTC time zone (e.g. 2023-01-01T23:38:34.000Z). This is probably the most common format for logs, and worth optimising for.
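The benchmark methods follow the usual JMH shape - roughly the sketch below, where CharDateParser is the hand-written parser shown later in this post. The real Benchmark.java in the repository wires up all the variants and may differ in detail.

import java.time.Instant;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

//Rough sketch of the JMH setup - not the repository's Benchmark.java verbatim.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class DateParseBenchmark {

    @Param({
        "2023-01-01T23:38:34.000Z",
        "1970-01-01T00:16:12.675Z",
        "5050-01-01T12:02:01.123Z"
    })
    public String dateString;

    @Benchmark
    public Object benchInstantParse() {
        return Instant.parse(dateString);
    }

    @Benchmark
    public Object benchCharParser() {
        return CharDateParser.parse(dateString);
    }
}

In short, the results were: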

  • The strict_date_optional_time parser ran at about 1100ns per parsed date.
  • I tried using the strict_date_time parser, which has none of the optional stuff and thus might be faster. It was faster, but only about 2x at ~500ns per parsed date. This is just slightly slower than Instant.parse, a standard library method that should do the same thing.
  • I wanted to see if Regex could be faster, so I implemented a very basic variant. It clocked in at roughly 270-285ns per parsed date.
  • I then decided to write a straightforward parser, essentially just a bunch of String.charAt calls with if-statements. The parser became gradually less simple as I handled more error states, and it ended up at roughly 200 lines of code. It runs at about 22ns per parsed date, about 50x faster than strict_date_optional_time.
  • After implementing said parser, I thought of Googling for "fast java date parser". I found the ethlo/itu repository, which provides a date parser much faster than the standard library. It was slightly slower than the parser I implemented, at about 35ns per parsed date. However, as far as I can tell, it cannot be as lenient in parsing as strict_date_optional_time (this might be a good thing). I decided to keep using my own parser for further testing; a rough usage sketch of the ITU parser follows after this list.
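For reference, using the ITU parser looks roughly like the sketch below. This is from memory of the ethlo/itu README - the ITU.parseDateTime entry point is an assumption on my part, so check the repository for the exact API.

import java.time.OffsetDateTime;

import com.ethlo.time.ITU;

//Sketch of parsing with ethlo/itu. The ITU.parseDateTime entry point is assumed
//from the project README - verify against the repository before relying on it.
public class ItuParseExample {

    public static void main(String[] args) {
        OffsetDateTime parsed = ITU.parseDateTime("2023-01-01T23:38:34.000Z");
        System.out.println(parsed.toInstant());
    }
}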

Here are the full benchmark results:

Benchmark (dateString) Mode Cnt Score Error Units
Benchmark.benchCharParser 2023-01-01T23:38:34.000Z avgt 5 22.197 ± 0.229 ns/op
Benchmark.benchCharParser 1970-01-01T00:16:12.675Z avgt 5 22.129 ± 0.170 ns/op
Benchmark.benchCharParser 5050-01-01T12:02:01.123Z avgt 5 22.235 ± 0.288 ns/op
Benchmark.benchESParse 2023-01-01T23:38:34.000Z avgt 5 1108.601 ± 16.202 ns/op
Benchmark.benchESParse 1970-01-01T00:16:12.675Z avgt 5 1115.164 ± 40.150 ns/op
Benchmark.benchESParse 5050-01-01T12:02:01.123Z avgt 5 1107.708 ± 14.095 ns/op
Benchmark.benchESStrictParse 2023-01-01T23:38:34.000Z avgt 5 500.326 ± 3.116 ns/op
Benchmark.benchESStrictParse 1970-01-01T00:16:12.675Z avgt 5 496.774 ± 4.574 ns/op
Benchmark.benchESStrictParse 5050-01-01T12:02:01.123Z avgt 5 498.732 ± 4.134 ns/op
Benchmark.benchITUParser 2023-01-01T23:38:34.000Z avgt 5 35.242 ± 1.965 ns/op
Benchmark.benchITUParser 1970-01-01T00:16:12.675Z avgt 5 35.028 ± 0.434 ns/op
Benchmark.benchITUParser 5050-01-01T12:02:01.123Z avgt 5 34.107 ± 0.367 ns/op
Benchmark.benchInstantParse 2023-01-01T23:38:34.000Z avgt 5 470.640 ± 1.939 ns/op
Benchmark.benchInstantParse 1970-01-01T00:16:12.675Z avgt 5 484.131 ± 4.044 ns/op
Benchmark.benchInstantParse 5050-01-01T12:02:01.123Z avgt 5 482.272 ± 2.527 ns/op
Benchmark.benchRegex 2023-01-01T23:38:34.000Z avgt 5 283.709 ± 4.295 ns/op
Benchmark.benchRegex 1970-01-01T00:16:12.675Z avgt 5 271.189 ± 0.452 ns/op
Benchmark.benchRegex 5050-01-01T12:02:01.123Z avgt 5 274.563 ± 7.238 ns/op

The source code for the benchmark is available in Benchmark.java of the repository above. The benchmark tests each solution with 3 different date strings.

Show me the code

The code for the CharDateParser is below. I do not claim that this code is pretty. I also don't claim that it is the fastest possible implementation - I'm 100% certain that it is not. But it is much faster than the standard date parser, and it is not that much code.

import java.time.DateTimeException;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.TemporalAccessor;

public class CharDateParser {

    public static TemporalAccessor parse(String dateString) {

        //Year
        if(dateString.length() < 4) {
            return null;
        }
        char y0 = dateString.charAt(0);
        int y1 = charToInt(dateString.charAt(1));
        int y2 = charToInt(dateString.charAt(2));
        int y3 = charToInt(dateString.charAt(3));
        if(y0 < '0' || y0 > '9' || y1 < 0 || y1 > 9 || y2 < 0 || y2 > 9 || y3 < 0 || y3 > 9) {
            return null;
        }
        int year = charToInt(y0) * 1000 + y1 * 100 + y2 * 10 + y3;


        //Month
        if(dateString.length() == 4 || dateString.charAt(4) != '-') {
            return LocalDate.of(year, 1, 1);
        }
        int m0 = charToInt(dateString.charAt(5));
        int m1 = charToInt(dateString.charAt(6));
        if(m0 < 0 || m0 > 1 || m1 < 0 || m1 > 9) {
            return LocalDate.of(year, 1, 1);
        }
        int month = m0 * 10 + m1;

        //Day
        if(dateString.length() == 7 || dateString.charAt(7) != '-') {
            return LocalDate.of(year, month, 1);
        }
        int d0 = charToInt(dateString.charAt(8));
        int d1 = charToInt(dateString.charAt(9));
        if(d0 < 0 || d0 > 3 || d1 < 0 || d1 > 9) {
            return LocalDate.of(year, month, 1);
        }
        int day = d0 * 10 + d1;

        //Hour
        if(dateString.length() == 10 || dateString.charAt(10) != 'T') {
            return LocalDate.of(year, month, day);
        }
        int h0 = charToInt(dateString.charAt(11));
        int h1 = charToInt(dateString.charAt(12));
        if(h0 < 0 || h0 > 9 || h1 < 0 || h1 > 9) {
            return LocalDate.of(year, month, day);
        }
        int hour = h0 * 10 + h1;

        //Minute
        if(dateString.length() == 13 || dateString.charAt(13) != ':') {
            return withZone(
                year, month, day, hour, 0, 0, 0,
                dateString, 13
            );
        }
        int mi0 = charToInt(dateString.charAt(14));
        int mi1 = charToInt(dateString.charAt(15));
        if(mi0 < 0 || mi0 > 5 || mi1 < 0 || mi1 > 9) {
            return withZone(
                year, month, day, hour, 0, 0, 0,
                dateString, 13
            );
        }
        int minute = mi0 * 10 + mi1;

        //Second
        if(dateString.length() == 16 || dateString.charAt(16) != ':') {
            return withZone(
                year, month, day, hour, minute, 0, 0,
                dateString, 16
            );
        }
        int s0 = charToInt(dateString.charAt(17));
        int s1 = charToInt(dateString.charAt(18));
        if(s0 < 0 || s0 > 5 || s1 < 0 || s1 > 9) {
            return withZone(
                year, month, day, hour, minute, 0, 0,
                dateString, 16
            );
        }
        int second = s0 * 10 + s1;

        if(dateString.length() == 19 || dateString.charAt(19) != '.') {
            return withZone(
                year, month, day, hour, minute, second, 0,
                dateString, 19
            );
        }
        int nanos = 0;
        int pos = 20;
        while (pos < dateString.length()) {
            int num = charToInt(dateString.charAt(pos));
            if(num < 0 || num > 9) {
                break;
            }
            nanos = nanos * 10 + num;
            pos++;
        }
        int j = pos - 20;
        while(j < 9 ) {
            nanos *= 10;
            j++;
        }
        return withZone(
            year, month, day, hour, minute, second, nanos,
            dateString, pos
        );
    }

    private static TemporalAccessor withZone(
        int year, int month, int day,
        int hour, int minute, int second, int nanos,
        String dateString, int pos
    ) {
        if (pos == dateString.length()) {
            return LocalDateTime.of(year, month, day, hour, minute, second, nanos);
        }
        ZoneId zone = parseZone(dateString, pos);
        if (zone != null) {
            return ZonedDateTime.of(
                year, month, day, hour, minute, second, nanos,
                zone
            );
        } else {
            return LocalDateTime.of(year, month, day, hour, minute, second, nanos);
        }
    }

    private static ZoneId parseZone(String dateString, int pos) {
        //Fast-track for UTC, since we don't need to create a substring.
        //Would love to have slices in Java.
        if(dateString.charAt(pos) == 'Z') {
            return ZoneOffset.UTC;
        }
        try {
            return ZoneId.of(dateString.substring(pos));
        } catch (DateTimeException e) {
            return null;
        }
    }

    private static int charToInt(char c) {
        return c - '0';
    }
}

This code parses dates the same way as the strict_date_optional_time parser, as far as I can tell. It likely misses some edge cases - I would happily accept PRs fixing those. I did set up unit tests, which verify that it behaves the same as the strict_date_optional_time parser for some 230,000 different randomly generated date strings with different supported patterns and time zones, so it can't be too far off.
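The tests in the repository compare against the actual Elasticsearch formatter. As a standalone illustration of the idea, here is a minimal sketch that uses Instant.parse as the reference for fully-specified UTC timestamps - an assumption made to keep the example self-contained; the real tests also cover partial dates and other time zones.

import java.time.Instant;
import java.time.ZonedDateTime;
import java.util.Random;

//Standalone sketch of a comparison test: for fully-specified UTC timestamps,
//CharDateParser should agree with Instant.parse. The tests in the repository
//compare against the strict_date_optional_time formatter instead.
public class CharDateParserComparisonTest {

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 10_000; i++) {
            //Random instant between 1970 and roughly the year 2100.
            long epochMilli = Math.floorMod(random.nextLong(), 4_102_444_800_000L);
            String dateString = Instant.ofEpochMilli(epochMilli).toString();
            Instant expected = Instant.parse(dateString);
            Instant actual = ((ZonedDateTime) CharDateParser.parse(dateString)).toInstant();
            if (!expected.equals(actual)) {
                throw new AssertionError("Mismatch for " + dateString + ": " + expected + " vs " + actual);
            }
        }
        System.out.println("All generated timestamps parsed identically");
    }
}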

Using the custom parser

I also tried to build Elasticsearch using my custom parser - it was fairly easy to get it working once I told the Elasticsearch build setup that I didn't care about its sanity checks.

  • First, I tried to index 10,000,000 documents with each parser, clearing the data between runs. With stock Elasticsearch, this took 1m25s (116,353 docs/s). With the custom parser, it took 1m12s (138,308 docs/s). This is a ~18% improvement.
  • A bit over a minute is too short to get a good sense of the long-term performance. Lucene spends more time merging as the index grows, and thus a smaller proportion of CPU would be spent on parsing new documents. I therefore tested indexing 200,000,000 documents - again clearing the data between runs. With stock Elasticsearch, this took 29m5s (114,601 docs/s), and with the custom parser it took 26m36s (125,250 docs/s). This is a ~9% improvement.

It is always comforting when benchmarks show that you can improve as much as the profiling predicted.

The integration I made was very hacky - not good enough for a PR to Elasticsearch. I will create an issue for Elasticsearch instead. I will also happily contribute to a PR, as long as that does not prevent me from also contributing to OpenSearch on this. I do believe that OpenSearch has the same performance issue, since most of the date parsing code predates the fork.

Edit, 2023-11-13: Issue for Elasticsearch opened

Edit, also 2023-11-13: I also opened an issue for OpenSearch, but it turns out that another person had already identified the same problem in #8361. They already had some POC code - my code and theirs have a lot of similarities - and it looks like theirs is also way faster than the standard library parsers.

Potential impact

Using a faster date parser can decrease indexing time quite a bit for data with date fields. How much depends on how large a share of your data consists of date strings: if you have ten date fields and no significant text, date parsing will matter much more than if you have very long text fields and a single timestamp. The impact also depends on whether indexing is CPU-bound or IO-bound - with slower storage or network, parsing speed might not matter.

Often, searches run against nodes at the same time as they are indexing. This means that the CPU usage and memory pressure caused by indexing affect search performance. A faster date parser would thus also improve search performance - though that impact is really hard to predict. With low indexing rates, it would probably be negligible.

Takeaways

I was honestly quite surprised about what I found when investigating for this blog post, in a few different ways.

  • I'm surprised that the standard java.time parsers performed so poorly. Java is clearly capable of parsing dates faster. To be fair, the standard library is way more flexible than my custom parser. But if it were more optimised, a lot of compute could be saved, since it is so widely used.
  • Given its slow performance, it surprises me that Elasticsearch (and probably OpenSearch) uses the standard library for date parsing. It is such a low-hanging fruit for improving indexing performance. For some this would barely matter, but for others it would be a huge improvement.
  • Lastly, I'm surprised that I didn't know about the slowness of these date parsers - even though I have worked with Java for 10+ years. I could easily have made the same mistake in my own code. This is why I think always-on profilers in production are such a great tool - they let me learn about performance issues like this quickly.

Parsing date strings is a quite common operation - there are probably millions of JVM processes doing this across the world. Each of those processes uses energy - I can't help but wonder how much electricity is used for parsing dates. I hope that it is not very much.

If you wonder whether you are spending your computing power on something as easily fixed, I run a company that can help. Blunders.io is a profiler that you can run in production. I'm happy to help you understand your performance issues, regardless of whether you become a user or not. Reach out to me at anton@blunders.io if you are interested.