Lucene is a widely used Java library for free text search. It powers search tools such as Solr and Elasticsearch, but is also used to add free text search to e.g. Neo4j. There is a set of nightly benchmarks, set up by Mike McCandless, that aims to make sure that there are no (unexpected) performance regressions. [1][2]
The performance benchmarks test both indexing and searching. That there exists a recurring
benchmark for such a performance-sensitive library is fantastic. That it is being
actively used and acted upon by Lucene contributors is even better. People take action when
performance degrades, and try to figure out why.
In this post, we'll be looking at profiling data from Flight Recorder for these benchmarks. [3] The data is from Lucene's master branch on 2021-01-10, commit
7e94a56e815f28419da61eaee1e13b92a9652338.
I tried to re-create the nightly benchmarks to my best ability, but I might have made some
mistakes.
I have run the benchmarks with Flight Recorder enabled, and analyzed the profiling data using
Blunders (the page you are on). If you would like to profile your applications in real time,
please reach out to me. But for now, let's dig into the data.
Indexing
Let's start with indexing. This benchmark indexes about 10M documents from the English
Wikipedia.
A first observation is that the CPU utilization starts at around 25% and rises to 50% for a while. 25% here means 25% of the available CPU cores, of which my not-exactly-brand-new desktop has 4 (counting hyper-threading).
We can conclude that whatever happens in the first part of the benchmark is most likely
CPU-bound (on this machine) and single-threaded. The second part looks like it uses two threads.
Feel free to click around in the chart a bit.
[Chart: CPU utilization (0% to 100%), allocation rate, and garbage collections during indexing, 12:55 to 13:15]
One other interesting observation from this chart is that the allocation rate (i.e. the rate at which new objects are allocated in the JVM) is rather manageable, mostly staying in the range of 170 to 300 megabytes per second. The chart also shows garbage collections: you would probably have to zoom in to see them though, since the garbage collector (G1 in this case) has no issues keeping up with the allocation.
CPU Drill-down
So, where is that CPU spent? Let's drill down in an (upside down) flame graph. This is essentially a stack trace, where the parent (above) calls its children (below). The width of an item shows how much time the JVM spent in that code path. You can click on items to see only that part of the chart.
[Flame graph: CPU time during indexing]
In order to make sense of this drill-down, it helps to know that Lucene indexing works by first indexing documents into small, immutable segments and then merging those segments into larger segments.
Knowing that, we can figure out that roughly 72% of the CPU time is spent indexing into the initial segments and about 25% of the CPU time is spent merging. These numbers vary over time; in the beginning much less merging happens. This also accounts for the different CPU usage in the chart above; in this benchmark, the indexing itself is limited to one thread.
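To make the segment model concrete, here is a minimal sketch of indexing with Lucene's IndexWriter (assuming a recent Lucene version); the field name and index path are made up for illustration. Documents are buffered and flushed into small segments, which the writer's merge policy later combines into larger ones.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexingSketch {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("body", "text of one Wikipedia article", Field.Store.NO));
      writer.addDocument(doc); // buffered in memory, flushed as a small, immutable segment
      writer.commit();         // the merge policy later combines small segments into larger ones
    }
  }
}
```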
HNSWhat?
Interestingly, in both the initial indexing and the merging, a significant portion of the CPU (remember that we are CPU-bound) is spent building something called an HNSW graph. So, what is that?
It turns out that Hierarchical Navigable Small World graphs are data structures useful for searching based on document similarity. They were recently merged into Lucene's master branch. They seem to have been added to the Lucene nightly benchmarks on 2020-12-09 and promptly caused the indexing throughput to drop abruptly. This is consistent with what the profiler tells us. [4]
That the HNSW graphs are included in the indexing benchmark is of course great news for anyone
wanting to use them in the future; it highlights performance issues in the implementation. But
it is probably a bit problematic that it is such a big part of the benchmark: any other
indexing performance degradation or improvement would seem to be much less significant.
So, for users who do not want HNSW graphs, the benchmarks become less useful.
HNSW.search (and its children)
A large part of the time spent building HNSW graphs seems to come from a method called HnswGraph.search, which appears to find the neighbours of a document in order to insert the document in the correct place in the graph. The same method is also used at search time.
A lot of time seems to be spent dealing with SparseFixedBitSets, which here seem to be used to keep track of which nodes have already been seen (nodes are identified by Java ints). If I were to try to improve the performance, I would explore whether a simpler BitSet implementation might help. My understanding is that the BitSet's size is O(num_docs) in a segment (when used for indexing). A naive, dense BitSet uses O(entries) bits. Even for relatively large indices, this might be fine if the BitSet were re-used when building the graph.
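As a sketch of that idea (not a patch against Lucene's actual HNSW code), a dense FixedBitSet sized to the segment's maxDoc could be allocated once and cleared between searches; the VisitedSet helper below is hypothetical.

```java
import org.apache.lucene.util.FixedBitSet;

// Hypothetical helper: a dense, re-usable "visited nodes" set for graph building.
final class VisitedSet {
  private FixedBitSet bits;

  // Returns a bit set large enough for the segment, re-using the previous
  // allocation when possible instead of creating a new (sparse) set per search.
  FixedBitSet forSegment(int maxDoc) {
    if (bits == null || bits.length() < maxDoc) {
      bits = new FixedBitSet(maxDoc);   // O(maxDoc) bits, allocated once
    } else {
      bits.clear(0, bits.length());     // clearing is cheap compared to re-allocating
    }
    return bits;
  }
}
```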
Another code path where a lot of time is spent is the dot product used to see how similar two documents are. The inner loop here is unrolled, indicating that the author probably knew that this was a hot loop. One thing I am curious about is whether the author considered Math.fma - it should in theory be faster on CPUs with FMA instructions. [5]
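For reference, a fused version of such a loop could look like the sketch below; this illustrates the idea rather than the actual Lucene implementation, which unrolls the loop manually.

```java
// Dot product using Math.fma: each iteration computes a[i] * b[i] + sum as a
// single fused multiply-add, which the JIT can map to one FMA instruction on
// supporting CPUs.
static float dotProduct(float[] a, float[] b) {
  float sum = 0f;
  for (int i = 0; i < a.length; i++) {
    sum = Math.fma(a[i], b[i], sum);
  }
  return sum;
}
```

Note that on CPUs without FMA support, Math.fma falls back to a slower software path, so it is only likely to help where the instruction is available.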
Surprises
The observant reader might have noticed that part of the CPU time spent estimating the RAM usage of the SparseFixedBitSets was reported in the LinkedHashMap.get method. This is weird, since the RamUsageEstimator.shallowSizeOfArray method does not use a LinkedHashMap. It does use an IdentityHashMap - but those are not the same thing. Java Mission Control reports the same thing. This is rather disturbing - I believe that it is an instance of the OpenJDK bug JDK-8201516, which says that the debug symbols emitted by the JVM are sometimes wrong. [6]
So, can we trust Flight Recorder? It seems like not 100%, at least. On the other hand, I have used it for many years and have been able to predict how to improve performance. My personal guess is that it is likely decently accurate from a bird's-eye view, but that it might mess up a bit when the JIT starts inlining. But what do I know? I've never touched the OpenJDK source code. But maybe I should.
Searching
Search performance is arguably more important than indexing performance; after all, that's why we have indices (as opposed to just using grep). Searches are made on 2 cores (out of 4) in this benchmark, and the profiling data makes it evident that this is CPU-bound.
The test desktop has a decently new SSD, but it's nothing fancy. Thus it might be surprising that CPU seems to be the scarcest resource; one might assume that we would be more IO-bound.
Since the search benchmark tests many different kinds of queries, it's hard to point out a single source of slowness:
[Flame graph: CPU time during searching]
This points to a fundamental rule of performance: it depends on the input. Different kinds of Lucene queries end up in different query paths and have vastly different performance characteristics. These nightly benchmarks are very good at predicting the speed of e.g. certain phrase queries, but they can't predict why your Kibana instance is slow.
Memory
Memory allocation in these nightly benchmarks is very low, at only about 115 MB/s. The worst I have seen from Lucene is somewhere around 3 GB/s, though that was obviously with very different queries. By looking at the allocation breakdown, we can see one way to cut the allocations roughly in half:
[Chart: allocation breakdown during searching]
The TermGroupSelector does some kind of grouping of terms, which are identified by an integer. Now, when you insert an int into a regular Java collection, it is boxed into an Integer object, since the collection does not deal with primitives. This boxing is what's causing 50% of the memory allocations. There are libraries that provide collections for primitives, e.g. fastutil. Using something like that could help. [7]
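To illustrate the difference, here is a small, hypothetical sketch (with made-up map contents, not the actual TermGroupSelector code) of a boxed map versus a primitive-specialized fastutil map:

```java
import java.util.HashMap;
import java.util.Map;
import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;

class BoxingSketch {
  static void sketch() {
    // Standard collections: each int key or value is autoboxed into an Integer
    // object, which is where the allocations come from.
    Map<Integer, Integer> boxed = new HashMap<>();
    boxed.put(1234, 1); // allocates an Integer for the key (only -128..127 are cached)

    // A primitive-specialized map from fastutil stores the ints directly,
    // avoiding the per-entry boxing.
    Int2IntOpenHashMap primitive = new Int2IntOpenHashMap();
    primitive.put(1234, 1); // no Integer allocation
  }
}
```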
Thoughts
This has mostly been me spelunking into code that I don't work with. Reading code is great for
learning how to build systems - maybe looking at many profile results is good for learning how
to work with performance?
As mentioned, I have worked with Lucene before. The profiling data from there looked nothing like the profiling data in this benchmark; it allocated way more memory. Both the documents and the queries were different.
Choosing what to include in a benchmark specifies what kind of performance your project cares about. A reasonable approach is to guess what kinds of queries are the most common in real-world usage; I believe that this is what Lucene does. But it also means that fringe use-cases are not always covered. If you have problems with Lucene, Elasticsearch, or for that matter _any_ program, I highly encourage you to profile it.
Finally, I would like to thank everyone who has contributed to Lucene. It's a great library.