Lucene is a widely used Java library for free text search library. It powers more directly used search tools such as Solr and Elasticsearch, but is also used to add free text search to e.g. Neo4j. There is a set of nightly benchmarks set up by Mike McCandless that aims to make sure that there is no (unexpected) performance regressions.  
The performance benchmarks test both indexing and searching. That there exists a recurring benchmark for such a performance-sensitive library is fantastic. That it is being actively used and acted upon by Lucene contributors is even better. People take action when performance degrades, and try to figure out why.
In this post, we'll be looking at profiling data, from Flight Recorder, of these benchmarks.  The data is from Lucene's master branch on 2021-01-10, commit 7e94a56e815f28419da61eaee1e13b92a9652338. I tried to re-create the nightly benchmarks to my best ability, but I might have made some mistakes.
I have run the benchmarks with Flight Recorder enabled, and analyzed the profiling data using Blunders (the page you are on). If you would like to profile your applications in real time, please reach out to me. But for now, let's dig into the data.
Let's start with indexing. This benchmark indexes about 10M documents from the English Wikipedia. A first observation is that the CPU utilization starts at around 25% and rises to 50% for a while. 25% means 25% of the available CPU cores, which my not-exactly-brand-new desktop has 4 of (counting hyper threading).
We can conclude that whatever happens in the first part of the benchmark is most likely CPU-bound (on this machine) and single-threaded. The second part looks like it uses two threads. Feel free to click around in the chart a bit.
One other interesting observation from this chart is that the allocation rate (i.e. allocations of new objects in the JVM is rather manageable, mostly staying in the range of 170 to 300 Megabytes per second. The chart also shows garbage collections: you would probably have to zoom in to see them though, since the garbage collector (G1 in this case) has no issues keeping up with the allocation.
So, where is that CPU spent? Lets drill down in an (upside down) flame graph. This is essentially a stack trace, where the parent (above) calls it's children (below). The width of the item shows how often the JVM spent time in that code path. You can click on items to see only that part of the chart.
In order to make sense of this drill-down, it helps to know that Lucene indexing works by first indexing documents into small, immutable, segments and then merging those segments into larger segments.
Knowing that, we can figure out that roughly 72% of the CPU time is spent in indexing the first segment and about 25% of the CPU time is spent merging. These numbers differ over, in the beginning much less merging happens. This also accounts for the different CPU usages in the chart above; in this benchmark, indexing is limited to one thread.
Interestingly, in both the initial indexing and merging a significant portion of the CPU (remember that we are CPU-bound) is spent in building something called a HNSW Graph. So, what is that?
Turns out that Hierarchical Navigable Small World graphs are data structures, useful for searching based on document similarity. They were recently merged into Lucene's master branch. They seem to have been added to the Lucene nightly benchmarks on 2020-12-09 and promptly cause the indexing throughput to drop abruptly. This is consistent with what the profiler tells us. 
That the HNSW graphs are included in the indexing benchmark is of course great news for anyone wanting to use them in the future; it highlights performance issues in the implementation. But it is probably a bit problematic that it is such a big part of the benchmark: any other indexing performance degradation or improvement would seem to be much less significant. So, for users who do not want HNSW graphs, the benchmarks become less useful.
A large part of the time in building HNSW graphs seems to come from a method called HnswGraph.search, which seems to be finding neighbours to a document, in order to be able to insert the document in the correct place in the graph. The same method is actually used when at search-time.
A lot of time seems to be spent in dealing with SparseFixedBitSets, which here seems to be used to keep track of which nodes have already been seen (nodes are identified by java ints). If I were to try to improve the performance, I would explore if a simpler BitSet implementation might help performance. My understanding is that the BitSet's size is O(num_docs) in a segment (when used for indexing). A naive, dense BitSet uses O(entries) bits. Even for relative large indices, this might be fine if the BitSet was re-used when building the graph.
Another code path where a lot of time is spent is in doing a dot product to see how similar two documents are. The inner loop here is unrolled, indicating that the author probably knew that this was a hot loop. One thing that I am curious of is whether the author considered Math.fma - it should in theory be faster on CPUs with FMA instructions. 
The observant reader might have noticed that part of the CPU time was reported when estimating the RAM usage for SparseFixedBitSets, by the LinkedHashMap.get method. This is weird, since the the RamUsageEstimator.shallowSizeOfArray method does not use a LinkedHashMap. It does use a IdentityHashMap - but those are not the same thing. Java Mission Control reports the same thing. This is rather disturbing - I believe that it is an instance of an OpenJDK bug JDK-8201516, which says that the debug symbols emitted by the JVM sometimes are wrong. 
So, can't we trust the Flight Recorder? It seems like no, not to 100% at least. On the other hand I have used it for many years and have been able to predict how to improve performance. My personal guess is that it likely is decently accurate from a birds-eye view, but that it might mess up a bit when the JIT starts inlining. But what do I know? I've never touched the OpenJDK source code. But maybe I should.
Search performance is arguably more important than indexing performance; after all, that's why we have indices (as opposed to just using grep). Searches are made on 2 cores (out of 4) in this benchmark, and the profiling data makes it evident that this is CPU-bound:
The test desktop is running a decently new SSD, but it's nothing fancy. Thus it might be surprising that CPU usage is what seems to be the most scarce resource; one might assume that we would be more IO-bound.
Since the search benchmarks tests many different kinds of queries, it's not obvious to point out a single source of slowness:
This points to a fundamental rule of performance; it depends on input. Different kinds of Lucene queries will end up in different query paths and have vastly different performance characteristic. These nightly benchmarks is very good at predicting the speed of e.g. certain phrase queries, but they can't predict why your Kibana instance is slow.
Memory allocations in these nightly benchmarks is very low, at only about 115 MB/s. The worst I have seen from Lucene is somewhere around 3GB/s. This was obviously with very different queries.
By looking at the allocation breakdown, we can see one way to allocate about half the amount of memory:
The TermGroupSelector does some kind of grouping of terms, which are identified as an integer:
private final Map<Integer, Integer> ordsToGroupIds = new HashMap<>();
Now, when you insert a int into a regular Java collection it is casted to a Integer object, since the Collection does not deal with primitives. This cast is what's causing 50% of the memory allocations. There are libraries that provide collections for primitives, e.g. fastutil. Using something like that could help. 
This has mostly been me spelunking into code that I don't work with. Reading code is great for learning how to build systems - maybe looking at many profile results is good for learning how to work with performance?
As mentioned, I have worked with Lucene before. The profiling data from there looked nothing like profiling data in the benchmark; it allocated way more memory. Both documents and queries were different.
Choosing what to include in a benchmark specifies what kind of performance that your project cares about. A reasonable approach to do this is to guess what kinds of queries are the most common in real-world usage; I believe that this is what Lucene does. But it also means that fringe use-cases are not always covered. If you have problems with Lucene, Elasticsearch, or for that matter _any_ program, I highly encourage you to profile it.
Finally, I would like to thank everyone who has contributed to Lucene. It's a great library.