2021-01-12 - What does Lucene Benchmark?

Lucene is a widely used Java library for free text search. It powers search tools that are used more directly, such as Solr and Elasticsearch, but is also used to add free text search to e.g. Neo4j. There is a set of nightly benchmarks, set up by Mike McCandless, that aims to make sure that there are no (unexpected) performance regressions. [1] [2]

The benchmarks test both indexing and searching. That a recurring benchmark exists for such a performance-sensitive library is fantastic. That it is actively used and acted upon by Lucene contributors is even better: people take action when performance degrades, and try to figure out why.

In this post, we'll be looking at profiling data, from Flight Recorder, of these benchmarks. [3] The data is from Lucene's master branch on 2021-01-10, commit 7e94a56e815f28419da61eaee1e13b92a9652338. I tried to re-create the nightly benchmarks to the best of my ability, but I might have made some mistakes.

I have run the benchmarks with Flight Recorder enabled, and analyzed the profiling data using Blunders (the page you are on). If you would like to profile your applications in real time, please reach out to me. But for now, let's dig into the data.

Indexing

Let's start with indexing. This benchmark indexes about 10M documents from the English Wikipedia. A first observation is that CPU utilization starts at around 25% and later rises to 50% for a while. Here, 25% means 25% of the available CPU cores, of which my not-exactly-brand-new desktop has 4 (counting hyper-threading).

We can conclude that whatever happens in the first part of the benchmark is most likely CPU-bound (on this machine) and single-threaded. The second part looks like it uses two threads. Feel free to click around in the chart a bit.

[Chart: CPU utilization (0-100%), allocation rate, and garbage collections, 12:55 to 13:15]

Another interesting observation from this chart is that the allocation rate (i.e. the rate at which new objects are allocated in the JVM) is rather manageable, mostly staying in the range of 170 to 300 megabytes per second. The chart also shows garbage collections, though you would probably have to zoom in to see them, since the garbage collector (G1 in this case) has no issues keeping up with this allocation rate.

CPU Drill-down

So, where is that CPU time spent? Let's drill down in an (upside down) flame graph. This is essentially an aggregated stack trace, where the parent (above) calls its children (below). The width of an item shows how much time the JVM spent in that code path. You can click on an item to see only that part of the chart.

Total
IndexThreads$IndexThread.run
IndexWriter.addDocument
IndexWriter.updateDocument
IndexWriter.updateDocuments
DocumentsWriter.updateDocuments
DocumentsWriter.postUpdate
DocumentsWriter.doFlush
DocumentsWriterPerThread.flush
IndexingChain.flush
IndexingChain.writeVectors
VectorValuesWriter.flush
Lucene90VectorWriter.writeField
Lucene90VectorWriter.writeGraph
HnswGraphBuilder.build
HnswGraphBuilder.addGraphNode
HnswGraph.search
SparseFixedBitSet.<init>
RamUsageEstimator.shallowSizeOf
RamUsageEstimator.shallowSizeOfArray
Collections$UnmodifiableMap.get
LinkedHashMap.get
HashMap.getNode
RamUsageEstimator.alignObjectSize
NeighborQueue.add
LongHeap.push
ArrayUtil.grow
ArrayUtil.oversize
VectorValues$SearchStrategy.compare
VectorUtil.dotProduct
SparseFixedBitSet.get
NeighborQueue.topScore
LongHeap.top
SparseFixedBitSet.set
SparseFixedBitSet.insertLong
NeighborQueue.insertWithOverflow
LongHeap.insertWithOverflow
LongHeap.updateTop
LongHeap.downHeap
VectorWriter$VectorValuesMerger$MergerRandomAccess.vectorValue
NeighborQueue.pop
LongHeap.pop
LongHeap.downHeap
HnswGraphBuilder.addDiverseNeighbors
HnswGraphBuilder.diversityUpdate
HnswGraphBuilder.findNonDiverse
VectorValues$SearchStrategy.compare
VectorUtil.dotProduct
HnswGraphBuilder.selectDiverse
HnswGraphBuilder.diversityCheck
VectorValues$SearchStrategy.compare
VectorUtil.dotProduct
FreqProxTermsWriter.flush
PerFieldPostingsFormat$FieldsWriter.write
BlockTreeTermsWriter.write
BlockTreeTermsWriter$TermsWriter.write
PushPostingsWriterBase.writeTerm
Lucene84PostingsWriter.startDoc
FreqProxFields$FreqProxPostingsEnum.nextDoc
Lucene84PostingsWriter.addPosition
PForUtil.encode
BlockTreeTermsWriter$TermsWriter.pushTerm
BlockTreeTermsWriter$TermsWriter.writeBlocks
TermsHashPerField.sortTerms
BytesRefHash.sort
MSBRadixSorter.sort
MSBRadixSorter.sort
MSBRadixSorter.radixSort
MSBRadixSorter.sort
MSBRadixSorter.radixSort
MSBRadixSorter.sort
MSBRadixSorter.radixSort
MSBRadixSorter.sort
MSBRadixSorter.radixSort
MSBRadixSorter.sort
MSBRadixSorter.computeCommonPrefixLengthAndBuildHistogram
MSBRadixSorter.reorder
DocumentsWriterPerThread.updateDocuments
IndexingChain.processDocument
IndexingChain.processField
IndexingChain$PerField.invert
TermsHashPerField.add
BytesRefHash.add
BytesRefHash.findHash
BytesRefHash.equals
ByteBlockPool.setBytesRef
BytesRef.bytesEquals
Arrays.equals
BytesRefHash.doHash
StringHelper.murmurhash3_x86_32
BytesRefHash.rehash
TermsHashPerField.positionStreamSlice
FreqProxTermsWriterPerField.addTerm
TermsHashPerField.writeVInt
TermsHashPerField.writeByte
FreqProxTermsWriterPerField.writeProx
TermsHashPerField.writeVInt
TermsHashPerField.writeByte
FilteringTokenFilter.incrementToken
LowerCaseFilter.incrementToken
StandardTokenizer.incrementToken
StandardTokenizerImpl.getNextToken
Character.codePointAt
Character.codePointAtImpl
StandardTokenizerImpl.zzRefill
AttributeSource.clearAttributes
PackedTokenAttributeImpl.clear
Field.tokenStream
Field$StringTokenStream.<init>
TokenStream.<init>
AttributeSource.<init>
Field$StringTokenStream.incrementToken
AttributeSource.clearAttributes
PackedTokenAttributeImpl.clear
CharTermAttributeImpl.getBytesRef
BytesRefBuilder.copyChars
UnicodeUtil.UTF16toUTF8
LineFileDocs.nextDoc
SimpleDateFormat.parse
SimpleDateFormat.subParse
ConcurrentMergeScheduler$MergeThread.run
ConcurrentMergeScheduler.doMerge
IndexWriter$IndexWriterMergeSource.merge
IndexWriter.merge
IndexWriter.mergeMiddle
SegmentMerger.merge
SegmentMerger.mergeWithLogging
SegmentMerger$$Lambda$148.2073157208.merge
SegmentMerger.lambda$merge$4
SegmentMerger.mergeVectorValues
VectorWriter.merge
VectorWriter.mergeVectors
Lucene90VectorWriter.writeField
Lucene90VectorWriter.writeGraph
HnswGraphBuilder.build
HnswGraphBuilder.addGraphNode
HnswGraph.search
SparseFixedBitSet.<init>
VectorValues$SearchStrategy.compare
VectorUtil.dotProduct
VectorWriter$VectorValuesMerger$MergerRandomAccess.vectorValue
Lucene90VectorReader$OffHeapVectorValues.vectorValue
Arrays.binarySearch
NeighborQueue.add
LongHeap.push
ArrayUtil.grow
SparseFixedBitSet.set
HnswGraphBuilder.addDiverseNeighbors
HnswGraphBuilder.selectDiverse
HnswGraphBuilder.diversityCheck
SegmentMerger$$Lambda$142.1852991137.merge
SegmentMerger.lambda$merge$1
SegmentMerger.mergeTerms
PerFieldPostingsFormat$FieldsWriter.merge
FieldsConsumer.merge
BlockTreeTermsWriter.write
BlockTreeTermsWriter$TermsWriter.write
PushPostingsWriterBase.writeTerm
MappedMultiFields$MappedMultiTermsEnum.postings
MappingMultiPostingsEnum.<init>
Lucene84PostingsWriter.startDoc
LineFileDocs$1.run
LineFileDocs.readDocs
BufferedReader.readLine
BufferedReader.readLine
BufferedReader.fill
InputStreamReader.read
StreamDecoder.read
StreamDecoder.implRead
CharsetDecoder.decode
UTF_8$Decoder.decodeLoop
UTF_8$Decoder.decodeArrayLoop
Indexer.main
Indexer._main
Indexer.countUniqueTerms
DirectoryReader.open
DirectoryReader.open
IndexWriter.getReader
DocumentsWriter.flushAllThreads
DocumentsWriter.doFlush
DocumentsWriterPerThread.flush
IndexingChain.flush
IndexingChain.writeVectors
VectorValuesWriter.flush
Lucene90VectorWriter.writeField
Lucene90VectorWriter.writeGraph
HnswGraphBuilder.build
HnswGraphBuilder.addGraphNode
HnswGraph.search

In order to make sense of this drill-down, it helps to know that Lucene indexing works by first indexing documents into small, immutable segments, and then merging those segments into larger segments.
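Conceptually, since each segment keeps its postings sorted by doc ID, merging segments resembles merging sorted lists. The toy sketch below illustrates the idea only; it is not Lucene's actual merge code, and real merges also remap doc IDs and rewrite postings, stored fields, and so on.

```java
import java.util.Arrays;

// Toy illustration of segment merging: each "segment" is a sorted
// array of doc IDs, and merging produces one larger sorted segment.
// This is a conceptual sketch, not Lucene's actual merge code.
public class SegmentMergeSketch {
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        // standard two-way merge of sorted inputs
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        int[] seg1 = {1, 4, 7};
        int[] seg2 = {2, 5, 9};
        System.out.println(Arrays.toString(merge(seg1, seg2))); // [1, 2, 4, 5, 7, 9]
    }
}
```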

Knowing that, we can figure out that roughly 72% of the CPU time is spent on the initial indexing into segments and about 25% is spent merging. These proportions vary over time; in the beginning, much less merging happens. This also accounts for the varying CPU usage in the chart above; in this benchmark, the initial indexing is limited to one thread.

HNSWhat?

Interestingly, in both the initial indexing and the merging, a significant portion of the CPU time (remember that we are CPU-bound) is spent building something called an HNSW graph. So, what is that?

It turns out that Hierarchical Navigable Small World graphs are data structures useful for searching based on document similarity. They were recently merged into Lucene's master branch. They seem to have been added to the Lucene nightly benchmarks on 2020-12-09, promptly causing the indexing throughput to drop abruptly. This is consistent with what the profiler tells us. [4]

That the HNSW graphs are included in the indexing benchmark is of course great news for anyone wanting to use them in the future; it highlights performance issues in the implementation. But it is probably a bit problematic that they make up such a big part of the benchmark: any other indexing performance degradation or improvement will appear much less significant. So, for users who do not want HNSW graphs, the benchmarks become less useful.

HNSW.search (and its children)

A large part of the time spent building HNSW graphs seems to come from a method called HnswGraph.search, which finds the neighbours of a document in order to insert it in the correct place in the graph. The same method is also used at search time.

A lot of time seems to be spent dealing with SparseFixedBitSets, which here seem to be used to keep track of which nodes have already been visited (nodes are identified by Java ints). If I were to try to improve the performance, I would explore whether a simpler BitSet implementation might help. My understanding is that the BitSet's size is O(num_docs) in a segment (when used for indexing). A naive, dense BitSet uses O(entries) bits; even for relatively large indices this might be fine, especially if the BitSet were re-used while building the graph.
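To make the "allocate once, clear and re-use" idea concrete, here is a minimal dense bit set sketch. This is not Lucene's FixedBitSet or SparseFixedBitSet; it only illustrates that a dense long[]-backed visited set costs one bit per node and can be cleared between insertions instead of being re-allocated per search.

```java
// Minimal dense bit set, illustrating the "allocate once, clear and
// re-use" idea discussed above. Not Lucene's FixedBitSet; just a sketch.
public class DenseBitSet {
    private final long[] words;

    DenseBitSet(int numBits) {
        // one bit per possible node id: O(numBits) bits, allocated once
        words = new long[(numBits + 63) >>> 6];
    }

    void set(int i)    { words[i >>> 6] |= 1L << (i & 63); }
    boolean get(int i) { return (words[i >>> 6] & (1L << (i & 63))) != 0; }

    // Re-used across graph insertions instead of allocating a fresh
    // visited-set for every call to search().
    void clear() { java.util.Arrays.fill(words, 0L); }

    public static void main(String[] args) {
        DenseBitSet visited = new DenseBitSet(10_000_000); // ~1.2 MB for 10M nodes
        visited.set(123);
        System.out.println(visited.get(123)); // true
        visited.clear();
        System.out.println(visited.get(123)); // false
    }
}
```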

Another code path where a lot of time is spent is the dot product that measures how similar two documents are. The inner loop here is unrolled, indicating that the author probably knew this was a hot loop. One thing I am curious about is whether the author considered Math.fma - in theory it should be faster on CPUs with FMA instructions. [5]
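For illustration, this is roughly what an FMA-based dot product could look like. This is not Lucene's VectorUtil.dotProduct (which is unrolled and has no FMA); it is only a sketch of the idea, and note that on CPUs without hardware FMA support, Math.fma can be much slower than a plain multiply-add.

```java
// Sketch of a dot product using Math.fma (fused multiply-add).
// Not Lucene's VectorUtil.dotProduct; just an illustration. On CPUs
// without FMA instructions, Math.fma may fall back to a slower path,
// so this is only a potential win on some hardware.
public class FmaDot {
    static float dot(float[] a, float[] b) {
        float acc = 0f;
        for (int i = 0; i < a.length; i++) {
            acc = Math.fma(a[i], b[i], acc); // acc += a[i] * b[i], fused
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {4f, 5f, 6f};
        System.out.println(dot(a, b)); // 32.0
    }
}
```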

Surprises

The observant reader might have noticed that part of the CPU time spent estimating the RAM usage of SparseFixedBitSets was reported in the LinkedHashMap.get method. This is weird, since the RamUsageEstimator.shallowSizeOfArray method does not use a LinkedHashMap. It does use an IdentityHashMap - but those are not the same thing. Java Mission Control reports the same thing. This is rather disturbing - I believe it is an instance of OpenJDK bug JDK-8201516, which says that the debug symbols emitted by the JVM are sometimes wrong. [6]

So, can we trust Flight Recorder? It seems the answer is no, at least not to 100%. On the other hand, I have used it for many years and have been able to predict how to improve performance with it. My personal guess is that it is decently accurate from a bird's-eye view, but that it might mess up a bit when the JIT starts inlining. But what do I know? I've never touched the OpenJDK source code. Maybe I should.

Searching

Search performance is arguably more important than indexing performance; after all, that's why we build indices (as opposed to just using grep). Searches run on 2 cores (out of 4) in this benchmark, and the profiling data makes it evident that this, too, is CPU-bound:

[Chart: CPU utilization (0-100%), 14:55 to 15:35]

The test desktop has a decently new SSD, but nothing fancy. It might thus be surprising that CPU seems to be the scarcest resource; one might assume that we would be more IO-bound.

Since the search benchmark tests many different kinds of queries, there is no single obvious source of slowness to point out:

Total
TaskThreads$TaskThread.run
SearchTask.go
IndexSearcher.search
IndexSearcher.searchAfter
IndexSearcher.search
IndexSearcher.search
IndexSearcher.search
BulkScorer.score
Weight$DefaultBulkScorer.score
Weight$DefaultBulkScorer.scoreAll
PhraseScorer$1.matches
SloppyPhraseMatcher.reset
SloppyPhraseMatcher.initPhrasePositions
SloppyPhraseMatcher.initSimple
PhrasePositions.firstPosition
PhrasePositions.nextPosition
Lucene84PostingsReader$BlockImpactsPostingsEnum.nextPosition
Lucene84PostingsReader$BlockImpactsPostingsEnum.skipPositions
Lucene84PostingsReader$BlockImpactsPostingsEnum.refillPositions
PForUtil.decode
ForUtil.decode
PriorityQueue.add
PriorityQueue.upHeap
SloppyPhraseMatcher.nextMatch
SloppyPhraseMatcher.advancePP
PhrasePositions.nextPosition
PriorityQueue.pop
SloppyPhraseMatcher.maxFreq
ExactPhraseMatcher.reset
ConjunctionDISI.nextDoc
ConjunctionDISI.doNext
Lucene84PostingsReader$BlockImpactsPostingsEnum.advance
Lucene84PostingsReader.findFirstGreater
Lucene84PostingsReader$BlockImpactsPostingsEnum.refillDocs
TermSpans.advance
Lucene84PostingsReader$EverythingEnum.advance
MultiLevelSkipListReader.skipTo
Lucene84PostingsReader$EverythingEnum.refillDocs
Lucene84PostingsReader$BlockImpactsPostingsEnum.nextDoc
Lucene84PostingsReader$BlockImpactsPostingsEnum.advance
BlockMaxConjunctionScorer$2.nextDoc
BlockMaxConjunctionScorer$2.advance
BlockMaxConjunctionScorer$2.doNext
ImpactsDISI.advance
Lucene84PostingsReader$BlockImpactsDocsEnum.advance
Lucene84PostingsReader$BlockImpactsDocsEnum.refillDocs
ImpactsDISI.advanceTarget
ImpactsDISI.advanceShallow
Lucene84PostingsReader$BlockImpactsDocsEnum.advanceShallow
Lucene84ScoreSkipReader.skipTo
MultiLevelSkipListReader.skipTo
WANDScorer$1.advance
WANDScorer.pushBackLeads
ImpactsDISI.advance
Lucene84PostingsReader$BlockImpactsDocsEnum.advance
WANDScorer.moveToNextCandidate
WANDScorer.updateMaxScoresIfNecessary
BlockMaxConjunctionScorer$2.advanceTarget
TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect
BlockMaxConjunctionScorer.score
TermScorer.score
LeafSimScorer.score
PhraseScorer.score
WANDScorer.score
TermScorer.score
SpanScorer.score
WANDScorer$2.matches
WANDScorer.advanceTail
WANDScorer.advanceTail
ImpactsDISI.advance
Lucene84PostingsReader$BlockImpactsDocsEnum.advance
Lucene84PostingsReader$BlockImpactsDocsEnum.refillDocs
ImpactsDISI.advanceTarget
WANDScorer.popTail
IntervalScorer$1.matches
IntervalFilter.nextInterval
OrderedIntervalsSource$OrderedIntervalIterator.nextInterval
TermIntervalsSource$1.nextInterval
Lucene84PostingsReader$EverythingEnum.nextPosition
Lucene84PostingsReader$EverythingEnum.skipPositions
Lucene84PostingsReader$EverythingEnum.refillPositions
PForUtil.decode
ByteBufferIndexInput$SingleBufferImpl.seek
IntervalFilter.nextDoc
ConjunctionIntervalIterator.nextDoc
ConjunctionDISI.nextDoc
ConjunctionDISI.doNext
TermIntervalsSource$1.advance
Lucene84PostingsReader$EverythingEnum.advance
MultiLevelSkipListReader.skipTo
Lucene84PostingsReader$EverythingEnum.refillDocs
OrderedIntervalsSource$OrderedIntervalIterator.reset
TermIntervalsSource$1.nextInterval
Lucene84PostingsReader$EverythingEnum.nextPosition
ConjunctionSpans$1.matches
NearSpansOrdered.twoPhaseCurrentDocMatches
NearSpansOrdered.stretchToOrder
NearSpansOrdered.advancePosition
TermSpans.nextStartPosition
Lucene84PostingsReader$EverythingEnum.nextPosition
Lucene84PostingsReader$EverythingEnum.skipPositions
Lucene84PostingsReader$EverythingEnum.refillPositions
PForUtil.decode
TermSpans.nextStartPosition
Lucene84PostingsReader$EverythingEnum.nextPosition
WANDScorer$1.nextDoc
WANDScorer$1.advance
WANDScorer.pushBackLeads
WANDScorer.insertTailWithOverFlow
WANDScorer.addTail
WANDScorer.moveToNextCandidate
ImpactsDISI.nextDoc
ImpactsDISI.advance
ConjunctionDISI.advance
ConjunctionDISI.doNext
Lucene84PostingsReader$BlockImpactsPostingsEnum.advance
ImpactsDISI.advanceTarget
TopFieldCollector$SimpleFieldCollector$1.collect
TopFieldCollector$TopFieldLeafCollector.thresholdCheck
FieldComparator$TermOrdValComparator.compareBottom
FieldComparator$TermOrdValComparator.getOrdForDoc
Lucene80DocValuesProducer$21.ordValue
MultiTermQueryConstantScoreWrapper$1.bulkScorer
MultiTermQueryConstantScoreWrapper$1.rewrite
DocIdSetBuilder.add
FixedBitSet.or
BitSet.or
Lucene84PostingsReader$BlockDocsEnum.nextDoc
IntersectTermsEnum.next
Weight.bulkScorer
PointRangeQuery$1.scorer
PointRangeQuery$1$4.get
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.addAll
BKDReader.addAll
BKDReader.addAll
BKDReader.addAll
BKDReader.addAll
BKDReader.addAll
IndexSearcher.rewrite
MultiTermQuery.rewrite
TopTermsRewrite.rewrite
TermCollectingRewrite.collectTerms
FuzzyTermsEnum.next
IntersectTermsEnum.next
IntersectTermsEnum._next
IntersectTermsEnum.pushFrame
IndexSearcher.search
BulkScorer.score
Weight$DefaultBulkScorer.score
Weight$DefaultBulkScorer.scoreAll
FirstPassGroupingCollector.collect
FirstPassGroupingCollector.isCompetitive
FieldComparator$RelevanceComparator.compareBottom
ScoreCachingWrappingScorer.score
TermScorer.score
LeafSimScorer.score
BM25Similarity$BM25Scorer.score
LeafSimScorer.getNormValue
Lucene80NormsProducer$3.longValue
TermGroupSelector.advanceTo
SecondPassGroupingCollector.collect
TermGroupSelector.advanceTo
Lucene80DocValuesProducer$21.ordValue
DirectReader$DirectPackedReader20.get
HashMap.containsKey
HashMap.getNode
GroupReducer.collect
BlockGroupingCollector.collect
TermScorer.score
LeafSimScorer.score
FieldComparator$RelevanceComparator.compareBottom
ScoreCachingWrappingScorer.score
TermScorer.score
Lucene84PostingsReader$BlockDocsEnum.nextDoc
Lucene84PostingsReader$BlockDocsEnum.refillDocs
ForDeltaUtil.decodeAndPrefixSum
ForUtil.decodeAndPrefixSum
SimpleCollector.getLeafCollector
FirstPassGroupingCollector.doSetNextReader
TermGroupSelector.setNextReader
Lucene80DocValuesProducer$BaseSortedDocValues.lookupTerm
Lucene80DocValuesProducer$TermsDict.seekCeil
FastTaxonomyFacetCounts.<init>
FastTaxonomyFacetCounts.countAll
Lucene80DocValuesProducer$15.binaryValue
DirectMonotonicReader.get
DirectReader$DirectPackedReader12.get
ByteBufferIndexInput$SingleBufferImpl.readShort
ByteBufferGuard.getShort
DirectByteBuffer.getShort
ByteBufferIndexInput.readBytes
ByteBufferGuard.getBytes
DirectByteBuffer.get
ByteBuffer.get
DirectByteBuffer.get
ByteBufferIndexInput$SingleBufferImpl.seek
IntTaxonomyFacets.increment
IntTaxonomyFacets.increment
FastTaxonomyFacetCounts.count
Lucene80DocValuesProducer$15.binaryValue
DirectMonotonicReader.get
SortedSetDocValuesFacetCounts.<init>
SortedSetDocValuesFacetCounts.<init>
SortedSetDocValuesFacetCounts.countAll
SortedSetDocValuesFacetCounts.countOneSegment
SingletonSortedSetDocValues.nextDoc
Lucene80DocValuesProducer$21.ordValue
DirectReader$DirectPackedReader12.get
DirectReader$DirectPackedReader4.get
FacetsCollector.search
FacetsCollector.doSearch
IndexSearcher.search
IndexSearcher.search
BulkScorer.score
Weight$DefaultBulkScorer.score
Weight$DefaultBulkScorer.scoreAll
MultiCollector$MultiLeafCollector.collect
TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect
FilterScorable.score
ScoreCachingWrappingScorer.score
TermScorer.score
LeafSimScorer.score
RespellTask.go
DirectSpellChecker.suggestSimilar
DirectSpellChecker.suggestSimilar
DirectSpellChecker.suggestSimilar
FuzzyTermsEnum.next
MultiTermsEnum.next
MultiTermsEnum.pushTop
IntersectTermsEnum.next
IntersectTermsEnum._next
PKLookupTask.go
SearchPerfTest.main
SearchPerfTest._main

This points to a fundamental rule of performance: it depends on the input. Different kinds of Lucene queries end up in different code paths and have vastly different performance characteristics. These nightly benchmarks are very good at predicting the speed of e.g. certain phrase queries, but they can't predict why your Kibana instance is slow.

Memory

Memory allocation in these nightly benchmarks is very low, at only about 115 MB/s. The worst I have seen from Lucene is somewhere around 3 GB/s - obviously with very different queries.

By looking at the allocation breakdown, we can spot one way to cut the allocations roughly in half:

Total
TaskThreads$TaskThread.run
SearchTask.go
IndexSearcher.search
IndexSearcher.searchAfter
IndexSearcher.search
IndexSearcher.search
IndexSearcher.search
BulkScorer.score
Weight$DefaultBulkScorer.score
Weight$DefaultBulkScorer.scoreAll
ImpactsDISI.nextDoc
ImpactsDISI.advance
ImpactsDISI.advanceTarget
MaxScoreCache.getMaxScoreForLevel
ExactPhraseMatcher$1$1.getImpacts
ArrayList.add
ArrayList.add
ArrayList.grow
ArrayList.grow
Arrays.copyOf
ExactPhraseMatcher$1$SubIterator.<init>
AbstractList.iterator
MaxScoreCache.getSkipUpTo
MaxScoreCache.getSkipLevel
MaxScoreCache.getMaxScoreForLevel
ExactPhraseMatcher$1.getImpacts
ExactPhraseMatcher$1$1.getImpacts
BlockMaxConjunctionScorer$2.nextDoc
BlockMaxConjunctionScorer$2.advance
BlockMaxConjunctionScorer$2.doNext
WANDScorer$1.advance
WANDScorer.moveToNextCandidate
WANDScorer.updateMaxScoresIfNecessary
WANDScorer.updateMaxScores
DisiPriorityQueue.iterator
AbstractList$SubList.iterator
AbstractList.listIterator
AbstractList$SubList.listIterator
AbstractList$SubList$1.<init>
AbstractList.listIterator
AbstractList.subList
Arrays.asList
TermScorer.getMaxScore
ImpactsDISI.getMaxScore
MaxScoreCache.getMaxScoreForLevel
MaxScoreCache.computeMaxScore
AbstractList.iterator
BlockMaxConjunctionScorer$2.advanceTarget
BlockMaxConjunctionScorer$2.moveToNextBlock
BlockMaxConjunctionScorer.getMaxScore
TermScorer.getMaxScore
ImpactsDISI.getMaxScore
MaxScoreCache.getMaxScoreForLevel
MaxScoreCache.computeMaxScore
AbstractList.iterator
WANDScorer$1.nextDoc
WANDScorer$1.advance
WANDScorer.moveToNextCandidate
WANDScorer.updateMaxScoresIfNecessary
WANDScorer.updateMaxScores
DisiPriorityQueue.iterator
AbstractList$SubList.iterator
AbstractList.listIterator
AbstractList$SubList.listIterator
TopFieldCollector$SimpleFieldCollector$1.collect
TopFieldCollector$TopFieldLeafCollector.collectCompetitiveHit
LongComparator$LongLeafComparator.setBottom
NumericComparator$NumericLeafComparator.setBottom
NumericComparator$NumericLeafComparator.updateCompetitiveIterator
BKDReader.estimatePointCount
BKDReader.getIntersectState
WANDScorer$2.matches
WANDScorer.advanceTail
WANDScorer.advanceTail
ImpactsDISI.advance
ImpactsDISI.advanceTarget
MaxScoreCache.getMaxScoreForLevel
MaxScoreCache.computeMaxScore
AbstractList.iterator
ConjunctionDISI.nextDoc
TopFieldCollector$TopFieldLeafCollector.setScorer
NumericComparator$NumericLeafComparator.setScorer
NumericComparator$NumericLeafComparator.updateCompetitiveIterator
BooleanWeight.bulkScorer
Weight.bulkScorer
BooleanWeight.scorer
BooleanWeight.scorerSupplier
Weight.scorerSupplier
TermQuery$TermWeight.scorer
SegmentTermsEnum.impacts
Lucene84PostingsReader.impacts
Lucene84PostingsReader.postings
Lucene84PostingsReader$BlockDocsEnum.<init>
Lucene84PostingsReader$BlockImpactsDocsEnum.<init>
TermQuery$TermWeight.getTermsEnum
FieldReader.iterator
SegmentTermsEnum.<init>
BooleanWeight.scorerSupplier
Weight.scorerSupplier
TermQuery$TermWeight.scorer
SegmentTermsEnum.impacts
Lucene84PostingsReader.impacts
Lucene84PostingsReader$BlockImpactsDocsEnum.<init>
MultiTermQueryConstantScoreWrapper$1.bulkScorer
MultiTermQueryConstantScoreWrapper$1.rewrite
DocIdSetBuilder.add
DocIdSetBuilder.grow
DocIdSetBuilder.upgradeToBitSet
FixedBitSet.<init>
DocIdSetBuilder.ensureBufferCapacity
DocIdSetBuilder.addBuffer
DocIdSetBuilder$Buffer.<init>
IntersectTermsEnum.next
IntersectTermsEnum._next
IntersectTermsEnum.pushFrame
FST.findTargetArc
Weight.bulkScorer
PointRangeQuery$1.scorer
PointRangeQuery$1$4.get
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
BKDReader.intersect
FixedBitSet.<init>
TermQuery$TermWeight.scorer
TermQuery$TermWeight.getTermsEnum
TermStates.get
TermStates.loadTermsEnum
SegmentTermsEnum.seekExact
PhraseWeight.scorer
PhraseQuery$1.getPhraseMatcher
SegmentTermsEnum.impacts
Lucene84PostingsReader.impacts
Lucene84PostingsReader$BlockImpactsPostingsEnum.<init>
IntervalQuery$IntervalWeight.scorer
FilteredIntervalsSource.intervals
ConjunctionIntervalsSource.intervals
TermIntervalsSource.intervals
IndexSearcher.createWeight
BooleanQuery.createWeight
BooleanWeight.<init>
IndexSearcher.createWeight
TermQuery.createWeight
TermStates.build
TermStates.loadTermsEnum
SegmentTermsEnum.seekExact
SegmentTermsEnum.pushFrame
SegmentTermsEnum.getFrame
SegmentTermsEnumFrame.<init>
BooleanQuery.createWeight
BooleanWeight.<init>
IndexSearcher.createWeight
TermQuery.createWeight
TermStates.build
TermStates.loadTermsEnum
SegmentTermsEnum.seekExact
PhraseQuery.createWeight
PhraseQuery$1.<init>
PhraseWeight.<init>
PhraseQuery$1.getStats
TermStates.build
TermStates.loadTermsEnum
IndexSearcher.rewrite
MultiTermQuery.rewrite
TopTermsRewrite.rewrite
TermCollectingRewrite.collectTerms
FuzzyTermsEnum.next
IntersectTermsEnum.next
IntersectTermsEnum._next
IntersectTermsEnum.pushFrame
FST.findTargetArc
FST.readArcByDirectAddressing
FST.readArcByDirectAddressing
FST.readArc
Outputs.readFinalOutput
ByteSequenceOutputs.read
ByteSequenceOutputs.read
MultiTermQuery$RewriteMethod.getTermsEnum
FuzzyQuery.getTermsEnum
FuzzyTermsEnum.<init>
FuzzyTermsEnum.<init>
IndexSearcher.search
BulkScorer.score
Weight$DefaultBulkScorer.score
Weight$DefaultBulkScorer.scoreAll
SecondPassGroupingCollector.collect
TermGroupSelector.advanceTo
Integer.valueOf
Weight.bulkScorer
TermQuery$TermWeight.scorer
SegmentTermsEnum.postings
Lucene84PostingsReader.postings
Lucene84PostingsReader$BlockDocsEnum.<init>
SimpleCollector.getLeafCollector
IndexSearcher.createWeight
TermQuery.createWeight
TermStates.build
TermStates.loadTermsEnum
SegmentTermsEnum.seekExact
SegmentTermsEnum.pushFrame
SegmentTermsEnum.getFrame
SegmentTermsEnumFrame.<init>
FacetsCollector.search
FacetsCollector.doSearch
IndexSearcher.search
IndexSearcher.search
BulkScorer.score
Weight$DefaultBulkScorer.score
Weight$DefaultBulkScorer.scoreAll
MultiCollector$MultiLeafCollector.collect
FacetsCollector.collect
DocIdSetBuilder.grow
DocIdSetBuilder.upgradeToBitSet
FixedBitSet.<init>
DocIdSetBuilder.ensureBufferCapacity
DocIdSetBuilder.addBuffer
DocIdSetBuilder$Buffer.<init>
RespellTask.go
DirectSpellChecker.suggestSimilar
DirectSpellChecker.suggestSimilar
DirectSpellChecker.suggestSimilar
FuzzyTermsEnum.next
MultiTermsEnum.next
MultiTermsEnum.pushTop
IntersectTermsEnum.next
IntersectTermsEnum._next
IntersectTermsEnum.pushFrame
FST.findTargetArc
FST.readArcByDirectAddressing
FST.readArcByDirectAddressing
FST.readArc
FuzzyTermsEnum.<init>
FuzzyTermsEnum.<init>
FuzzyTermsEnum.bottomChanged
FuzzyTermsEnum.getAutomatonEnum
MultiTerms.intersect
PKLookupTask.go
SegmentTermsEnum.seekExact
FST.findTargetArc
FST.readArcByDirectAddressing
FST.readArcByDirectAddressing
FST.readArc
Outputs.readFinalOutput
ByteSequenceOutputs.read
ByteSequenceOutputs.read
SearchPerfTest.main
SearchPerfTest._main
TaskParser.<init>
VectorDictionary.<init>
VectorDictionary.parseLine
String.split
String.split
String.substring
StringLatin1.newString
Float.parseFloat
FloatingDecimal.parseFloat
FloatingDecimal.readJavaFormatString
SearchTask.printResults
IndexSearcher.doc
IndexReader.document
BaseCompositeReader.document
CodecReader.document
CompressingStoredFieldsReader.visitDocument
CompressingStoredFieldsReader.document
CompressingStoredFieldsReader$BlockState.document
LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress
ArrayUtil.grow
ArrayUtil.growExact

The TermGroupSelector does some kind of grouping of terms, where each term is identified by an integer:

private final Map<Integer, Integer> ordsToGroupIds = new HashMap<>();

Now, when you insert an int into a regular Java collection, it is boxed into an Integer object, since collections cannot hold primitives. This boxing is what causes 50% of the memory allocations. There are libraries that provide collections for primitives, e.g. fastutil; using something like that could help. [7]
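To illustrate the difference, the sketch below contrasts a boxing HashMap<Integer, Integer> with a tiny hand-rolled primitive int-to-int map. The hand-rolled map is only a dependency-free stand-in for something like fastutil's Int2IntOpenHashMap; it deliberately omits resizing and deletion.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the boxing problem: each put() into a
// Map<Integer, Integer> allocates Integer objects for keys and values
// outside the small cached range (-128..127). A primitive-specialized
// map (e.g. fastutil's Int2IntOpenHashMap, sketched here from scratch
// to stay dependency-free) avoids those per-entry allocations.
public class IntIntMapSketch {
    private final int[] keys, vals;
    private final boolean[] used;

    IntIntMapSketch(int capacity) {
        keys = new int[capacity];
        vals = new int[capacity];
        used = new boolean[capacity];
    }

    // linear probing; no resizing or deletion, for brevity
    void put(int key, int value) {
        int i = Math.floorMod(key, keys.length);
        while (used[i] && keys[i] != key) i = (i + 1) % keys.length;
        used[i] = true; keys[i] = key; vals[i] = value;
    }

    int getOrDefault(int key, int dflt) {
        int i = Math.floorMod(key, keys.length);
        while (used[i]) {
            if (keys[i] == key) return vals[i];
            i = (i + 1) % keys.length;
        }
        return dflt;
    }

    public static void main(String[] args) {
        // boxing version: allocates Integer objects for large keys/values
        Map<Integer, Integer> boxed = new HashMap<>();
        boxed.put(100_000, 7);

        // primitive version: no per-entry object allocation
        IntIntMapSketch prim = new IntIntMapSketch(64);
        prim.put(100_000, 7);
        System.out.println(prim.getOrDefault(100_000, -1)); // 7
    }
}
```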

Thoughts

This has mostly been me spelunking into code that I don't work with. Reading code is great for learning how to build systems - maybe looking at many profiling results is a good way to learn how to work on performance?

As mentioned, I have worked with Lucene before. The profiling data from that work looked nothing like the profiling data in this benchmark; it allocated way more memory. Both the documents and the queries were different.

Choosing what to include in a benchmark specifies what kind of performance your project cares about. A reasonable approach is to guess what kinds of queries are most common in real-world usage; I believe this is what Lucene does. But it also means that fringe use-cases are not always covered. If you have problems with Lucene, Elasticsearch, or for that matter _any_ program, I highly encourage you to profile it.

Finally, I would like to thank everyone who has contributed to Lucene. It's a great library.

Footnotes

If you are more curious, there are more complete Blunders views of both the indexing profile and the search profile. If you want to use Blunders for your application, please get in touch.

References

  1. Apache Lucene
  2. Lucene nightly benchmarks
  3. Java Flight recorder
  4. LUCENE-9004
  5. Boosting Java Performance (with Math.fma)
  6. JDK-8201516
  7. fastutil