Class ES94BloomFilterDocValuesFormat
- All Implemented Interfaces:
org.apache.lucene.util.NamedSPILoader.NamedSPI
The field used to build the Bloom filter is not stored, as only its presence needs to be tracked for filtering purposes. This reduces storage overhead while maintaining the ability to quickly determine if a segment might contain the field.
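To illustrate why tracking presence alone is sufficient, here is a minimal, generic Bloom filter sketch. It is not this class's implementation: the class name `TinyBloomFilterSketch`, the `long[]` bit storage, and the double-hashing scheme built on `String.hashCode()` are all assumptions made for illustration. The filter keeps only bits, never the values themselves, so a lookup can answer "definitely absent" or "possibly present".

```java
final class TinyBloomFilterSketch {
    private final long[] words;          // bit storage, 64 bits per word
    private final int numBits;           // total number of addressable bits
    private final int numHashFunctions;  // k probes per value

    TinyBloomFilterSketch(int numBits, int numHashFunctions) {
        this.words = new long[(numBits + 63) / 64];
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    /** Sets the k bits derived from the value; the value itself is not kept. */
    void add(String value) {
        int h1 = value.hashCode();
        int h2 = mix(h1);
        for (int i = 0; i < numHashFunctions; i++) {
            int bit = index(h1, h2, i);
            words[bit >>> 6] |= 1L << (bit & 63);
        }
    }

    /** Returns false if the value was definitely never added, true if it may have been. */
    boolean mightContain(String value) {
        int h1 = value.hashCode();
        int h2 = mix(h1);
        for (int i = 0; i < numHashFunctions; i++) {
            int bit = index(h1, h2, i);
            if ((words[bit >>> 6] & (1L << (bit & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th probe is h1 + i * h2, reduced into [0, numBits).
    private int index(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Hypothetical second hash derived from the first; forced odd so probes differ.
    private static int mix(int h) {
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h | 1;
    }
}
```

A value that was added always reports `true` from `mightContain`; a value that was never added usually reports `false`, with a false-positive rate that depends on how many bits are set.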
File formats
Bloom filter doc values are represented by two files:
- A bloom filter data file (extension .sfbf). This file stores the bloom filter bitset.
- A bloom filter meta file (extension .sfbfm). This file stores metadata about the bloom filters stored in the bloom filter data file. The in-memory representation can be found in ES94BloomFilterDocValuesFormat.BloomFilterMetadata.
-
Field Summary
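Fields
Modifier and Type  Field
static final double  DEFAULT_HIGH_BITS_PER_DOC
static final int  DEFAULT_LARGE_SEGMENT_MIN_DOCS
static final double  DEFAULT_LOW_BITS_PER_DOC
static final int  DEFAULT_NUM_HASH_FUNCTIONS
static final int  DEFAULT_SMALL_SEGMENT_MAX_DOCS
static final String  FORMAT_NAME
static final double  MAX_BITS_PER_DOC
static final ByteSizeValue  MAX_BLOOM_FILTER_SIZE
static final int  MAX_NUM_HASH_FUNCTIONS
static final double  MIN_BITS_PER_DOC
static final ByteSizeValue  MIN_BLOOM_FILTER_SIZE
static final int  MIN_SEGMENT_DOCS
static final String  STORED_FIELDS_BLOOM_FILTER_EXTENSION
static final String  STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION
-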
Constructor Summary
Constructors
Constructor
ES94BloomFilterDocValuesFormat(BigArrays bigArrays, String bloomFilterFieldName)
ES94BloomFilterDocValuesFormat(BigArrays bigArrays, String bloomFilterFieldName, boolean optimizedMergeEnabled, int numHashFunctions, int smallSegmentMaxDocs, int largeSegmentMinDocs, double highBitsPerDoc, double lowBitsPerDoc, ByteSizeValue maxBloomFilterSize)
-
Method Summary
Modifier and Type  Method  Description
int  bloomFilterSizeInBytesForNewSegment(int numDocs)  Computes the bloom filter size in bytes for a newly created segment.
org.apache.lucene.codecs.DocValuesConsumer  fieldsConsumer(org.apache.lucene.index.SegmentWriteState state)
org.apache.lucene.codecs.DocValuesProducer  fieldsProducer(org.apache.lucene.index.SegmentReadState state)
Methods inherited from class org.apache.lucene.codecs.DocValuesFormat
availableDocValuesFormats, forName, getName, reloadDocValuesFormats, toString
-
Field Details
-
FORMAT_NAME
-
STORED_FIELDS_BLOOM_FILTER_EXTENSION
-
STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION
-
DEFAULT_NUM_HASH_FUNCTIONS
public static final int DEFAULT_NUM_HASH_FUNCTIONS
-
MAX_NUM_HASH_FUNCTIONS
public static final int MAX_NUM_HASH_FUNCTIONS
-
MIN_SEGMENT_DOCS
public static final int MIN_SEGMENT_DOCS
-
DEFAULT_SMALL_SEGMENT_MAX_DOCS
public static final int DEFAULT_SMALL_SEGMENT_MAX_DOCS
-
DEFAULT_LARGE_SEGMENT_MIN_DOCS
public static final int DEFAULT_LARGE_SEGMENT_MIN_DOCS
-
DEFAULT_HIGH_BITS_PER_DOC
public static final double DEFAULT_HIGH_BITS_PER_DOC
-
DEFAULT_LOW_BITS_PER_DOC
public static final double DEFAULT_LOW_BITS_PER_DOC
-
MAX_BLOOM_FILTER_SIZE
-
MIN_BITS_PER_DOC
public static final double MIN_BITS_PER_DOC
-
MAX_BITS_PER_DOC
public static final double MAX_BITS_PER_DOC
-
MIN_BLOOM_FILTER_SIZE
-
-
Constructor Details
-
ES94BloomFilterDocValuesFormat
public ES94BloomFilterDocValuesFormat()
-
ES94BloomFilterDocValuesFormat
public ES94BloomFilterDocValuesFormat(BigArrays bigArrays, String bloomFilterFieldName)
-
ES94BloomFilterDocValuesFormat
public ES94BloomFilterDocValuesFormat(BigArrays bigArrays, String bloomFilterFieldName, boolean optimizedMergeEnabled, int numHashFunctions, int smallSegmentMaxDocs, int largeSegmentMinDocs, double highBitsPerDoc, double lowBitsPerDoc, ByteSizeValue maxBloomFilterSize)
-
-
Method Details
-
fieldsConsumer
public org.apache.lucene.codecs.DocValuesConsumer fieldsConsumer(org.apache.lucene.index.SegmentWriteState state) throws IOException
- Specified by:
fieldsConsumer in class org.apache.lucene.codecs.DocValuesFormat
- Throws:
IOException
-
fieldsProducer
public org.apache.lucene.codecs.DocValuesProducer fieldsProducer(org.apache.lucene.index.SegmentReadState state) throws IOException
- Specified by:
fieldsProducer in class org.apache.lucene.codecs.DocValuesFormat
- Throws:
IOException
-
bloomFilterSizeInBytesForNewSegment
public int bloomFilterSizeInBytesForNewSegment(int numDocs)
Computes the bloom filter size in bytes for a newly created segment.
The sizing strategy balances false-positive accuracy against storage cost depending on the segment's document count:
- Small segments (≤ 160000 docs by default): sized at 128.0 bits per document by default. Nominal saturation is ~3.1% (s = 1 − e^(−k/bpd), where k is the number of hash functions and bpd the bits per document, here with k = 4); power-of-two rounding typically inflates the actual filter, bringing effective saturation to ~1.9–3.1% depending on the document count.
- Mid-range segments (160000 to 320000 docs by default): bits per document tapers linearly from 128.0 down to 24.0. This avoids a sharp cliff in filter quality between adjacent segment sizes while gradually trading accuracy for a smaller storage footprint.
- Large segments (≥ 320000 docs by default): sized at a flat 24.0 bits per document. At this scale the hard cap at MAX_BLOOM_FILTER_SIZE dominates; the filter cannot grow proportionally with doc count regardless of the bits-per-doc ratio.
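The three tiers above can be sketched as a small piecewise function. This is an illustrative sketch, not the class's code: the class and method names are hypothetical, the constants are the default values quoted in the list, the taper is assumed to be linear in the document count, and k = 4 is taken from the saturation formula rather than from DEFAULT_NUM_HASH_FUNCTIONS, whose value is not shown here. No size bounding or power-of-two rounding is applied.

```java
final class BloomSizingSketch {
    // Default values quoted in the description (assumed to match the DEFAULT_* constants).
    static final int SMALL_SEGMENT_MAX_DOCS = 160_000;
    static final int LARGE_SEGMENT_MIN_DOCS = 320_000;
    static final double HIGH_BITS_PER_DOC = 128.0;
    static final double LOW_BITS_PER_DOC = 24.0;
    static final int NUM_HASH_FUNCTIONS = 4; // k = 4, as in the saturation formula

    /** Bits per document before any min/max bounding or power-of-two rounding. */
    static double bitsPerDoc(int numDocs) {
        if (numDocs <= SMALL_SEGMENT_MAX_DOCS) {
            return HIGH_BITS_PER_DOC;
        }
        if (numDocs >= LARGE_SEGMENT_MIN_DOCS) {
            return LOW_BITS_PER_DOC;
        }
        // Fraction of the way from the small-segment bound to the large-segment bound.
        double t = (double) (numDocs - SMALL_SEGMENT_MAX_DOCS)
            / (LARGE_SEGMENT_MIN_DOCS - SMALL_SEGMENT_MAX_DOCS);
        return HIGH_BITS_PER_DOC + t * (LOW_BITS_PER_DOC - HIGH_BITS_PER_DOC);
    }

    /** Nominal saturation s = 1 - e^(-k / bpd): the expected fraction of bits set. */
    static double nominalSaturation(double bitsPerDoc) {
        return 1.0 - Math.exp(-NUM_HASH_FUNCTIONS / bitsPerDoc);
    }

    public static void main(String[] args) {
        for (int numDocs : new int[] {100_000, 160_000, 240_000, 320_000, 1_000_000}) {
            double bpd = bitsPerDoc(numDocs);
            System.out.printf("%d docs -> %.1f bits/doc, nominal saturation %.2f%%%n",
                numDocs, bpd, 100.0 * nominalSaturation(bpd));
        }
    }
}
```

Under these assumptions a segment halfway through the mid-range (240000 docs) gets 76.0 bits per document, and the small-segment tier reproduces the ~3.1% nominal saturation.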
The result is always passed through boundAndRoundBloomFilterSizeInBytes(long) to enforce minimum/maximum size limits and power-of-two rounding.
- Parameters:
numDocs - the number of documents in the new segment, must be positive
- Returns:
bloom filter size in bytes, as a power of two
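The bounding and rounding step could look roughly like the sketch below. It is not the body of boundAndRoundBloomFilterSizeInBytes(long): the class name, the method signature, and the choice to round up to the next power of two are assumptions, and the minimum and maximum are passed in as hypothetical parameters because the values of MIN_BLOOM_FILTER_SIZE and MAX_BLOOM_FILTER_SIZE are not shown on this page.

```java
final class BoundAndRoundSketch {
    /**
     * Clamps a raw size to [minBytes, maxBytes], then rounds up to a power of two.
     * Both limits are assumed to be positive powers of two, so the result stays in range.
     */
    static int boundAndRound(long rawBytes, int minBytes, int maxBytes) {
        long clamped = Math.max(minBytes, Math.min(maxBytes, rawBytes));
        long highest = Long.highestOneBit(clamped);
        // Already a power of two: keep it; otherwise move up to the next one.
        long rounded = (highest == clamped) ? clamped : highest << 1;
        return (int) Math.min(rounded, maxBytes);
    }

    public static void main(String[] args) {
        // Hypothetical limits, chosen only for this example.
        int minBytes = 1024;
        int maxBytes = 1 << 26;
        // 160000 docs at 128 bits per doc is 2,560,000 bytes before rounding.
        long rawBytes = 160_000L * 128 / 8;
        System.out.println(boundAndRound(rawBytes, minBytes, maxBytes));
    }
}
```

With round-up semantics and limits that do not bind, 160000 documents at 128 bits per document (2,560,000 bytes) round to 4,194,304 bytes, which is about 209.7 bits per document and a saturation of roughly 1.9%. That is consistent with the lower end of the effective range quoted for small segments.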
-