Class ES94BloomFilterDocValuesFormat

java.lang.Object
org.apache.lucene.codecs.DocValuesFormat
org.elasticsearch.index.codec.bloomfilter.ES94BloomFilterDocValuesFormat
All Implemented Interfaces:
org.apache.lucene.util.NamedSPILoader.NamedSPI

public class ES94BloomFilterDocValuesFormat extends org.apache.lucene.codecs.DocValuesFormat
A doc values format that builds a Bloom filter for a specific field to enable fast existence checks.

The field used to build the Bloom filter is not stored, as only its presence needs to be tracked for filtering purposes. This reduces storage overhead while maintaining the ability to quickly determine if a segment might contain the field.
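The "might contain" semantics described above can be illustrated with a generic Bloom filter. The following is a minimal sketch, not the implementation used by this class: the class name `BloomFilterSketch`, the double-hashing scheme, and the FNV-1a helper are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical, generic Bloom filter; it does not mirror the on-disk format of this class.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;          // must be a power of two so masking replaces modulo
    private final int numHashFunctions;

    BloomFilterSketch(int numBits, int numHashFunctions) {
        if (numBits <= 0 || Integer.bitCount(numBits) != 1) {
            throw new IllegalArgumentException("numBits must be a positive power of two");
        }
        if (numHashFunctions <= 0) {
            throw new IllegalArgumentException("numHashFunctions must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashFunctions = numHashFunctions;
    }

    // Sets one bit per hash function for the given value.
    void add(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashFunctions; i++) {
            bits.set(index(bytes, i));
        }
    }

    // Returns false only if the value was definitely never added; true may be a false positive.
    boolean mightContain(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashFunctions; i++) {
            if (!bits.get(index(bytes, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing (assumed scheme): the i-th probe is h1 + i * h2, masked to the filter size.
    private int index(byte[] bytes, int i) {
        int h1 = Arrays.hashCode(bytes);
        int h2 = fnv1a(bytes) | 1; // force an odd step so probes spread over the table
        return (h1 + i * h2) & (numBits - 1);
    }

    // 32-bit FNV-1a hash, used here only as a second, independent-looking hash.
    private static int fnv1a(byte[] bytes) {
        int hash = 0x811c9dc5;
        for (byte b : bytes) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 4);
        filter.add("doc-1");
        System.out.println(filter.mightContain("doc-1"));
    }
}
```

A value that was added always reports `true`; a value that was never added usually reports `false`, but may collide with bits set by other values.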

File formats

Bloom filter doc values are represented by two files:

  1. A Bloom filter data file (extension .sfbf). This file stores the Bloom filter bitset.

  2. A Bloom filter metadata file (extension .sfbfm). This file stores metadata about the Bloom filters held in the data file. The in-memory representation can be found in ES94BloomFilterDocValuesFormat.BloomFilterMetadata.

  • Field Details

    • FORMAT_NAME

      public static final String FORMAT_NAME
    • STORED_FIELDS_BLOOM_FILTER_EXTENSION

      public static final String STORED_FIELDS_BLOOM_FILTER_EXTENSION
    • STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION

      public static final String STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION
    • DEFAULT_NUM_HASH_FUNCTIONS

      public static final int DEFAULT_NUM_HASH_FUNCTIONS
    • MAX_NUM_HASH_FUNCTIONS

      public static final int MAX_NUM_HASH_FUNCTIONS
    • MIN_SEGMENT_DOCS

      public static final int MIN_SEGMENT_DOCS
    • DEFAULT_SMALL_SEGMENT_MAX_DOCS

      public static final int DEFAULT_SMALL_SEGMENT_MAX_DOCS
    • DEFAULT_LARGE_SEGMENT_MIN_DOCS

      public static final int DEFAULT_LARGE_SEGMENT_MIN_DOCS
    • DEFAULT_HIGH_BITS_PER_DOC

      public static final double DEFAULT_HIGH_BITS_PER_DOC
    • DEFAULT_LOW_BITS_PER_DOC

      public static final double DEFAULT_LOW_BITS_PER_DOC
    • MAX_BLOOM_FILTER_SIZE

      public static final ByteSizeValue MAX_BLOOM_FILTER_SIZE
    • MIN_BITS_PER_DOC

      public static final double MIN_BITS_PER_DOC
    • MAX_BITS_PER_DOC

      public static final double MAX_BITS_PER_DOC
    • MIN_BLOOM_FILTER_SIZE

      public static final ByteSizeValue MIN_BLOOM_FILTER_SIZE
  • Constructor Details

    • ES94BloomFilterDocValuesFormat

      public ES94BloomFilterDocValuesFormat()
    • ES94BloomFilterDocValuesFormat

      public ES94BloomFilterDocValuesFormat(BigArrays bigArrays, String bloomFilterFieldName)
    • ES94BloomFilterDocValuesFormat

      public ES94BloomFilterDocValuesFormat(BigArrays bigArrays, String bloomFilterFieldName, boolean optimizedMergeEnabled, int numHashFunctions, int smallSegmentMaxDocs, int largeSegmentMinDocs, double highBitsPerDoc, double lowBitsPerDoc, ByteSizeValue maxBloomFilterSize)
  • Method Details

    • fieldsConsumer

      public org.apache.lucene.codecs.DocValuesConsumer fieldsConsumer(org.apache.lucene.index.SegmentWriteState state) throws IOException
      Specified by:
      fieldsConsumer in class org.apache.lucene.codecs.DocValuesFormat
      Throws:
      IOException
    • fieldsProducer

      public org.apache.lucene.codecs.DocValuesProducer fieldsProducer(org.apache.lucene.index.SegmentReadState state) throws IOException
      Specified by:
      fieldsProducer in class org.apache.lucene.codecs.DocValuesFormat
      Throws:
      IOException
    • bloomFilterSizeInBytesForNewSegment

      public int bloomFilterSizeInBytesForNewSegment(int numDocs)
      Computes the bloom filter size in bytes for a newly created segment.

      The sizing strategy balances false-positive accuracy against storage cost depending on the segment's document count:

      • Small segments (≤ 160000 docs by default): sized at 128.0 bits per document by default. Nominal saturation is ~3.1% (s = 1 − e^(−k/bpd) with k = 4, where bpd is the bits-per-document ratio); power-of-two rounding typically inflates the actual filter, bringing effective saturation to ~1.9–3.1% depending on the document count.
      • Mid-range segments (160000–320000 docs by default): bits per document tapers linearly from 128.0 down to 24.0. This avoids a sharp cliff in filter quality between adjacent segment sizes while gradually trading accuracy for a smaller storage footprint.
      • Large segments (≥ 320000 docs by default): sized at a flat 24.0 bits per document. At this scale the hard cap at MAX_BLOOM_FILTER_SIZE dominates; the filter cannot grow proportionally with doc count regardless of the bits-per-doc ratio.
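The three tiers above can be sketched as a small piecewise function, together with the nominal saturation formula s = 1 − e^(−k/bpd). This is an illustrative sketch only: the class name `BitsPerDocSketch` and its constants are hypothetical stand-ins that copy the default values quoted in the list, and the actual method may compute the taper differently.

```java
// Hypothetical sketch of the tiered bits-per-document strategy; not the class's real code.
final class BitsPerDocSketch {
    // Default thresholds and ratios as quoted in the description above.
    static final int SMALL_SEGMENT_MAX_DOCS = 160_000;
    static final int LARGE_SEGMENT_MIN_DOCS = 320_000;
    static final double HIGH_BITS_PER_DOC = 128.0;
    static final double LOW_BITS_PER_DOC = 24.0;

    // Returns the target bits per document for a segment with numDocs documents.
    static double bitsPerDoc(int numDocs) {
        if (numDocs <= SMALL_SEGMENT_MAX_DOCS) {
            return HIGH_BITS_PER_DOC;
        }
        if (numDocs >= LARGE_SEGMENT_MIN_DOCS) {
            return LOW_BITS_PER_DOC;
        }
        // Linear taper between the two thresholds.
        double fraction = (double) (numDocs - SMALL_SEGMENT_MAX_DOCS)
            / (LARGE_SEGMENT_MIN_DOCS - SMALL_SEGMENT_MAX_DOCS);
        return HIGH_BITS_PER_DOC - fraction * (HIGH_BITS_PER_DOC - LOW_BITS_PER_DOC);
    }

    // Nominal saturation s = 1 - e^(-k / bpd): the expected fraction of set bits.
    static double nominalSaturation(int numHashFunctions, double bitsPerDoc) {
        return 1.0 - Math.exp(-numHashFunctions / bitsPerDoc);
    }

    public static void main(String[] args) {
        for (int numDocs : new int[] {100_000, 240_000, 500_000}) {
            double bpd = bitsPerDoc(numDocs);
            System.out.printf("numDocs=%d bitsPerDoc=%.1f saturation=%.4f%n",
                numDocs, bpd, nominalSaturation(4, bpd));
        }
    }
}
```

For a segment exactly halfway between the thresholds (240000 docs), the taper yields 76.0 bits per document, the midpoint of 128.0 and 24.0.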

      The result is always passed through boundAndRoundBloomFilterSizeInBytes(long) to enforce minimum/maximum size limits and power-of-two rounding.
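The bounding and rounding step might look like the sketch below. Several things here are assumptions: this page does not state the values of MIN_BLOOM_FILTER_SIZE and MAX_BLOOM_FILTER_SIZE, so the limits are passed in as parameters and the numbers in `main` are purely hypothetical; rounding up to the next power of two is also assumed, and the real boundAndRoundBloomFilterSizeInBytes(long) may round differently.

```java
// Hypothetical sketch of clamping a raw size and rounding it to a power of two.
final class BoundAndRoundSketch {
    // Clamps sizeInBytes into [minBytes, maxBytes], then rounds up to a power of two
    // without exceeding maxBytes. Assumes 1 <= minBytes <= maxBytes <= Integer.MAX_VALUE.
    static int boundAndRound(long sizeInBytes, long minBytes, long maxBytes) {
        long clamped = Math.max(minBytes, Math.min(maxBytes, sizeInBytes));
        // Largest power of two that is <= clamped.
        long floorPowerOfTwo = Long.highestOneBit(clamped);
        long rounded = (floorPowerOfTwo == clamped) ? clamped : floorPowerOfTwo << 1;
        if (rounded > maxBytes) {
            // Rounding up overshot the cap; fall back to the largest power of two under it.
            rounded = Long.highestOneBit(maxBytes);
        }
        return Math.toIntExact(rounded);
    }

    public static void main(String[] args) {
        long hypotheticalMin = 1024;      // 1 KiB, illustrative only
        long hypotheticalMax = 1L << 20;  // 1 MiB, illustrative only
        long rawBytes = (long) Math.ceil(100_000 * 24.0 / 8); // docs * bits-per-doc / 8
        System.out.println(boundAndRound(rawBytes, hypotheticalMin, hypotheticalMax));
    }
}
```

Keeping the size a power of two lets a bit index be derived with a mask rather than a modulo, which is a common reason for this kind of rounding; whether that is the motivation here is not stated in the source.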

      Parameters:
      numDocs - the number of documents in the new segment, must be positive
      Returns:
      bloom filter size in bytes, as a power of two