Class BytesRefSwissHash

java.lang.Object
org.elasticsearch.swisshash.SwissHash
org.elasticsearch.swisshash.BytesRefSwissHash
All Implemented Interfaces:
Closeable, AutoCloseable, org.apache.lucene.util.Accountable, BytesRefHashTable, org.elasticsearch.core.Releasable

public final class BytesRefSwissHash extends SwissHash implements org.apache.lucene.util.Accountable, BytesRefHashTable
Assigns int ids to BytesRefs, vending the ids in order they are added.

At it's core there are two hash table implementations, a "small core" and a "big core". The "small core" is a simple open addressed hash table with a fixed 60% load factor and a table of 2048. It quite quick because it has a fixed size and never grows.

When the "small core" has more entries than it's load factor the "small core" is replaced with a "big core". The "big core" functions quite similarly to a Swisstable, Google's fancy SIMD hash table. In this table there's a contiguous array of "control" bytes that are either 0b1111_1111 for empty entries or 0b0aaa_aaaa for populated entries, where aaa_aaaa are the top 7 bytes of the hash. To find an entry by key you hash it, grab the top 7 bytes or it, and perform a SIMD read of the control array starting at the expected slot. We use the widest SIMD instruction the CPU supports, meaning 64 or 32 bytes. If any of those match we check the actual key. So instead of scanning one slot at a time "small core", we effectively scan a whole bunch at once. This allows us to run a much higher load factor (87.5%) without any performance penalty so the extra byte feels super worth it.

When a "big core" fills it's table to the fill factor, we build a new "big core" nd read all values in the old "big core" into the new one.

This class does not store the keys in the hash table slots. Instead, it uses a BytesRefArray to store the actual bytes, and the hash table slots store the id which indexes into the BytesRefArray.

  • Method Details

    • find

      public long find(org.apache.lucene.util.BytesRef key)
      Finds an id by a key.
      Specified by:
      find in interface BytesRefHashTable
    • add

      public long add(org.apache.lucene.util.BytesRef key)
      Adds a key, returning its id. If it was already present it's previous assigned id will be returned. If it wasn't present it'll be assigned a new id.
      Specified by:
      add in interface BytesRefHashTable
    • status

      public SwissHash.Status status()
      Description copied from class: SwissHash
      Performance information hopefully useful for debugging.
      Specified by:
      status in class SwissHash
    • iterator

      public BytesRefSwissHash.Itr iterator()
      Description copied from class: SwissHash
      Build an iterator to walk all values and ids.
      Specified by:
      iterator in class SwissHash
    • close

      public void close()
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Specified by:
      close in interface org.elasticsearch.core.Releasable
    • get

      public org.apache.lucene.util.BytesRef get(long id, org.apache.lucene.util.BytesRef dest)
      Returns the key at 0 <= id <= size(). The result is undefined if the id is unused.
      Specified by:
      get in interface BytesRefHashTable
      Parameters:
      id - the id returned when the key was added
      Returns:
      the key
    • getBytesRefs

      public BytesRefArray getBytesRefs()
      Returns the key array.
      Specified by:
      getBytesRefs in interface BytesRefHashTable
    • ramBytesUsed

      public long ramBytesUsed()
      Specified by:
      ramBytesUsed in interface org.apache.lucene.util.Accountable