Interface FormatReader

All Superinterfaces:
AutoCloseable, Closeable
All Known Subinterfaces:
RangeAwareFormatReader, SegmentableFormatReader

public interface FormatReader extends Closeable
Unified interface for reading data formats.

Simple formats: implement only read(StorageObject, FormatReadContext) (sync) - async wrapping is automatic. Async-capable formats: override readAsync(StorageObject, FormatReadContext, Executor, ActionListener) for native async behavior.

The output is ESQL's native Page format rather than Arrow to avoid mandating Arrow as a dependency for all format implementations.

Implementations should provide metadata discovery via metadata(StorageObject) which returns a unified SourceMetadata containing schema and source information.

Per-query format configuration (delimiter, encoding, etc.) is set on the reader instance via withConfig(Map). Per-query optimizer hints (pushed filters for row-group or stripe skipping) are set via withPushedFilter(Object). Per-read execution parameters (projection, batch size, limit, error policy, split config) are bundled in FormatReadContext.

  • Field Details

  • Method Details

    • defaultSchemaResolution

      default FormatReader.SchemaResolution defaultSchemaResolution()
    • defaultErrorPolicy

      default ErrorPolicy defaultErrorPolicy()
      Returns the default error policy for this format. Override to change the default behavior for a specific format (e.g. NDJSON defaults to lenient because skipping malformed lines is its natural behavior).
    • metadata

      SourceMetadata metadata(StorageObject object) throws IOException
      Throws:
      IOException
    • schema

      default List<Attribute> schema(StorageObject object) throws IOException
      Throws:
      IOException
    • read

      Reads data from the given storage object using the provided context.

      This is the primary read method. All implementations must override this method.

      Throws:
      IOException
    • read

      default CloseableIterator<Page> read(StorageObject object, List<String> projectedColumns, int batchSize) throws IOException
      Convenience overload that delegates to read(StorageObject, FormatReadContext). Keeps test code and simple call sites working without constructing a context.
      Throws:
      IOException
    • readAsync

      default void readAsync(StorageObject object, FormatReadContext context, Executor executor, ActionListener<CloseableIterator<Page>> listener)
      Asynchronously reads data from the given storage object using the provided context.

      The default wraps the synchronous read(StorageObject, FormatReadContext) in the provided executor. Formats with native async support should override this.

    • formatName

      String formatName()
    • fileExtensions

      List<String> fileExtensions()
    • withConfig

      default FormatReader withConfig(Map<String,Object> config)
      Returns a format reader configured with the given config map (from the WITH clause). Implementations should parse format-specific options from the config and return a new reader instance if any options are present. The default returns this (no configuration).
    • withPushedFilter

      default FormatReader withPushedFilter(Object pushedFilter)
      Returns a format reader configured with the given pushed filter from the optimizer.

      The pushed filter is an opaque object produced by FilterPushdownSupport during local physical optimization. Only format readers that support predicate pushdown (e.g., Parquet row-group skipping, ORC stripe-level predicates) need to override this.

      The filter is per-query: it applies identically to every file/split in the query. Implementations should cast the filter to their expected type and return a new reader instance with the filter stored as an instance field.

      Parameters:
      pushedFilter - opaque filter object, or null if no filter was pushed
      Returns:
      a new reader with the filter applied, or this if the filter is not applicable
    • aggregatePushdownSupport

      default AggregatePushdownSupport aggregatePushdownSupport()
      Returns the aggregate pushdown support for this format. Only format readers with column statistics in their metadata (Parquet, ORC) override this.
    • withSchema

      default FormatReader withSchema(List<Attribute> schema)
      Returns a format reader configured with the schema attributes.

      The schema is determined during the planning phase (via metadata(StorageObject)) and is constant for all files/splits in a query. Passing it here allows the reader to skip re-reading/inferring the schema from the file header on every read, which is especially important for split-based reads where the split may start mid-file (no header available).

      Formats with embedded schemas (Parquet, ORC) may ignore this since they always read the schema from the file metadata.

      Parameters:
      schema - the planning-phase schema attributes, or null to clear
      Returns:
      a new reader with the schema set, or this if the schema is not needed
    • filterPushdownSupport

      default FilterPushdownSupport filterPushdownSupport()
      Returns the filter pushdown support for this format, or null if not supported.

      When non-null, the optimizer can translate ESQL filter expressions into format-specific predicates (e.g., Parquet FilterPredicate) that enable row-group skipping via statistics, dictionary, and bloom filter checks.

      Returns:
      FilterPushdownSupport for this format, or null if not supported
    • supportsNativeAsync

      default boolean supportsNativeAsync()