Interface SegmentableFormatReader
- All Superinterfaces:
AutoCloseable, Closeable, FormatReader
FormatReader for line-oriented text formats (CSV, NDJSON)
that support intra-file parallel parsing.
Formats that implement this interface declare they can find record boundaries within an arbitrary byte stream, enabling the framework to split a single file into byte-range segments and parse them concurrently on multiple threads.
Columnar formats (Parquet, ORC) should not implement this interface; they expose row-group-level parallelism instead.
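As a sketch of the framework-side splitting this enables (class and method names here are illustrative, not part of the SPI): a naive byte-range offset is aligned forward to the next record boundary, and offset 0 is always a valid record start.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical planner showing how findNextRecordBoundary could align
// raw byte-range splits to record boundaries. Not the actual framework code.
public class SegmentPlanner {

    // Minimal newline-only scan, standing in for a format's implementation.
    static long findNextRecordBoundary(InputStream stream) throws IOException {
        long consumed = 0;
        int b;
        while ((b = stream.read()) != -1) {
            consumed++;
            if (b == '\n') {
                return consumed;
            }
        }
        return -1;
    }

    // Moves a naive split offset forward so the segment starts on a record
    // boundary; if no boundary follows, the segment collapses to file end.
    static long alignToRecordStart(byte[] file, long rawOffset) throws IOException {
        if (rawOffset == 0) {
            return 0; // the start of the file is always a record boundary
        }
        InputStream in = new ByteArrayInputStream(
                file, (int) rawOffset, file.length - (int) rawOffset);
        long skipped = findNextRecordBoundary(in);
        return skipped < 0 ? file.length : rawOffset + skipped;
    }
}
```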
-
Nested Class Summary
Nested classes/interfaces inherited from interface org.elasticsearch.xpack.esql.datasources.spi.FormatReader
FormatReader.SchemaResolution
-
Field Summary
Fields inherited from interface org.elasticsearch.xpack.esql.datasources.spi.FormatReader
NO_LIMIT
-
Method Summary
Modifier and Type | Method | Description
long | findNextRecordBoundary(InputStream stream) | Scans forward from the current position in the stream to find the start of the next complete record.
default long | minimumSegmentSize() | Returns the minimum segment size in bytes below which splitting is not worthwhile.
Methods inherited from interface org.elasticsearch.xpack.esql.datasources.spi.FormatReader
aggregatePushdownSupport, defaultErrorPolicy, defaultSchemaResolution, fileExtensions, filterPushdownSupport, formatName, metadata, read, read, readAsync, schema, supportsNativeAsync, withConfig, withPushedFilter, withSchema
-
Method Details
-
findNextRecordBoundary
long findNextRecordBoundary(InputStream stream) throws IOException
Scans forward from the current position in the stream to find the start of the next complete record. Returns the number of bytes consumed (skipped) to reach that boundary.
For newline-delimited formats (CSV, NDJSON), this means scanning until the first \n or \r\n and returning the byte count including the line terminator. The next byte in the stream is then the start of a complete record.
Note on quoting: implementations for formats that support quoting (e.g. CSV with quoted fields containing embedded newlines) should either track quoting state during the scan or document that parallel parsing is not safe for files with embedded newlines in quoted fields.
The stream is positioned at an arbitrary byte offset within the file (typically a segment boundary). The implementation must consume bytes until it finds a record boundary, leaving the stream positioned at the start of the next record.
- Parameters:
stream - an open stream positioned at an arbitrary offset within the file
- Returns:
the number of bytes consumed to reach the next record boundary, or -1 if the end of stream is reached without finding a boundary
- Throws:
IOException - if an I/O error occurs while scanning
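A minimal sketch of a conforming implementation for a newline-delimited format such as NDJSON, where records never contain embedded newlines (this is illustrative, not the actual Elasticsearch implementation; a quoting-aware CSV scan would need extra state):

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative boundary scan for newline-delimited records.
public class NdjsonBoundaryScan {

    static long findNextRecordBoundary(InputStream stream) throws IOException {
        long consumed = 0;
        int b;
        while ((b = stream.read()) != -1) {
            consumed++;
            if (b == '\n') {
                // '\n' terminates both "\n" and "\r\n" line endings; any
                // preceding '\r' was already counted, so `consumed` includes
                // the full line terminator either way.
                return consumed;
            }
        }
        return -1; // end of stream reached without finding a boundary
    }
}
```

Note that the scan reads one byte at a time for clarity; a production version would typically buffer, but the contract (bytes consumed including the terminator, or -1) is the same.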
-
minimumSegmentSize
default long minimumSegmentSize()
Returns the minimum segment size in bytes below which splitting is not worthwhile. Segments smaller than this will be merged with an adjacent segment.
Defaults to 1 MiB. ClickHouse benchmarks show 1 MiB chunks are optimal for parallel parsing: 100 KB chunks are ~40% slower due to per-chunk overhead, while 10 MiB chunks offer only marginal improvement. Implementations may override this to reflect their parsing overhead.
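To illustrate how a framework might consume this value (the helper class and method names here are hypothetical, not part of the SPI), a segment count can be capped so that no segment falls below the minimum where per-segment overhead dominates:

```java
// Hypothetical sizing helper: derive a segment count from file size and the
// reader's minimumSegmentSize(), capped by available parallelism.
public class SegmentSizing {

    static final long DEFAULT_MIN_SEGMENT_SIZE = 1L << 20; // 1 MiB default

    static int segmentCount(long fileSize, long minSegmentSize, int maxParallelism) {
        long bySize = Math.max(1, fileSize / minSegmentSize); // never split below the minimum
        return (int) Math.min(bySize, maxParallelism);        // cap at available threads
    }
}
```

For example, a 10 MiB file with the 1 MiB default and 8 threads yields 8 segments, while a 512 KiB file is never split.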
-