Interface SegmentableFormatReader

All Superinterfaces:
AutoCloseable, Closeable, FormatReader

public interface SegmentableFormatReader extends FormatReader
Extension of FormatReader for line-oriented text formats (CSV, NDJSON) that support intra-file parallel parsing.

Formats that implement this interface declare they can find record boundaries within an arbitrary byte stream, enabling the framework to split a single file into byte-range segments and parse them concurrently on multiple threads.

Columnar formats (Parquet, ORC) should not implement this interface — their parallelism comes from internal file structure (Parquet row groups, ORC stripes) rather than byte-range splitting.
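The framework-side splitting described above can be sketched as follows. This is an illustrative sketch, not the framework's actual code: the class and method names (SegmentPlanner, alignSegmentStarts, scanToBoundary) are hypothetical, and the boundary scan inlines the newline-based contract documented for findNextRecordBoundary below.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SegmentPlanner {

    // Newline-based boundary scan matching the documented contract:
    // returns bytes consumed up to and including '\n', or -1 at end of stream.
    static long scanToBoundary(InputStream in) throws IOException {
        long consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            if (b == '\n') {
                return consumed;
            }
        }
        return -1;
    }

    // Divides the input into numSegments equal byte ranges, then shifts each
    // interior segment start forward to the next record boundary so that every
    // segment begins on a complete record.
    static long[] alignSegmentStarts(byte[] data, int numSegments) throws IOException {
        long[] starts = new long[numSegments];
        long segSize = data.length / numSegments;
        starts[0] = 0; // the file itself starts at a record boundary
        for (int i = 1; i < numSegments; i++) {
            long naive = i * segSize;
            InputStream in = new ByteArrayInputStream(data);
            in.skip(naive); // ByteArrayInputStream skips fully; real code should loop on skip()
            long consumed = scanToBoundary(in);
            // No boundary left before EOF: the segment collapses to end-of-file.
            starts[i] = (consumed == -1) ? data.length : naive + consumed;
        }
        return starts;
    }
}
```

Each aligned segment can then be handed to a worker thread, which parses records from its start offset up to the next segment's start.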

  • Method Details

    • findNextRecordBoundary

      long findNextRecordBoundary(InputStream stream) throws IOException
      Scans forward from the current position in the stream to find the start of the next complete record. Returns the number of bytes consumed (skipped) to reach that boundary.

      For newline-delimited formats (CSV, NDJSON), this means scanning until the first \n or \r\n and returning the byte count including the line terminator. The next byte in the stream is the start of a complete record.

      Note on quoting: Implementations for formats that support quoting (e.g. CSV with quoted fields containing embedded newlines) should either track quoting state during the scan or document that parallel parsing is not safe for files with embedded newlines in quoted fields.

      The stream is positioned at an arbitrary byte offset within the file (typically a segment boundary). The implementation must consume bytes until it finds a record boundary, leaving the stream positioned at the start of the next record.

      Parameters:
      stream - an open stream positioned at an arbitrary offset within the file
      Returns:
      the number of bytes consumed to reach the next record boundary, or -1 if the end of stream is reached without finding a boundary
      Throws:
      IOException - if an I/O error occurs while scanning
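For a newline-delimited format such as NDJSON, a minimal implementation of this contract could look like the sketch below. It is shown as a static helper on a hypothetical class, since the rest of the FormatReader hierarchy is not reproduced here; scanning for '\n' also handles "\r\n" terminators, because the '\n' is the final byte of the pair.

```java
import java.io.IOException;
import java.io.InputStream;

public class NdjsonBoundaryScan {

    // Consumes bytes up to and including the first '\n' and returns the
    // count, or -1 if the stream ends before any line terminator is seen.
    public static long findNextRecordBoundary(InputStream stream) throws IOException {
        long consumed = 0;
        int b;
        while ((b = stream.read()) != -1) {
            consumed++;
            if (b == '\n') {
                return consumed;
            }
        }
        return -1;
    }
}
```

After a non-negative return, the next read() on the stream yields the first byte of a complete record, as the contract requires.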
    • minimumSegmentSize

      default long minimumSegmentSize()
      Returns the minimum segment size in bytes below which splitting is not worthwhile. Segments smaller than this will be merged with an adjacent segment.

Defaults to 1 MiB. ClickHouse benchmarks show 1 MiB chunks are optimal for parallel parsing — 100 KB chunks are ~40% slower due to per-chunk overhead, while 10 MiB chunks offer only marginal improvement. Implementations may override this default to reflect their own per-segment parsing overhead.
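A format with expensive per-segment setup might raise the threshold. The sketch below is purely illustrative — the class name and the 4 MiB figure are hypothetical, and the interface declaration is commented out because its definition is not reproduced here.

```java
public class WideRowReader /* implements SegmentableFormatReader */ {

    // Raise the split threshold above the 1 MiB default because this
    // (hypothetical) format pays a higher fixed cost per segment.
    public long minimumSegmentSize() {
        return 4L << 20; // 4 MiB
    }
}
```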