Class ParallelParsingCoordinator

java.lang.Object
org.elasticsearch.xpack.esql.datasources.ParallelParsingCoordinator

public final class ParallelParsingCoordinator extends Object
Coordinates parallel parsing of a single file by splitting it into byte-range segments and dispatching each segment to a parser thread. Results are reassembled in segment order so that row ordering is preserved.

Inspired by ClickHouse's ParallelParsingInputFormat. The approach:

  1. A segmentator divides the file into N byte-range segments at record boundaries
  2. Each segment is parsed independently on a separate executor thread
  3. The coordinator yields pages in segment order via a CloseableIterator

This coordinator only works with SegmentableFormatReader implementations (line-oriented formats like CSV and NDJSON). Columnar formats have their own row-group-level parallelism.

  • Method Details

    • parallelRead

      public static CloseableIterator<Page> parallelRead(SegmentableFormatReader reader, StorageObject storageObject, List<String> projectedColumns, int batchSize, int parallelism, Executor executor) throws IOException
      Creates a parallel-parsing iterator over a single storage object.

      The file is divided into parallelism segments at record boundaries. Each segment is parsed independently and results are yielded in order. If the file is too small for meaningful parallelism (below the reader's SegmentableFormatReader.minimumSegmentSize() per segment), falls back to single-threaded reading.

      Parameters:
      reader - the segmentable format reader
      storageObject - the file to read
      projectedColumns - columns to project
      batchSize - rows per page
      parallelism - number of parallel parser threads
      executor - executor for parser threads
      Returns:
      an iterator that yields pages in segment order
      Throws:
      IOException
    • parallelRead

      public static CloseableIterator<Page> parallelRead(SegmentableFormatReader reader, StorageObject storageObject, List<String> projectedColumns, int batchSize, int parallelism, Executor executor, ErrorPolicy errorPolicy) throws IOException
      Creates a parallel-parsing iterator with an explicit error policy.
      Parameters:
      errorPolicy - error handling policy for per-segment parsing, or null for defaults
      Throws:
      IOException
    • computeSegments

      public static List<long[]> computeSegments(SegmentableFormatReader reader, StorageObject storageObject, long fileLength, int parallelism, long minSegment) throws IOException
      Computes byte-range segments for the file by probing record boundaries. Each segment is a [offset, length] pair. The first segment starts at offset 0; subsequent segments start at the record boundary found after the nominal split point.
      Throws:
      IOException