Class ParallelParsingCoordinator
java.lang.Object
org.elasticsearch.xpack.esql.datasources.ParallelParsingCoordinator
Coordinates parallel parsing of a single file by splitting it into byte-range
segments and dispatching each segment to a parser thread. Results are reassembled
in segment order so that row ordering is preserved.
Inspired by ClickHouse's ParallelParsingInputFormat. The approach:
- A segmentator divides the file into N byte-range segments at record boundaries
- Each segment is parsed independently on a separate executor thread
- The coordinator yields pages in segment order via a CloseableIterator
This coordinator only works with SegmentableFormatReader implementations
(line-oriented formats like CSV and NDJSON). Columnar formats have their own
row-group-level parallelism.
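The approach above can be sketched in a small, self-contained program. This is an illustrative simplification, not the ParallelParsingCoordinator implementation: it splits an in-memory newline-delimited buffer into byte-range segments, parses each on an executor thread, and collects results in segment order (the class and method names here are hypothetical).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Simplified sketch of segment-ordered parallel parsing: each [offset, length]
// segment is parsed on its own thread, and results are reassembled in segment
// order so row ordering is preserved.
public class OrderedParallelParse {

    /** Parses each [offset, length] segment concurrently; output keeps segment order. */
    public static List<List<String>> parseSegments(byte[] data, List<long[]> segments) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(segments.size());
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (long[] seg : segments) {
                final int off = (int) seg[0];
                final int len = (int) seg[1];
                futures.add(executor.submit(() -> {
                    // "Parse" the segment: split its bytes into lines.
                    String chunk = new String(data, off, len, StandardCharsets.UTF_8);
                    List<String> rows = new ArrayList<>();
                    for (String line : chunk.split("\n", -1)) {
                        if (!line.isEmpty()) rows.add(line);
                    }
                    return rows;
                }));
            }
            // Futures are collected in submission order, which is segment order,
            // so rows come out in the same order as a sequential read.
            List<List<String>> ordered = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                ordered.add(f.get());
            }
            return ordered;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "a\nb\nc\nd\n".getBytes(StandardCharsets.UTF_8);
        // Two segments, each covering two complete records.
        List<long[]> segments = List.of(new long[] { 0, 4 }, new long[] { 4, 4 });
        System.out.println(parseSegments(data, segments)); // [[a, b], [c, d]]
    }
}
```

Collecting futures in submission order is what makes reassembly trivial: out-of-order completion is absorbed by `Future.get()` blocking until the earlier segment is done.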
Method Summary
Modifier and Type: static List<long[]>
Method: computeSegments(SegmentableFormatReader reader, StorageObject storageObject, long fileLength, int parallelism, long minSegment)
Description: Computes byte-range segments for the file by probing record boundaries.

Modifier and Type: static CloseableIterator<Page>
Method: parallelRead(SegmentableFormatReader reader, StorageObject storageObject, List<String> projectedColumns, int batchSize, int parallelism, Executor executor)
Description: Creates a parallel-parsing iterator over a single storage object.

Modifier and Type: static CloseableIterator<Page>
Method: parallelRead(SegmentableFormatReader reader, StorageObject storageObject, List<String> projectedColumns, int batchSize, int parallelism, Executor executor, ErrorPolicy errorPolicy)
Description: Creates a parallel-parsing iterator with an explicit error policy.
Method Details
parallelRead
public static CloseableIterator<Page> parallelRead(SegmentableFormatReader reader, StorageObject storageObject, List<String> projectedColumns, int batchSize, int parallelism, Executor executor) throws IOException
Creates a parallel-parsing iterator over a single storage object. The file is divided into parallelism segments at record boundaries. Each segment is parsed independently and results are yielded in order. If the file is too small for meaningful parallelism (below the reader's SegmentableFormatReader.minimumSegmentSize() per segment), falls back to single-threaded reading.
Parameters:
reader - the segmentable format reader
storageObject - the file to read
projectedColumns - columns to project
batchSize - rows per page
parallelism - number of parallel parser threads
executor - executor for parser threads
Returns:
an iterator that yields pages in segment order
Throws:
IOException
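The small-file fallback described above amounts to capping the thread count by how many segments of at least the minimum size the file can yield. A hedged sketch of that decision, with an illustrative method name not taken from the actual class:

```java
// Sketch of the small-file fallback: if splitting the file into `parallelism`
// segments would leave each below the reader's minimum segment size, fewer
// threads (down to one) are worth using. `effectiveParallelism` is a
// hypothetical helper, not part of ParallelParsingCoordinator's API.
public class ParallelismChoice {

    /** Returns how many parser threads are worth using for a file of the given length. */
    static int effectiveParallelism(long fileLength, int parallelism, long minSegmentSize) {
        long maxUsefulSegments = fileLength / minSegmentSize; // segments meeting the minimum
        if (maxUsefulSegments <= 1) {
            return 1; // too small: fall back to single-threaded reading
        }
        return (int) Math.min(parallelism, maxUsefulSegments);
    }

    public static void main(String[] args) {
        System.out.println(effectiveParallelism(10_000_000, 8, 1_048_576)); // large file: 8
        System.out.println(effectiveParallelism(100_000, 8, 1_048_576));    // tiny file: 1
    }
}
```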
parallelRead
public static CloseableIterator<Page> parallelRead(SegmentableFormatReader reader, StorageObject storageObject, List<String> projectedColumns, int batchSize, int parallelism, Executor executor, ErrorPolicy errorPolicy) throws IOException
Creates a parallel-parsing iterator with an explicit error policy.
Parameters:
errorPolicy - error handling policy for per-segment parsing, or null for defaults
Throws:
IOException
computeSegments
public static List<long[]> computeSegments(SegmentableFormatReader reader, StorageObject storageObject, long fileLength, int parallelism, long minSegment) throws IOException
Computes byte-range segments for the file by probing record boundaries. Each segment is an [offset, length] pair. The first segment starts at offset 0; subsequent segments start at the record boundary found after the nominal split point.
Throws:
IOException
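Boundary probing can be illustrated with a simplified, self-contained version that works on an in-memory newline-delimited buffer rather than a StorageObject: nominal split points are fileLength/parallelism apart, and each split is advanced to the byte after the next '\n' so no record straddles two segments. This is a sketch of the technique, not the actual computeSegments implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified boundary-probing segmentation for a newline-delimited buffer.
// Assumptions: data fits in memory and records are terminated by '\n'.
public class Segmenter {

    /** Returns [offset, length] pairs covering the whole buffer at record boundaries. */
    static List<long[]> computeSegments(byte[] data, int parallelism) {
        List<long[]> segments = new ArrayList<>();
        long nominal = Math.max(1, data.length / parallelism); // nominal split distance
        int start = 0; // first segment starts at offset 0
        while (start < data.length) {
            int end = (int) Math.min(data.length, start + nominal);
            // Probe forward from the nominal split point to the next record
            // boundary, so each segment ends just after a '\n'.
            while (end < data.length && data[end - 1] != '\n') {
                end++;
            }
            segments.add(new long[] { start, end - start });
            start = end; // next segment starts at the boundary just found
        }
        return segments;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        for (long[] seg : computeSegments(data, 3)) {
            System.out.println(seg[0] + "," + seg[1]); // 0,6  6,5  11,6
        }
    }
}
```

Because every probe only moves a split forward to the next record boundary, the segments partition the buffer exactly: each record lands in precisely one segment, which is what lets each segment be parsed independently.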