Class ApproximationPlan

java.lang.Object
org.elasticsearch.xpack.esql.approximation.ApproximationPlan

public class ApproximationPlan extends Object
The approximation plan that is substituted during logical plan optimization by the rule SubstituteApproximationPlan.

See the Javadocs of Approximation for more details.

  • Field Details

    • BUCKET_ID_COLUMN_NAME

      public static final String BUCKET_ID_COLUMN_NAME
      The column name for the bucket ID in the sampled aggregate. The bucket ID assigns each sampled row to a bucket, so that confidence intervals can be computed from the per-bucket results.
    • CONFIDENCE_INTERVAL_COLUMN_PREFIX

      public static final String CONFIDENCE_INTERVAL_COLUMN_PREFIX
      Prefix for confidence interval column names in the approximation output.
    • CERTIFIED_COLUMN_PREFIX

      public static final String CERTIFIED_COLUMN_PREFIX
      Prefix for certified column names in the approximation output.
    • BUCKET_COUNT

      public static final int BUCKET_COUNT
      The number of buckets to use for computing confidence intervals.
  • Constructor Details

    • ApproximationPlan

      public ApproximationPlan()
  • Method Details

    • columnMetadata

      public static Map<String,Object> columnMetadata(Attribute column)
      Returns the _meta map for an approximation column, or null if the column name does not match an approximation pattern.
    • is

      public static boolean is(LogicalPlan logicalPlan)
      Returns whether the logical plan is an approximation plan.
    • get

      public static LogicalPlan get(LogicalPlan logicalPlan, ApproximationSettings settings)
      Returns a plan that approximates the original plan and computes confidence intervals. This approximation query consists of the following:
      • Source command
      • SAMPLE with an ApproximationPlan.SampleProbabilityPlaceHolder for the sample probability
      • All commands before the STATS command
      • EVAL adding a new column with random bucket IDs for each trial
      • STATS command with:
        • COUNT to track the sample size
        • Each aggregate function replaced by a sample-corrected version (if needed)
        • TRIAL_COUNT * BUCKET_COUNT additional columns with sampled values for each aggregate function, sample-corrected (if needed)
      • FILTER to remove all rows whose sample size is too small
      • All commands after the STATS command, modified to also process the additional bucket columns where possible
      • EVAL to compute confidence intervals for all fields with buckets
      • PROJECT to drop all non-output columns
      As an example, the simple query:
           
               FROM index
                   | EVAL x = 2*x
                   | STATS s = SUM(x) BY group
                   | EVAL t = s*s
           
       
      is rewritten to (prob=sampleProbability, T=trialCount, B=bucketCount):
           
               FROM index
                   | EVAL x = 2*x
                   | EVAL bucketId = MV_APPEND(RANDOM(B), ... , RANDOM(B))  // T times
                   | SAMPLED_STATS[SampleProbabilityPlaceHolder]
                           sampleSize = COUNT(*),
                           s = SUM(x),
                           `s$0` = SUM(x) WHERE MV_SLICE(bucketId, 0, 0) == 0,
                           ...,
                           `s$T*B-1` = SUM(x) WHERE MV_SLICE(bucketId, T-1, T-1) == B-1
                     BY group
                   | WHERE sampleSize >= MIN_ROW_COUNT_FOR_RESULT_INCLUSION / prob
                   | EVAL t = s*s, `t$0` = `s$0`*`s$0`, ..., `t$T*B-1` = `s$T*B-1`*`s$T*B-1`
                   | EVAL `CONFIDENCE_INTERVAL(s)` = CONFIDENCE_INTERVAL(s, MV_APPEND(`s$0`, ..., `s$T*B-1`), T, B, 0.90),
                          `CONFIDENCE_INTERVAL(t)` = CONFIDENCE_INTERVAL(t, MV_APPEND(`t$0`, ..., `t$T*B-1`), T, B, 0.90)
                   | KEEP s, t, `CONFIDENCE_INTERVAL(s)`, `CONFIDENCE_INTERVAL(t)`
           
       
      During execution, the SAMPLED_STATS is handled on the data node in one of two ways: either it is replaced by sampling of the source rows followed by a normal STATS (with sample corrections applied to the intermediate state), or, if possible, it is pushed down to Lucene without any sampling.
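      The sampling and bucketing steps above can be illustrated with a small standalone sketch for a single SUM without grouping. It is not the ES|QL implementation: the class and method names are hypothetical, the bucket correction (bucket sum * B / prob) is an assumption, and the interval is a simple batch-means normal approximation standing in for CONFIDENCE_INTERVAL, whose actual algorithm is not described here.

```java
import java.util.Random;

// Hypothetical illustration only; none of these names exist in Elasticsearch.
class BucketedSumSketch {

    /** Point estimate, interval bounds, and the number of sampled rows. */
    public record Estimate(double value, double lower, double upper, long sampleSize) {}

    /**
     * Estimates SUM(values) from a random sample.
     * Requires 0 < prob <= 1, trials >= 1 and buckets >= 2.
     */
    public static Estimate approximateSum(double[] values, double prob, int trials, int buckets, long seed) {
        Random random = new Random(seed);
        double sampleSum = 0.0;
        long sampleSize = 0;
        // bucketSums[t][b] plays the role of column `s$(t*B+b)` in the rewritten query.
        double[][] bucketSums = new double[trials][buckets];

        for (double x : values) {
            if (random.nextDouble() >= prob) {
                continue; // row is not part of the sample
            }
            sampleSize++;
            sampleSum += x;
            // One independent random bucket ID per trial, like MV_APPEND(RANDOM(B), ...).
            for (int t = 0; t < trials; t++) {
                bucketSums[t][random.nextInt(buckets)] += x;
            }
        }

        // Sample correction (assumption): scale the sampled sum by 1 / prob.
        double estimate = sampleSum / prob;

        // Per trial, each bucket yields its own estimate of the total.
        // The spread of these estimates gives a standard error for their mean.
        double standardErrorSum = 0.0;
        for (int t = 0; t < trials; t++) {
            double mean = 0.0;
            for (int b = 0; b < buckets; b++) {
                mean += bucketSums[t][b] * buckets / prob;
            }
            mean /= buckets;
            double squaredDeviations = 0.0;
            for (int b = 0; b < buckets; b++) {
                double deviation = bucketSums[t][b] * buckets / prob - mean;
                squaredDeviations += deviation * deviation;
            }
            double variance = squaredDeviations / (buckets - 1);
            standardErrorSum += Math.sqrt(variance / buckets);
        }
        double standardError = standardErrorSum / trials;

        double z = 1.645; // two-sided 90% quantile of the standard normal distribution
        return new Estimate(estimate, estimate - z * standardError, estimate + z * standardError, sampleSize);
    }
}
```

      For example, BucketedSumSketch.approximateSum(values, 0.1, 2, 16, 42L) samples roughly 10% of the rows and uses 2 trials of 16 buckets each. The returned sampleSize corresponds to the COUNT column that the FILTER step compares against its minimum.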
    • substituteSampleProbability

      public static LogicalPlan substituteSampleProbability(LogicalPlan logicalPlan, double sampleProbability)
      Replaces the ApproximationPlan.SampleProbabilityPlaceHolder in the approximation plan with the actual sample probability. If the sample probability is 1.0, the SampledAggregate is also replaced by a regular Aggregate.
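      The substitution described above can be sketched as a recursive rewrite of a toy plan tree. All types below (Plan, Source, Filter, Aggregate, SampledAggregate, Placeholder, Literal) are hypothetical stand-ins defined only for this illustration. They are not the ES|QL LogicalPlan classes, and the real method may handle more node types and more placeholder locations.

```java
// Hypothetical illustration only; these are not the Elasticsearch plan classes.
class PlaceholderSubstitutionSketch {

    interface Plan {}
    interface Probability {}

    record Placeholder() implements Probability {}
    record Literal(double value) implements Probability {}

    record Source(String index) implements Plan {}
    record Filter(Plan child, String condition) implements Plan {}
    record Aggregate(Plan child, String aggregation) implements Plan {}
    record SampledAggregate(Plan child, String aggregation, Probability probability) implements Plan {}

    /** Returns a copy of the plan in which the placeholder is replaced by the given probability. */
    static Plan substitute(Plan plan, double sampleProbability) {
        if (plan instanceof SampledAggregate sampled) {
            Plan child = substitute(sampled.child(), sampleProbability);
            if (sampleProbability == 1.0) {
                // Nothing is sampled, so a regular aggregate is sufficient.
                return new Aggregate(child, sampled.aggregation());
            }
            return new SampledAggregate(child, sampled.aggregation(), new Literal(sampleProbability));
        }
        if (plan instanceof Filter filter) {
            return new Filter(substitute(filter.child(), sampleProbability), filter.condition());
        }
        if (plan instanceof Aggregate aggregate) {
            return new Aggregate(substitute(aggregate.child(), sampleProbability), aggregate.aggregation());
        }
        return plan; // leaf nodes such as Source are returned unchanged
    }
}
```

      Because the nodes are immutable records, the rewrite builds a new tree and leaves the input plan untouched, so the same plan can be substituted again with a different probability.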