Class Approximation

java.lang.Object
org.elasticsearch.xpack.esql.approximation.Approximation

public class Approximation extends Object
This class computes approximate results for certain classes of ES|QL queries. Approximate results are usually much faster to compute than exact results.

A query is currently suitable for approximation if:

  • it contains exactly one STATS command
  • the other processing commands are from the supported set (SUPPORTED_COMMANDS); this set contains almost all unary commands but, most notably, neither FORK nor JOIN
  • the aggregate functions are from the supported set (SUPPORTED_SINGLE_VALUED_AGGS and SUPPORTED_MULTIVALUED_AGGS)
Some of these restrictions may be lifted in the future.

When these conditions are met, the query is replaced by an approximation query that samples documents before the STATS, and extrapolates the aggregate functions if needed. This new logical plan is generated by ApproximationPlan.get(org.elasticsearch.xpack.esql.plan.logical.LogicalPlan, org.elasticsearch.xpack.esql.approximation.ApproximationSettings). The substitution of the original query by the approximation query happens during logical plan optimization, in the rule SubstituteApproximationPlan.

In addition to approximate results, confidence intervals are computed. This is done by dividing the sampled rows ApproximationPlan.TRIAL_COUNT times into ApproximationPlan.BUCKET_COUNT buckets, computing the aggregate functions for each bucket, and using these sampled values to compute confidence intervals with the bias-corrected and accelerated (BCa) bootstrap method; see also ConfidenceInterval.
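
The bucketing step can be illustrated with a minimal sketch. It is not the ApproximationPlan implementation: the class and method names are hypothetical, rows are assigned to buckets with a seeded java.util.Random (an assumption; the real plan may assign buckets differently), the mean stands in for an arbitrary aggregate function, and the BCa computation itself is left out.

```java
import java.util.Random;

// Hypothetical illustration; not part of the Elasticsearch code base.
final class BucketingSketch {
    private BucketingSketch() {}

    /**
     * Partitions the sampled values into bucketCount buckets, trialCount times,
     * and returns the mean of each bucket as means[trial][bucket].
     * An empty bucket yields NaN.
     */
    static double[][] bucketMeans(double[] sampledValues, int trialCount, int bucketCount, long seed) {
        Random random = new Random(seed); // assumption: random bucket assignment
        double[][] means = new double[trialCount][bucketCount];
        for (int trial = 0; trial < trialCount; trial++) {
            double[] sums = new double[bucketCount];
            long[] counts = new long[bucketCount];
            for (double value : sampledValues) {
                int bucket = random.nextInt(bucketCount);
                sums[bucket] += value;
                counts[bucket]++;
            }
            for (int b = 0; b < bucketCount; b++) {
                means[trial][b] = counts[b] == 0 ? Double.NaN : sums[b] / counts[b];
            }
        }
        return means;
    }
}
```

The spread of the per-bucket values across all trials is the kind of input a bootstrap confidence interval method would then consume.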

The initial approximation plan contains a placeholder for the sample probability, which is determined during subplan execution and is based on the result set size. To obtain an appropriate sample probability, first a target number of rows is set. This is determined via the ApproximationSettings. Next, the total number of rows in the source index is counted via the subplan sourceCountSubPlan(). This plan always executes quickly. When there are no commands that can change the number of rows, the sample probability can be directly computed as the ratio of the target number of rows to this total number.
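
As a rough illustration of the direct case, the ratio can be written as follows. The class and method names are hypothetical, and capping the result at 1.0 (no sampling when the index holds fewer rows than the target) is an assumption rather than documented behavior.

```java
// Hypothetical helper; not part of the Elasticsearch API.
final class SampleProbabilitySketch {
    private SampleProbabilitySketch() {}

    /**
     * Returns targetRowCount / totalRowCount, capped at 1.0.
     * Assumption: an empty source index is treated as "no sampling needed".
     */
    static double sampleProbability(long targetRowCount, long totalRowCount) {
        if (targetRowCount <= 0) {
            throw new IllegalArgumentException("targetRowCount must be positive");
        }
        if (totalRowCount <= 0) {
            return 1.0;
        }
        return Math.min(1.0, (double) targetRowCount / totalRowCount);
    }
}
```

For example, a target of 100,000 rows over a source index of 10,000,000 rows gives a sample probability of 0.01.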

In the presence of commands that can change the number of rows (e.g. filtering), another step is needed. The first goal is to find a sample probability that leads to approximately ROW_COUNT_FOR_COUNT_ESTIMATION rows, and when this probability is found, a sample probability leading to the target number of rows is computed.

This is done by setting the initial sample probability to the ratio of ROW_COUNT_FOR_COUNT_ESTIMATION to the total number of rows in the source index and sampling a number of rows with the subplan countSubPlan(double). As long as the sampled number of rows is too small, the probability is increased until a good probability is reached. This final probability is then used to compute the probability used for approximating the original query.
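
The search described above might look roughly like the sketch below. Everything in it is illustrative: the class, method and parameter names are hypothetical, the counting subplan is modeled as a plain function from probability to sampled row count, and the growth factor, the "too small" threshold and the final scaling formula are assumptions, since this page does not specify them.

```java
import java.util.function.DoubleToLongFunction;

// Hypothetical illustration; not the Approximation implementation.
final class ProbabilitySearchSketch {
    private ProbabilitySearchSketch() {}

    static double findSampleProbability(
        long totalRows,
        long rowsForEstimation,   // plays the role of ROW_COUNT_FOR_COUNT_ESTIMATION
        long minSampledRows,      // assumption: below this, the count is "too small"
        long targetRows,
        double growthFactor,      // assumption: multiplicative increase per step
        DoubleToLongFunction countAtProbability // stands in for running countSubPlan(double)
    ) {
        if (totalRows <= 0) {
            return 1.0;
        }
        if (growthFactor <= 1.0) {
            throw new IllegalArgumentException("growthFactor must be greater than 1");
        }
        // Start from the ratio of the estimation row count to the total row count.
        double p = Math.min(1.0, (double) rowsForEstimation / totalRows);
        long sampled = countAtProbability.applyAsLong(p);
        // Increase the probability while too few rows survive the query's commands.
        while (sampled < minSampledRows && p < 1.0) {
            p = Math.min(1.0, p * growthFactor);
            sampled = countAtProbability.applyAsLong(p);
        }
        if (sampled <= 0) {
            return 1.0; // nothing to extrapolate from; fall back to no sampling
        }
        // Assumption: scale linearly so that about targetRows rows are sampled.
        return Math.min(1.0, p * targetRows / sampled);
    }
}
```

With a filter that keeps 1% of 10,000,000 rows, an estimation row count of 10,000, a threshold of 1,000 rows and a growth factor of 10, the sketch moves from probability 0.001 (100 rows) to 0.01 (1,000 rows) and then scales to 0.5 for a target of 50,000 rows.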

  • Method Details

    • create

      public static Approximation create(LogicalPlan logicalPlan, ApproximationSettings approximationSettings)
      Creates an Approximation object for a logical plan if it's an approximation plan, and returns null otherwise.
    • verifyPlan

      public static Approximation.QueryProperties verifyPlan(LogicalPlan logicalPlan)
      Verifies that a plan is suitable for approximation.
      Returns:
      the query properties relevant for approximation if it's suitable, or null otherwise. Adds warning headers as a side effect when the plan is not suitable.
    • firstSubPlan

      public LogicalPlan firstSubPlan()
      Returns the first subplan to execute for approximation, or null if the main plan can be executed directly.
    • newMainPlan

      public LogicalPlan newMainPlan(Result result)
      Returns the new main plan to execute for approximation after executing a subplan, based on the result of the subplan.