Skip to content

EXPLAIN ANALYZE support in distributed planner #476

@kurtvolmar

Description

@kurtvolmar

Description

When the DistributedQueryPlanner is used, results of EXPLAIN ANALYZE queries do not show the execution plan with metrics. The distributed planner wraps the entire ExecutionPlan with a DistributedExec. If the head of the plan is an AnalyzeExec, it will still be wrapped with DistributedExec, causing the output to be query results, not the execution plan with metrics.

In my personal fork, I have replicated this issue in a simple integration test: kurtvolmar@2da5ee2

Below I'll also put the query, execution plan, and execution results.

SQL Query - simple EXPLAIN ANALYZE query

EXPLAIN ANALYZE SELECT count(*), "RainToday" FROM weather GROUP BY "RainToday" ORDER BY count(*)

Distributed execution plan - note that DistributedExec wraps AnalyzeExec

        ┌───── DistributedExec ── Tasks: t0:[p0]
        │ AnalyzeExec verbose=false
        │   ProjectionExec: expr=[count(*)@0 as count(*), RainToday@1 as RainToday]
        │     SortPreservingMergeExec: [count(Int64(1))@2 ASC NULLS LAST]
        │       [Stage 2] => NetworkCoalesceExec: output_partitions=6, input_tasks=2
        └──────────────────────────────────────────────────
          ┌───── Stage 2 ── Tasks: t0:[p0..p2] t1:[p0..p2]
          │ SortExec: expr=[count(*)@0 ASC NULLS LAST], preserve_partitioning=[true]
          │   ProjectionExec: expr=[count(Int64(1))@1 as count(*), RainToday@0 as RainToday, count(Int64(1))@1 as count(Int64(1))]
          │     AggregateExec: mode=FinalPartitioned, gby=[RainToday@0 as RainToday], aggr=[count(Int64(1))]
          │       [Stage 1] => NetworkShuffleExec: output_partitions=3, input_tasks=3
          └──────────────────────────────────────────────────
            ┌───── Stage 1 ── Tasks: t0:[p0..p5] t1:[p0..p5] t2:[p0..p5]
            │ RepartitionExec: partitioning=Hash([RainToday@0], 6), input_partitions=1
            │   AggregateExec: mode=Partial, gby=[RainToday@0 as RainToday], aggr=[count(Int64(1))]
            │     PartitionIsolatorExec: tasks=3 partitions=3
            │       DataSourceExec: file_groups={3 groups: [[/testdata/weather/result-000000.parquet], [/testdata/weather/result-000001.parquet], [/testdata/weather/result-000002.parquet]]}, projection=[RainToday], file_type=parquet
            └──────────────────────────────────────────────────

Distributed execution results:

+----------+-----------+
| count(*) | RainToday |
+----------+-----------+
| 66       | Yes       |
| 300      | No        |
+----------+-----------+

I would expect the distributed execution results to be EXPLAIN ANALYZE output, rather than query results.

Next Steps

I have built a client-side workaround to be able to execute EXPLAIN ANALYZE, but it requires some tedious extra steps of plan manipulation and calling the library-provided explain_analyze function. Unfortunately, this workaround is currently broken with the recent changes to distributed planning.

My goal would be to have EXPLAIN ANALYZE work out of the box when using datafusion-distributed, rather than having to work around it. There are a number of ways this could be accomplished, but I wanted to open this issue to discuss whether this is a feature the library would want. I am also happy to take a pass at implementation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions