Word Cloud unconditional secondary sort prevents Druid TopN optimization, causing query timeout #39072

@bdonovan1

Bug description

Description

When sort_by_metric is enabled on a Word Cloud chart, buildQuery.ts unconditionally appends a secondary ORDER BY [series, ASC] alongside the metric sort. On Apache Druid, any multi-column ORDER BY prevents the native TopN query optimization, forcing a full GroupBy scan over the entire dataset before applying LIMIT. On high-cardinality dimensions this can cause dramatic query slowdowns and timeouts.

Root Cause

In superset-frontend/plugins/plugin-chart-word-cloud/src/plugin/buildQuery.ts:

if (sort_by_metric && metric) {
  orderby.push([metric, false]);
}
if (series) {
  orderby.push([series, true]);  // ← always added, even when sort_by_metric is true
}

When sort_by_metric=true, this generates:

ORDER BY term_count DESC, search_term ASC

Druid's TopN algorithm requires ordering by a single aggregate metric. The secondary dimension sort forces the full GroupBy execution path regardless of dataset size.
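To make the effect of the two independent `if` blocks concrete, here is a minimal standalone sketch. It is not the plugin's actual code: `currentOrderBy` and `toOrderByClause` are hypothetical helpers, and the metric is simplified to a plain string (in Superset it may also be an ad-hoc metric object). It mirrors the logic quoted above and renders the resulting entries as an ORDER BY clause.

```typescript
// Assumed shape of an orderby entry, as in the snippet above: [name, ascending?].
type OrderByEntry = [string, boolean];

// Hypothetical helper mirroring the current (unfixed) logic in buildQuery.ts.
function currentOrderBy(
  series: string | undefined,
  metric: string | undefined,
  sortByMetric: boolean,
): OrderByEntry[] {
  const orderby: OrderByEntry[] = [];
  if (sortByMetric && metric) {
    orderby.push([metric, false]); // metric, descending
  }
  if (series) {
    orderby.push([series, true]); // appended regardless of sortByMetric
  }
  return orderby;
}

// Hypothetical helper: renders entries as a SQL ORDER BY clause for inspection.
function toOrderByClause(orderby: OrderByEntry[]): string {
  if (orderby.length === 0) return '';
  const terms = orderby.map(([name, asc]) => `${name} ${asc ? 'ASC' : 'DESC'}`);
  return `ORDER BY ${terms.join(', ')}`;
}

const clause = toOrderByClause(
  currentOrderBy('search_term', 'term_count', true),
);
console.log(clause); // ORDER BY term_count DESC, search_term ASC
```

With `sortByMetric` set to `false`, the same helper yields a single-column sort on the series, which matches the case reported as loading correctly.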

Steps to Reproduce

  1. Create a Word Cloud chart backed by a Druid datasource with a high-cardinality string dimension
  2. Enable Sort by metric
  3. Load the chart. On large datasets the query times out and the chart shows "Unknown error (Unknown)"

Without "Sort by metric", the chart loads correctly (Druid uses TopN). With it enabled, the same query triggers a full GroupBy scan.

Proposed Fix

Make the secondary dimension sort mutually exclusive with sort_by_metric:

if (sort_by_metric && metric) {
  orderby.push([metric, false]);
} else if (series) {
  orderby.push([series, true]);
}

This preserves alphabetical ordering when sort_by_metric is disabled, while allowing Druid to use TopN optimization when metric sorting is requested. The secondary sort is also unnecessary for word cloud rendering, since word size is determined by metric value rather than series order.
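As a rough illustration of the proposed behavior, the sketch below isolates the ordering decision in a standalone function. `fixedOrderBy` is a hypothetical name, and the metric is again simplified to a string; the real change would live inside `buildQuery.ts` and work with the form data types used there.

```typescript
// Assumed shape of an orderby entry: [name, ascending?].
type OrderByEntry = [string, boolean];

// Hypothetical helper implementing the proposed mutually exclusive logic.
function fixedOrderBy(
  series: string | undefined,
  metric: string | undefined,
  sortByMetric: boolean,
): OrderByEntry[] {
  if (sortByMetric && metric) {
    // Single metric sort only, so a TopN-style plan remains possible.
    return [[metric, false]];
  }
  if (series) {
    // Fallback: alphabetical order on the series dimension.
    return [[series, true]];
  }
  return [];
}

const withMetricSort = fixedOrderBy('search_term', 'term_count', true);
const withoutMetricSort = fixedOrderBy('search_term', 'term_count', false);

console.log(JSON.stringify(withMetricSort)); // [["term_count",false]]
console.log(JSON.stringify(withoutMetricSort)); // [["search_term",true]]
```

The two calls correspond to the two cases a test for this change would likely need to assert: one orderby entry on the metric when metric sorting is on, and one entry on the series when it is off.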

Additional Notes

The existing test/buildQuery.test.ts does not cover sort_by_metric or orderby behavior. A PR for this fix should add test coverage for both cases.

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
