feat: Add first round of Warpstream topics to Clickhouse#53820

Open
jose-sequeira wants to merge 4 commits into master from jose-sequeira/warpstream-kafka-engines

Conversation

@jose-sequeira
Contributor

Problem

ClickHouse Kafka engine tables for log_entries, clickhouse_app_metrics2, and clickhouse_tophog currently only exist for MSK. We need equivalent tables reading from the same topics via a WarpStream cluster to support the WarpStream migration.

Changes

  • Renamed CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION to CLICKHOUSE_KAFKA_WARPSTREAM_INGESTION_NAMED_COLLECTION for clarity, since it refers specifically to the ingestion cluster
  • Added WarpStream consumer group constants (_ws suffix) to avoid conflicts with existing MSK consumer groups
  • Added WarpStream Kafka engine tables and materialized views for log_entries, app_metrics2, and tophog that coexist alongside the existing MSK tables and write to the same target tables
  • Added ClickHouse migration 0227 to create all 6 new tables, matching the node roles of their MSK counterparts (INGESTION_SMALL for log_entries, INGESTION_MEDIUM for app_metrics2 and tophog)
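As a rough illustration of the pattern described above, the WarpStream variants might look like this (all names, topics, and engine settings here are hypothetical stand-ins; the real definitions live in the PR's `posthog/models/*/sql.py` files and the named collection in `posthog/settings/data_stores.py`):

```python
# Hypothetical sketch of the WarpStream table/constant pattern; the actual
# names, topics, and engine settings are defined in the PR itself.
KAFKA_LOG_ENTRIES_TABLE_NAME = "kafka_log_entries"        # existing MSK engine table (assumed name)
KAFKA_LOG_ENTRIES_WS_TABLE_NAME = "kafka_log_entries_ws"  # new WarpStream engine table (assumed name)

# Consumer groups get a `_ws` suffix so WarpStream offsets never collide with MSK's.
LOG_ENTRIES_CONSUMER_GROUP = "clickhouse-log-entries"
LOG_ENTRIES_WS_CONSUMER_GROUP = LOG_ENTRIES_CONSUMER_GROUP + "_ws"


def KAFKA_LOG_ENTRIES_WS_TABLE_SQL() -> str:
    # Credentials come from the ClickHouse named collection, not from the SQL.
    return f"""
CREATE TABLE IF NOT EXISTS {KAFKA_LOG_ENTRIES_WS_TABLE_NAME}
ENGINE = Kafka(warpstream_ingestion,
               kafka_topic_list = 'log_entries',
               kafka_group_name = '{LOG_ENTRIES_WS_CONSUMER_GROUP}',
               kafka_format = 'JSONEachRow')
"""
```

Both the MSK and WarpStream engine tables then feed materialized views writing to the same target table, so the migration can cut over by topic without duplicating storage.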

How did you test this code?


Publish to changelog?

Docs update

@jose-sequeira jose-sequeira requested a review from a team as a code owner April 9, 2026 06:43
@greptile-apps
Contributor

greptile-apps bot commented Apr 9, 2026

Vulnerabilities

No security concerns identified. The new tables read from Kafka topics using named collections (no credentials in SQL), consumer groups are namespaced to avoid conflicts, and the changes follow the same auth/connection patterns as the existing MSK tables.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: posthog/settings/data_stores.py
Line: 327-330

Comment:
**Silent env var rename breaks existing deployments**

The environment variable key changed from `CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION` to `CLICKHOUSE_KAFKA_WARPSTREAM_INGESTION_NAMED_COLLECTION`. Any deployment that already has `CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION` set (e.g. in Kubernetes secrets or Terraform) will silently fall back to the default `"warpstream_ingestion"` — the old var is simply ignored. If the deployed value differs from the default, `distinct_id_usage`'s Kafka engine table (the only current consumer) will connect to the wrong named collection without any error.

Consider reading both names and picking the old one as a fallback if the new one is unset, or coordinating the env-var rename with infra config changes in the same deployment.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: posthog/models/tophog/sql.py
Line: 145-164

Comment:
**Duplicate MV function (OnceAndOnlyOnce)**

`TOPHOG_WS_MV_SQL` and `TOPHOG_MV_SQL` are identical except for which name constants they inject (`WS_MV_NAME`/`KAFKA_WS_TABLE_NAME` vs `MV_NAME`/`KAFKA_TABLE_NAME`). Both functions could be collapsed into the existing one by accepting `mv_name` and `kafka_table` parameters, eliminating the duplication entirely. The same pattern applies to `LOG_ENTRIES_WS_MV_SQL` vs `LOG_ENTRIES_V3_TABLE_MV_SQL` in `log_entries.py`, and `APP_METRICS2_WS_MV_TABLE_SQL` vs `APP_METRICS2_MV_TABLE_SQL` in `app_metrics2/sql.py`.

```python
# Example refactor for tophog:
def TOPHOG_MV_SQL(
    target_table: str = WRITABLE_TABLE_NAME,
    mv_name: str = MV_NAME,
    kafka_table: str = KAFKA_TABLE_NAME,
):
    return """
CREATE MATERIALIZED VIEW IF NOT EXISTS {mv_name}
TO {target_table}
AS SELECT ...
FROM {kafka_table}
""".format(mv_name=mv_name, target_table=target_table, kafka_table=kafka_table)

# WS variant becomes a thin wrapper:
def TOPHOG_WS_MV_SQL(target_table: str = WRITABLE_TABLE_NAME):
    return TOPHOG_MV_SQL(
        target_table=target_table, mv_name=WS_MV_NAME, kafka_table=KAFKA_WS_TABLE_NAME
    )
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat: Add first round of Warpstream topi..."

Comment on lines 327 to +330

```diff
 CLICKHOUSE_KAFKA_NAMED_COLLECTION: str = os.getenv("CLICKHOUSE_KAFKA_NAMED_COLLECTION", "msk_cluster")
-CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION: str = os.getenv(
-    "CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION", "warpstream_ingestion"
-)
+CLICKHOUSE_KAFKA_WARPSTREAM_INGESTION_NAMED_COLLECTION: str = os.getenv(
+    "CLICKHOUSE_KAFKA_WARPSTREAM_INGESTION_NAMED_COLLECTION", "warpstream_ingestion"
+)
```
Contributor


P1 Silent env var rename breaks existing deployments

The environment variable key changed from CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION to CLICKHOUSE_KAFKA_WARPSTREAM_INGESTION_NAMED_COLLECTION. Any deployment that already has CLICKHOUSE_KAFKA_WARPSTREAM_NAMED_COLLECTION set (e.g. in Kubernetes secrets or Terraform) will silently fall back to the default "warpstream_ingestion" — the old var is simply ignored. If the deployed value differs from the default, distinct_id_usage's Kafka engine table (the only current consumer) will connect to the wrong named collection without any error.

Consider reading both names and picking the old one as a fallback if the new one is unset, or coordinating the env-var rename with infra config changes in the same deployment.


Contributor Author


@orian this variable is not defined in any chart, so it always falls back to the default.

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

🎭 Playwright report · View test results →

⚠️ 1 flaky test:

  • Materialize view pane (chromium)

These issues are not necessarily caused by your changes.

@tests-posthog
Contributor

tests-posthog bot commented Apr 9, 2026

Query snapshots: Backend query snapshots updated

Changes: 1 snapshots (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

Contributor

@orian orian left a comment


Please take a look at the Greptile comment.
