An end-to-end streaming change-data-capture (CDC) pipeline that captures every
change from an OLTP e-commerce database in real time, processes the stream with
Apache Flink (event-time windows, joins, anomaly detection), and materializes
the results into a ClickHouse OLAP warehouse visualized live in Grafana. The
whole stack — 100% synthetic data, fully isolated — runs locally with a single
docker compose up.
The Grafana dashboard reads ClickHouse and refreshes every few seconds — revenue per minute, orders per minute, top products, anomaly alerts, and revenue by country:
📸 See the full visual walkthrough → — screenshots of every stage (Debezium connector, CDC events, Avro schemas, Flink jobs, ClickHouse, Grafana) with explanations.
| Layer | Technology | Role |
|---|---|---|
| OLTP source | PostgreSQL 16 (wal_level=logical) |
Transactional e-commerce database |
| Data generator | Python (Faker, psycopg2) | Continuous synthetic activity |
| CDC | Debezium (Kafka Connect, pgoutput) |
Capture changes from the WAL |
| Streaming | Apache Kafka (KRaft) + Schema Registry | Event transport + Avro schemas |
| Stream processing | Apache Flink 1.19 (SQL) | Windows, joins, anomaly detection |
| OLAP warehouse | ClickHouse | Fast analytical queries |
| Visualization | Grafana | Real-time dashboard |
| Containerization | Docker Compose | Reproducible local stack |
| CI | GitHub Actions | Lint + config validation |
Requirements: Docker Desktop (≈6–8 GB free RAM for the full stack).
# 1. Configuration (no secrets in git — only .env.example is committed)
cp .env.example .env
# 2. Bring up the whole stack (first run builds the Connect and Flink images)
docker compose up -d --build
# Postgres schema and ClickHouse OLAP tables are created automatically on
# first start; the data generator begins producing activity immediately.
# 3. Register the Debezium connector (waits for Connect, then registers)
bash debezium/register-connector.sh
# 4. Submit the Flink jobs
docker compose exec flink-jobmanager ./bin/sql-client.sh -f /opt/sql/job_a_revenue.sql
docker compose exec flink-jobmanager ./bin/sql-client.sh -f /opt/sql/job_c_anomaly.sql
bash flink/submit-job.sh job_b_top_products.sql # injects DB creds via envsubst
bash flink/submit-job.sh job_d_enrichment.sql # enrichment lookup join
# 5. Open Grafana → dashboard "Real-Time E-Commerce CDC" (auto-refresh 5s)
# http://localhost:3000 (anonymous viewing enabled; admin login from .env)Bring-up order matters and is enforced by healthchecks/depends_on:
Postgres → Kafka/Connect → register connector → Flink jobs → ClickHouse → Grafana.
| Service | URL | Notes |
|---|---|---|
| Grafana | http://localhost:3000 | Real-time dashboard |
| Flink Web UI | http://localhost:8082 | Jobs, checkpoints, backpressure |
| kafka-ui | http://localhost:8080 | Topics, messages, schemas, connectors |
| Kafka Connect REST | http://localhost:8087 | Connector management |
| Schema Registry | http://localhost:8081 | Avro schemas |
| ClickHouse | http://localhost:8123/play | SQL query UI |
| Job | File | Demonstrates | Sink |
|---|---|---|---|
| A — Revenue per minute | flink/sql/job_a_revenue.sql | event-time, watermarks, tumbling window | revenue_by_minute |
| B — Top products | flink/sql/job_b_top_products.sql | sliding (HOP) window, Window Top-N, JDBC lookup-join enrichment | top_products |
| C — Anomaly detection | flink/sql/job_c_anomaly.sql | filtering + windowed stateful aggregation | anomaly_alerts |
| D — Revenue by country | flink/sql/job_d_enrichment.sql | stream-table (lookup) join + tumbling window | orders_by_country |
- CDC vs polling — changes are read from the Postgres write-ahead log via
logical decoding (Debezium
pgoutput), not by polling tables. No load on the source, no missed intermediate states, low latency.REPLICA IDENTITY FULLmakes UPDATE/DELETE events carry the full before-image. - Schema evolution — events are Avro with schemas in the Schema Registry, enabling forward/backward-compatible evolution instead of brittle JSON.
- Flink fundamentals — event-time vs processing-time, watermarks, tumbling and sliding (HOP) windows, Window Top-N, keyed/windowed state, and a temporal lookup join for enrichment.
- Delivery semantics & idempotency — Kafka + ClickHouse give at-least-once
delivery; ClickHouse
ReplacingMergeTreededuplicates by sort key so replays do not create permanent duplicates. - OLTP vs OLAP — a normalized, write-optimized Postgres source is kept separate from a denormalized, read-optimized ClickHouse warehouse; each is tuned for its workload.
- Why CDC (Debezium + logical decoding)? Streaming the WAL captures every
state transition (e.g.
pending → paid → shipped → delivered) with minimal source overhead — impossible to do reliably with periodic batch polling. - Why Avro + Schema Registry? Compact binary encoding plus enforced, versioned schemas — the contract between producers and the streaming layer.
- Why Apache Flink? True event-time semantics with watermarks and rich
windowing/state are exactly what real-time aggregations and anomaly detection
need. The CDC topics are read with the
avro-confluentformat (notdebezium-avro-confluent): each change record becomes an append-onlyINSERTexposingafter/op/source, which is what event-time windowing requires. - Why ClickHouse? Column-oriented OLAP built for fast aggregations over large tables — ideal for serving live dashboards.
- Why Flink → Kafka → ClickHouse (not a direct sink)? There is no maintained Flink→ClickHouse SQL connector for Flink 1.19, so Flink writes results to Kafka and ClickHouse ingests them with its native Kafka table engine + materialized view — the vendor-recommended streaming-ingestion path, reusing the Kafka cluster already in the stack.
- Secrets — never committed. Postgres credentials reach Debezium via Kafka
Connect's
EnvVarConfigProvider, reach Flink lookup joins viaenvsubstat submit time, and reach Grafana's ClickHouse user via environment injection. Only.env.exampleis in git.
realtime-ecommerce-cdc/
├── docker-compose.yml # the whole stack, added one layer per phase
├── .env.example # config template (real .env is git-ignored)
├── generator/ # synthetic e-commerce data generator
│ ├── generate.py
│ ├── schema.sql # OLTP DDL (auto-run on Postgres init)
│ ├── requirements.txt
│ └── Dockerfile
├── debezium/ # CDC
│ ├── Dockerfile # Kafka Connect + Debezium Postgres connector
│ ├── register-postgres.json # connector config (no secrets)
│ └── register-connector.sh
├── flink/ # stream processing
│ ├── Dockerfile # Flink + Kafka/Avro/JDBC connectors
│ ├── submit-job.sh # submit with envsubst credential injection
│ └── sql/ # job_a / job_b / job_c / job_d
├── clickhouse/ # OLAP sink
│ ├── init/ # OLAP tables + Kafka engine + materialized views
│ └── config/grafana-user.xml # read-only Grafana user (password from env)
├── grafana/
│ ├── provisioning/ # datasource + dashboard providers
│ └── dashboards/ # real-time dashboard JSON
├── scripts/ # generate architecture diagram + capture screenshots
├── docs/ # OVERVIEW.md walkthrough + architecture & UI screenshots
└── .github/workflows/ci.yml # lint + config validation
Released under the MIT License.

