Skip to content

danmorcov88/realtime-ecommerce-cdc

Repository files navigation

Real-Time E-Commerce CDC Pipeline

CI

PostgreSQL Debezium Apache Kafka Apache Flink ClickHouse Grafana Docker

An end-to-end streaming change-data-capture (CDC) pipeline that captures every change from an OLTP e-commerce database in real time, processes the stream with Apache Flink (event-time windows, joins, anomaly detection), and materializes the results into a ClickHouse OLAP warehouse visualized live in Grafana. The whole stack — 100% synthetic data, fully isolated — runs locally with a single docker compose up.

Architecture

Architecture

Live dashboard

The Grafana dashboard reads ClickHouse and refreshes every few seconds — revenue per minute, orders per minute, top products, anomaly alerts, and revenue by country:

Grafana dashboard

📸 See the full visual walkthrough → — screenshots of every stage (Debezium connector, CDC events, Avro schemas, Flink jobs, ClickHouse, Grafana) with explanations.

Tech stack

Layer Technology Role
OLTP source PostgreSQL 16 (wal_level=logical) Transactional e-commerce database
Data generator Python (Faker, psycopg2) Continuous synthetic activity
CDC Debezium (Kafka Connect, pgoutput) Capture changes from the WAL
Streaming Apache Kafka (KRaft) + Schema Registry Event transport + Avro schemas
Stream processing Apache Flink 1.19 (SQL) Windows, joins, anomaly detection
OLAP warehouse ClickHouse Fast analytical queries
Visualization Grafana Real-time dashboard
Containerization Docker Compose Reproducible local stack
CI GitHub Actions Lint + config validation

Run locally

Requirements: Docker Desktop (≈6–8 GB free RAM for the full stack).

# 1. Configuration (no secrets in git — only .env.example is committed)
cp .env.example .env

# 2. Bring up the whole stack (first run builds the Connect and Flink images)
docker compose up -d --build
#    Postgres schema and ClickHouse OLAP tables are created automatically on
#    first start; the data generator begins producing activity immediately.

# 3. Register the Debezium connector (waits for Connect, then registers)
bash debezium/register-connector.sh

# 4. Submit the Flink jobs
docker compose exec flink-jobmanager ./bin/sql-client.sh -f /opt/sql/job_a_revenue.sql
docker compose exec flink-jobmanager ./bin/sql-client.sh -f /opt/sql/job_c_anomaly.sql
bash flink/submit-job.sh job_b_top_products.sql      # injects DB creds via envsubst
bash flink/submit-job.sh job_d_enrichment.sql        # enrichment lookup join

# 5. Open Grafana → dashboard "Real-Time E-Commerce CDC" (auto-refresh 5s)
#    http://localhost:3000  (anonymous viewing enabled; admin login from .env)

Bring-up order matters and is enforced by healthchecks/depends_on: Postgres → Kafka/Connect → register connector → Flink jobs → ClickHouse → Grafana.

Service endpoints

Service URL Notes
Grafana http://localhost:3000 Real-time dashboard
Flink Web UI http://localhost:8082 Jobs, checkpoints, backpressure
kafka-ui http://localhost:8080 Topics, messages, schemas, connectors
Kafka Connect REST http://localhost:8087 Connector management
Schema Registry http://localhost:8081 Avro schemas
ClickHouse http://localhost:8123/play SQL query UI

Stream processing jobs

Job File Demonstrates Sink
A — Revenue per minute flink/sql/job_a_revenue.sql event-time, watermarks, tumbling window revenue_by_minute
B — Top products flink/sql/job_b_top_products.sql sliding (HOP) window, Window Top-N, JDBC lookup-join enrichment top_products
C — Anomaly detection flink/sql/job_c_anomaly.sql filtering + windowed stateful aggregation anomaly_alerts
D — Revenue by country flink/sql/job_d_enrichment.sql stream-table (lookup) join + tumbling window orders_by_country

What this demonstrates

  • CDC vs polling — changes are read from the Postgres write-ahead log via logical decoding (Debezium pgoutput), not by polling tables. No load on the source, no missed intermediate states, low latency. REPLICA IDENTITY FULL makes UPDATE/DELETE events carry the full before-image.
  • Schema evolution — events are Avro with schemas in the Schema Registry, enabling forward/backward-compatible evolution instead of brittle JSON.
  • Flink fundamentals — event-time vs processing-time, watermarks, tumbling and sliding (HOP) windows, Window Top-N, keyed/windowed state, and a temporal lookup join for enrichment.
  • Delivery semantics & idempotency — Kafka + ClickHouse give at-least-once delivery; ClickHouse ReplacingMergeTree deduplicates by sort key so replays do not create permanent duplicates.
  • OLTP vs OLAP — a normalized, write-optimized Postgres source is kept separate from a denormalized, read-optimized ClickHouse warehouse; each is tuned for its workload.

Architecture decisions

  • Why CDC (Debezium + logical decoding)? Streaming the WAL captures every state transition (e.g. pending → paid → shipped → delivered) with minimal source overhead — impossible to do reliably with periodic batch polling.
  • Why Avro + Schema Registry? Compact binary encoding plus enforced, versioned schemas — the contract between producers and the streaming layer.
  • Why Apache Flink? True event-time semantics with watermarks and rich windowing/state are exactly what real-time aggregations and anomaly detection need. The CDC topics are read with the avro-confluent format (not debezium-avro-confluent): each change record becomes an append-only INSERT exposing after/op/source, which is what event-time windowing requires.
  • Why ClickHouse? Column-oriented OLAP built for fast aggregations over large tables — ideal for serving live dashboards.
  • Why Flink → Kafka → ClickHouse (not a direct sink)? There is no maintained Flink→ClickHouse SQL connector for Flink 1.19, so Flink writes results to Kafka and ClickHouse ingests them with its native Kafka table engine + materialized view — the vendor-recommended streaming-ingestion path, reusing the Kafka cluster already in the stack.
  • Secrets — never committed. Postgres credentials reach Debezium via Kafka Connect's EnvVarConfigProvider, reach Flink lookup joins via envsubst at submit time, and reach Grafana's ClickHouse user via environment injection. Only .env.example is in git.

Project structure

realtime-ecommerce-cdc/
├── docker-compose.yml            # the whole stack, added one layer per phase
├── .env.example                  # config template (real .env is git-ignored)
├── generator/                    # synthetic e-commerce data generator
│   ├── generate.py
│   ├── schema.sql                # OLTP DDL (auto-run on Postgres init)
│   ├── requirements.txt
│   └── Dockerfile
├── debezium/                     # CDC
│   ├── Dockerfile                # Kafka Connect + Debezium Postgres connector
│   ├── register-postgres.json    # connector config (no secrets)
│   └── register-connector.sh
├── flink/                        # stream processing
│   ├── Dockerfile                # Flink + Kafka/Avro/JDBC connectors
│   ├── submit-job.sh             # submit with envsubst credential injection
│   └── sql/                      # job_a / job_b / job_c / job_d
├── clickhouse/                   # OLAP sink
│   ├── init/                     # OLAP tables + Kafka engine + materialized views
│   └── config/grafana-user.xml   # read-only Grafana user (password from env)
├── grafana/
│   ├── provisioning/             # datasource + dashboard providers
│   └── dashboards/               # real-time dashboard JSON
├── scripts/                      # generate architecture diagram + capture screenshots
├── docs/                         # OVERVIEW.md walkthrough + architecture & UI screenshots
└── .github/workflows/ci.yml      # lint + config validation

License

Released under the MIT License.