Real-Time E-Commerce CDC Pipeline

An end-to-end streaming change-data-capture (CDC) pipeline that captures every change from an OLTP e-commerce database in real time, processes the stream with Apache Flink (event-time windows, joins, anomaly detection), and materializes the results into a ClickHouse OLAP warehouse visualized live in Grafana. The whole stack — 100% synthetic data, fully isolated — runs locally with a single docker compose up.

Architecture

Live dashboard

The Grafana dashboard reads ClickHouse and refreshes every few seconds — revenue per minute, orders per minute, top products, anomaly alerts, and revenue by country:

📸 See the full visual walkthrough → — screenshots of every stage (Debezium connector, CDC events, Avro schemas, Flink jobs, ClickHouse, Grafana) with explanations.

Tech stack

Layer	Technology	Role
OLTP source	PostgreSQL 16 (`wal_level=logical`)	Transactional e-commerce database
Data generator	Python (Faker, psycopg2)	Continuous synthetic activity
CDC	Debezium (Kafka Connect, `pgoutput`)	Capture changes from the WAL
Streaming	Apache Kafka (KRaft) + Schema Registry	Event transport + Avro schemas
Stream processing	Apache Flink 1.19 (SQL)	Windows, joins, anomaly detection
OLAP warehouse	ClickHouse	Fast analytical queries
Visualization	Grafana	Real-time dashboard
Containerization	Docker Compose	Reproducible local stack
CI	GitHub Actions	Lint + config validation

Run locally

Requirements: Docker Desktop (≈6–8 GB free RAM for the full stack).

# 1. Configuration (no secrets in git — only .env.example is committed)
cp .env.example .env

# 2. Bring up the whole stack (first run builds the Connect and Flink images)
docker compose up -d --build
#    Postgres schema and ClickHouse OLAP tables are created automatically on
#    first start; the data generator begins producing activity immediately.

# 3. Register the Debezium connector (waits for Connect, then registers)
bash debezium/register-connector.sh

# 4. Submit the Flink jobs
docker compose exec flink-jobmanager ./bin/sql-client.sh -f /opt/sql/job_a_revenue.sql
docker compose exec flink-jobmanager ./bin/sql-client.sh -f /opt/sql/job_c_anomaly.sql
bash flink/submit-job.sh job_b_top_products.sql      # injects DB creds via envsubst
bash flink/submit-job.sh job_d_enrichment.sql        # enrichment lookup join

# 5. Open Grafana → dashboard "Real-Time E-Commerce CDC" (auto-refresh 5s)
#    http://localhost:3000  (anonymous viewing enabled; admin login from .env)

Bring-up order matters and is enforced by healthchecks/depends_on: Postgres → Kafka/Connect → register connector → Flink jobs → ClickHouse → Grafana.

Service endpoints

Service	URL	Notes
Grafana	http://localhost:3000	Real-time dashboard
Flink Web UI	http://localhost:8082	Jobs, checkpoints, backpressure
kafka-ui	http://localhost:8080	Topics, messages, schemas, connectors
Kafka Connect REST	http://localhost:8087	Connector management
Schema Registry	http://localhost:8081	Avro schemas
ClickHouse	http://localhost:8123/play	SQL query UI

Stream processing jobs

Job	File	Demonstrates	Sink
A — Revenue per minute	flink/sql/job_a_revenue.sql	event-time, watermarks, tumbling window	`revenue_by_minute`
B — Top products	flink/sql/job_b_top_products.sql	sliding (HOP) window, Window Top-N, JDBC lookup-join enrichment	`top_products`
C — Anomaly detection	flink/sql/job_c_anomaly.sql	filtering + windowed stateful aggregation	`anomaly_alerts`
D — Revenue by country	flink/sql/job_d_enrichment.sql	stream-table (lookup) join + tumbling window	`orders_by_country`

What this demonstrates

CDC vs polling — changes are read from the Postgres write-ahead log via logical decoding (Debezium pgoutput), not by polling tables. No load on the source, no missed intermediate states, low latency. REPLICA IDENTITY FULL makes UPDATE/DELETE events carry the full before-image.
Schema evolution — events are Avro with schemas in the Schema Registry, enabling forward/backward-compatible evolution instead of brittle JSON.
Flink fundamentals — event-time vs processing-time, watermarks, tumbling and sliding (HOP) windows, Window Top-N, keyed/windowed state, and a temporal lookup join for enrichment.
Delivery semantics & idempotency — Kafka + ClickHouse give at-least-once delivery; ClickHouse ReplacingMergeTree deduplicates by sort key so replays do not create permanent duplicates.
OLTP vs OLAP — a normalized, write-optimized Postgres source is kept separate from a denormalized, read-optimized ClickHouse warehouse; each is tuned for its workload.

Architecture decisions

Why CDC (Debezium + logical decoding)? Streaming the WAL captures every state transition (e.g. pending → paid → shipped → delivered) with minimal source overhead — impossible to do reliably with periodic batch polling.
Why Avro + Schema Registry? Compact binary encoding plus enforced, versioned schemas — the contract between producers and the streaming layer.
Why Apache Flink? True event-time semantics with watermarks and rich windowing/state are exactly what real-time aggregations and anomaly detection need. The CDC topics are read with the avro-confluent format (not debezium-avro-confluent): each change record becomes an append-only INSERT exposing after/op/source, which is what event-time windowing requires.
Why ClickHouse? Column-oriented OLAP built for fast aggregations over large tables — ideal for serving live dashboards.
Why Flink → Kafka → ClickHouse (not a direct sink)? There is no maintained Flink→ClickHouse SQL connector for Flink 1.19, so Flink writes results to Kafka and ClickHouse ingests them with its native Kafka table engine + materialized view — the vendor-recommended streaming-ingestion path, reusing the Kafka cluster already in the stack.
Secrets — never committed. Postgres credentials reach Debezium via Kafka Connect's EnvVarConfigProvider, reach Flink lookup joins via envsubst at submit time, and reach Grafana's ClickHouse user via environment injection. Only .env.example is in git.

Project structure

realtime-ecommerce-cdc/
├── docker-compose.yml            # the whole stack, added one layer per phase
├── .env.example                  # config template (real .env is git-ignored)
├── generator/                    # synthetic e-commerce data generator
│   ├── generate.py
│   ├── schema.sql                # OLTP DDL (auto-run on Postgres init)
│   ├── requirements.txt
│   └── Dockerfile
├── debezium/                     # CDC
│   ├── Dockerfile                # Kafka Connect + Debezium Postgres connector
│   ├── register-postgres.json    # connector config (no secrets)
│   └── register-connector.sh
├── flink/                        # stream processing
│   ├── Dockerfile                # Flink + Kafka/Avro/JDBC connectors
│   ├── submit-job.sh             # submit with envsubst credential injection
│   └── sql/                      # job_a / job_b / job_c / job_d
├── clickhouse/                   # OLAP sink
│   ├── init/                     # OLAP tables + Kafka engine + materialized views
│   └── config/grafana-user.xml   # read-only Grafana user (password from env)
├── grafana/
│   ├── provisioning/             # datasource + dashboard providers
│   └── dashboards/               # real-time dashboard JSON
├── scripts/                      # generate architecture diagram + capture screenshots
├── docs/                         # OVERVIEW.md walkthrough + architecture & UI screenshots
└── .github/workflows/ci.yml      # lint + config validation

License

Released under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Real-Time E-Commerce CDC Pipeline

Architecture

Live dashboard

Tech stack

Run locally

Service endpoints

Stream processing jobs

What this demonstrates

Architecture decisions

Project structure

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github/workflows		.github/workflows
clickhouse		clickhouse
debezium		debezium
docs		docs
flink		flink
generator		generator
grafana		grafana
scripts		scripts
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
ruff.toml		ruff.toml

Folders and files

Latest commit

History

Repository files navigation

Real-Time E-Commerce CDC Pipeline

Architecture

Live dashboard

Tech stack

Run locally

Service endpoints

Stream processing jobs

What this demonstrates

Architecture decisions

Project structure

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages