Benchmark results: CPU (DuckDB) outperforms GPU (Polars/cuDF) on S3-backed H3 joins

## Benchmark summary (gpu mode — Polars Rust S3)

Queries: H3 equi-joins on hive-partitioned parquet from Ceph S3 (internal endpoint).  
Hardware: RTX 4000 Ada (20 GB VRAM), NRP Nautilus node, 100G InfiniBand.  
GPU engine: Polars SQLContext + cuDF (`QUERY_ENGINE=gpu`, default mode).  
CPU engine: DuckDB with `SET s3_allow_recursive_globbing=false; SET preserve_insertion_order=false`.  
Runs: 3 per query per server; table shows medians.

| Query | Data (compressed) | GPU median (s) | CPU median (s) | CPU/GPU ratio |
|---|---|---|---|---|
| Q1 — IUCN × IUCN (2-table join) | ~0.1 GiB | 13.9 | 7.1 | 0.51× |
| Q2 — IUCN 3-way join | ~0.1 GiB | 19.8 | 12.4 | 0.63× |
| Q3a — Americas carbon × IUCN | ~3 GiB | 62.5 | 14.5 | 0.23× |
| Q4a — Americas carbon × WDPA + GROUP BY | ~3 GiB | 58.7 | 32.6 | 0.56× |
| Q5a — Americas carbon × 2 IUCN | ~3 GiB | 53.3 | 18.5 | 0.35× |

A CPU/GPU ratio below 1 means **CPU is faster**. DuckDB on CPU wins on all five queries, by roughly 1.6× (Q2) to 4.3× (Q3a).
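
The ratio column can be reproduced from the two median columns. A minimal sketch, with the medians copied from the table above; the `medians` dict and variable names are illustrative and not part of the benchmark harness:

```python
# Medians in seconds, copied from the benchmark table: query -> (GPU, CPU).
medians = {
    "Q1": (13.9, 7.1),
    "Q2": (19.8, 12.4),
    "Q3a": (62.5, 14.5),
    "Q4a": (58.7, 32.6),
    "Q5a": (53.3, 18.5),
}

# CPU/GPU ratio: a value below 1 means the CPU run finished sooner.
ratios = {q: round(cpu / gpu, 2) for q, (gpu, cpu) in medians.items()}

for query, ratio in ratios.items():
    print(f"{query}: CPU/GPU = {ratio:.2f}x")
```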

Full results: `benchmarks/results-full.csv`.

## Root cause

All queries are S3 I/O bound. The GPU data path has an extra hop:

```
GPU path:  S3 → Polars object_store (CPU) → CPU RAM → PCIe → GPU VRAM → cuDF compute
CPU path:  S3 → DuckDB reader (CPU) → CPU RAM → DuckDB compute
```

The PCIe transfer adds ~40-50s for ~3 GiB queries. GPU compute savings are smaller than this overhead.

DuckDB also benefits from OS-level caching of S3 data on repeated runs, which explains the high CPU variance (Q3a runs: 14.4s, 26.5s, 14.5s). GPU timings are more consistent because the PCIe transfer dominates regardless of S3 cache state.
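
Reporting the median of three runs is what keeps the single slow Q3a run out of the summary table. A small sketch using only the three CPU timings quoted above (variable names are illustrative):

```python
from statistics import median

# The three CPU (DuckDB) timings for Q3a quoted above, in seconds.
q3a_cpu_runs = [14.4, 26.5, 14.5]

# The median ignores the one slow run; this is the value shown in the table.
q3a_median = median(q3a_cpu_runs)

# Gap between the slowest and fastest run, a rough measure of the variance.
q3a_spread = max(q3a_cpu_runs) - min(q3a_cpu_runs)

print(f"median = {q3a_median}s, spread = {q3a_spread:.1f}s")
```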

## Updates — what has since been implemented

### 1. KvikIO S3 transport ✅ (issue #3, merged in PR #2)

Direct S3 throughput benchmark from inside the cluster pod (carbon Americas, 28 files, 3.22 GiB):

| Transport | Time | Throughput |
|---|---|---|
| kvikio pread (KVIKIO_NTHREADS=64, 16 MiB chunks) | **4.1s** | **6.25 Gbps** |
| Polars Rust object_store | 26.6s | 0.97 Gbps |

That is a 6.5× faster S3 download. Three findings had been blocking this:
- `cudf.read_parquet(storage_options=...)` uses PyArrow S3, not kvikio (the RAPIDS docs are misleading on this point)
- `kvikio.RemoteFile.read()` is single-threaded; `pread()` activates the thread pool
- `set_num_threads()` is silently broken in kvikio 25.02, so the `KVIKIO_NTHREADS` env var must be used instead
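
The throughput and speedup figures follow from the dataset size and the two timings. A back-of-the-envelope sketch, assuming the 3.22 figure is converted at 8 bits per byte with no GiB-to-GB adjustment (an assumption about how the table was computed); under that assumption the results land close to the table values:

```python
def throughput_gbps(size_gb: float, seconds: float) -> float:
    """Gigabits per second for size_gb gigabytes moved in the given time."""
    return size_gb * 8 / seconds


size = 3.22                      # dataset size quoted above (unit handling is an assumption)
kvikio_s, polars_s = 4.1, 26.6   # timings from the transport table

kvikio_gbps = throughput_gbps(size, kvikio_s)
polars_gbps = throughput_gbps(size, polars_s)
speedup = polars_s / kvikio_s

print(f"kvikio: {kvikio_gbps:.2f} Gbps, Polars: {polars_gbps:.2f} Gbps, speedup: {speedup:.1f}x")
```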

`gpu-cudf` mode now uses `kvikio.RemoteFile.pread()` → `BytesIO` → `cudf.read_parquet()`, with `KVIKIO_NTHREADS=64` set in the deployment. The end-to-end Q3a benchmark is still pending (see Remaining items).
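
The gain from a multi-threaded `pread()` comes from splitting one object into fixed-size byte ranges and fetching them concurrently. The sketch below illustrates that pattern with the standard-library thread pool only; it is not kvikio's implementation or API, and `parallel_read` and the `read_range` callable are hypothetical stand-ins for a real ranged S3 read:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB, the chunk size used in the benchmark


def parallel_read(
    read_range: Callable[[int, int], bytes],  # hypothetical: (offset, size) -> bytes
    total_size: int,
    chunk_size: int = CHUNK_SIZE,
    num_threads: int = 64,
) -> bytes:
    """Fetch bytes [0, total_size) as independent ranges on a thread pool."""
    buf = bytearray(total_size)

    def fetch(offset: int) -> None:
        size = min(chunk_size, total_size - offset)
        data = read_range(offset, size)
        if len(data) != size:
            raise IOError(f"short read at offset {offset}")
        # Each worker writes to a disjoint slice, so no locking is needed.
        buf[offset:offset + size] = data

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() waits for all chunks and re-raises any worker exception.
        list(pool.map(fetch, range(0, total_size, chunk_size)))
    return bytes(buf)


# Usage with an in-memory stand-in for a remote object.
blob = bytes(range(256)) * 1000
result = parallel_read(
    lambda off, n: blob[off:off + n], len(blob), chunk_size=4096, num_threads=8
)
```

The returned bytes can be wrapped in `io.BytesIO` and handed to a Parquet reader, mirroring the `BytesIO` step described above.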

### 2. Partition pruning in gpu-cudf mode ✅ (issue #4, merged in PR #2)

`_scan_cudf()` now extracts h0 predicates from SQL and filters the file list before any reads. Confirmed: Q3a completes without OOM; Q3 (no filter, all 94 files, 7.3 GiB) would OOM without pruning.
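
The idea of pruning on the file list can be sketched as follows. This is an illustration under stated assumptions, not the actual `_scan_cudf()` code: the helper names, the regexes, the `h0=<value>/` hive path layout, and the example h0 values are all hypothetical, and only literal `h0 IN (...)` lists are handled.

```python
import re
from typing import Optional

# Hypothetical patterns: a literal IN list on h0, and a hive-style h0=<value>/ path segment.
_H0_IN = re.compile(r"\bh0\s+IN\s*\(([^)]*)\)", re.IGNORECASE)
_H0_PATH = re.compile(r"(?:^|/)h0=([^/]+)/")


def extract_h0_values(sql: str) -> Optional[set[str]]:
    """Return the h0 values listed in the SQL, or None if no h0 IN predicate is found."""
    values: set[str] = set()
    for match in _H0_IN.finditer(sql):
        for item in match.group(1).split(","):
            item = item.strip().strip("'\"")
            if item:
                values.add(item)
    return values or None


def prune_files(files: list[str], sql: str) -> list[str]:
    """Drop files whose h0 partition is not requested by the query."""
    wanted = extract_h0_values(sql)
    if wanted is None:
        return list(files)  # no predicate, so nothing can be pruned
    kept = []
    for path in files:
        m = _H0_PATH.search(path)
        # Keep matching partitions, and any file that has no h0= segment at all.
        if m is None or m.group(1) in wanted:
            kept.append(path)
    return kept


# Placeholder paths and values, for illustration only.
files = [
    "carbon/h0=8001fffffffffff/data_0.parquet",
    "carbon/h0=8003fffffffffff/data_0.parquet",
    "carbon/h0=8005fffffffffff/data_0.parquet",
]
sql = (
    "SELECT * FROM carbon c JOIN iucn i ON c.h8 = i.h8 "
    "WHERE c.h0 IN ('8001fffffffffff', '8005fffffffffff')"
)
pruned = prune_files(files, sql)
```

Falling back to the full file list when no predicate is found keeps the sketch safe: pruning can only skip reads, never change results.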

## Remaining items

3. **End-to-end gpu-cudf benchmark**: the new kvikio pread pipeline is deployed but not yet benchmarked end-to-end. The S3 read now takes 4.1s (was 26.6s); the open question is whether Q3a on GPU drops from 62.5s to a time competitive with CPU (~14.5s). Results will be added here once available.

4. **Larger GPU** (A100 80GB / H100 80GB): global Q3-Q5 (full 9.9 GiB carbon) OOM the RTX 4000 Ada. A larger GPU would allow testing the full dataset.

5. **Compute-heavy queries**: GROUP BY with many groups, window functions, or queries over already-loaded data would favor GPU. S3-read-heavy analytics favor CPU.

## Partition pruning in default gpu mode

The default `gpu` mode (Polars lazy) correctly prunes partitions via DPP:
- Q3a (WHERE h0 IN 28 Americas values): completes in 60s, no OOM
- Q3 (no filter, all 94 partitions): OOMKills the pod
