Benchmark summary (gpu mode — Polars Rust S3)
Queries: H3 equi-joins on hive-partitioned parquet from Ceph S3 (internal endpoint).
Hardware: RTX 4000 Ada (20 GB VRAM), NRP Nautilus node, 100G InfiniBand.
GPU engine: Polars SQLContext + cuDF (`QUERY_ENGINE=gpu`, default mode).
CPU engine: DuckDB with `SET s3_allow_recursive_globbing=false; SET preserve_insertion_order=false`.
Runs: 3 per query per server; table shows medians.
| Query | Data (compressed) | GPU median (s) | CPU median (s) | CPU/GPU ratio |
|---|---|---|---|---|
| Q1: IUCN × IUCN (2-table join) | ~0.1 GiB | 13.9 | 7.1 | 0.51× |
| Q2: IUCN 3-way join | ~0.1 GiB | 19.8 | 12.4 | 0.63× |
| Q3a: Americas carbon × IUCN | ~3 GiB | 62.5 | 14.5 | 0.23× |
| Q4a: Americas carbon × WDPA + GROUP BY | ~3 GiB | 58.7 | 32.6 | 0.56× |
| Q5a: Americas carbon × 2 IUCN | ~3 GiB | 53.3 | 18.5 | 0.35× |
A CPU/GPU ratio below 1 means the CPU engine is faster. DuckDB on CPU wins on every query.
Full results: `benchmarks/results-full.csv`.
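As a quick consistency check, the medians and ratios in the table can be recomputed from the numbers quoted in this document. This is only a sketch: the per-run Q3a CPU timings are the ones listed under "Root cause", the other values are the table medians, and the variable names are illustrative rather than taken from the benchmark scripts.

```python
from statistics import median

# Q3a CPU runs as quoted in this document (seconds).
q3a_cpu_runs = [14.4, 26.5, 14.5]
q3a_cpu_median = median(q3a_cpu_runs)

# (GPU median, CPU median) per query, copied from the table above.
table = {
    "Q1": (13.9, 7.1),
    "Q2": (19.8, 12.4),
    "Q3a": (62.5, 14.5),
    "Q4a": (58.7, 32.6),
    "Q5a": (53.3, 18.5),
}

# CPU/GPU ratio: values below 1 mean the CPU engine was faster.
ratios = {q: round(cpu / gpu, 2) for q, (gpu, cpu) in table.items()}

for query, ratio in ratios.items():
    print(f"{query}: CPU/GPU = {ratio:.2f}x")
```

The median (rather than the mean) keeps the single slow 26.5s run from skewing the Q3a CPU figure.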
Root cause
All queries are S3 I/O bound. The GPU data path has an extra hop:
- GPU path: S3 → Polars object_store (CPU) → CPU RAM → PCIe → GPU VRAM → cuDF compute
- CPU path: S3 → DuckDB reader (CPU) → CPU RAM → DuckDB compute
The PCIe transfer adds ~40-50s for ~3 GiB queries. GPU compute savings are smaller than this overhead.
DuckDB also benefits from OS-level S3 caching on repeated runs, which explains the high CPU variance (Q3a runs: 14.4/26.5/14.5s). GPU timings are more consistent because the PCIe transfer dominates regardless of S3 cache state.
Updates — what has since been implemented
1. KvikIO S3 transport ✅ (issue #3, merged in PR #2)
Direct S3 throughput benchmark from inside the cluster pod (carbon Americas, 28 files, 3.22 GiB):
| Transport | Time | Throughput |
|---|---|---|
| kvikio pread (KVIKIO_NTHREADS=64, 16 MiB chunks) | 4.1s | 6.25 Gbps |
| Polars Rust object_store | 26.6s | 0.97 Gbps |
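The throughput and speedup figures follow from the transfer size and wall time. The sketch below assumes the Gbps column uses binary gigabits (GiB × 8 / seconds), since that convention reproduces the 0.97 Gbps figure; the function name is illustrative.

```python
def throughput_gbit(size_gib: float, seconds: float) -> float:
    """Throughput in binary gigabits per second (GiB * 8 / s)."""
    return size_gib * 8 / seconds


SIZE_GIB = 3.22                  # carbon Americas, 28 files
kvikio_s, polars_s = 4.1, 26.6   # wall times from the table

kvikio_rate = throughput_gbit(SIZE_GIB, kvikio_s)
polars_rate = throughput_gbit(SIZE_GIB, polars_s)
speedup = polars_s / kvikio_s

print(f"kvikio: {kvikio_rate:.2f} Gbit/s")
print(f"polars: {polars_rate:.2f} Gbit/s")
print(f"speedup: {speedup:.1f}x")
```

With the rounded 4.1s time, the kvikio rate comes out slightly above the table's 6.25 Gbps, presumably because the table was computed from the unrounded timing.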
6.5× faster S3 download. Key findings that were blocking this:
- `cudf.read_parquet(storage_options=...)` uses PyArrow S3, not kvikio (misleading RAPIDS docs)
- `kvikio.RemoteFile.read()` is single-threaded; `pread()` activates the thread pool
- `set_num_threads()` is silently broken in kvikio 25.02 — must use `KVIKIO_NTHREADS` env var
`gpu-cudf` mode now uses `kvikio.RemoteFile.pread()` → `BytesIO` → `cudf.read_parquet()` with `KVIKIO_NTHREADS=64` set in the deployment. End-to-end Q3a benchmark pending (see below).
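The read path described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name is hypothetical, it assumes S3 credentials and the Ceph endpoint are supplied through the usual AWS environment variables, and it assumes `kvikio.RemoteFile.open_s3_url()` is available in the installed kvikio version.

```python
import io
import os

# Set the thread count through the environment, since set_num_threads() is
# reported broken in kvikio 25.02. Doing this before kvikio is imported is
# the conservative choice; 64 matches the deployment setting.
os.environ.setdefault("KVIKIO_NTHREADS", "64")


def read_parquet_via_kvikio(s3_url: str):
    """Download one parquet object with kvikio, then parse it with cuDF.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module still loads on machines without RAPIDS.
    import cudf
    import kvikio
    import numpy as np

    with kvikio.RemoteFile.open_s3_url(s3_url) as remote:
        host_buf = np.empty(remote.nbytes(), dtype=np.uint8)
        # pread() returns a future and spreads the read over the thread pool;
        # read() would perform the same transfer on a single thread.
        remote.pread(host_buf).get()

    # Hand the downloaded bytes to cuDF through an in-memory file object.
    return cudf.read_parquet(io.BytesIO(host_buf.tobytes()))
```

The download lands in host memory first, so this path still pays the host-to-device copy when cuDF parses the buffer; only the S3 fetch itself is parallelized.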
2. Partition pruning in gpu-cudf mode ✅ (issue #4, merged in PR #2)
`_scan_cudf()` now extracts h0 predicates from SQL and filters the file list before any reads. Confirmed: Q3a completes without OOM; Q3 (no filter, all 94 files, 7.3 GiB) would OOM without pruning.
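The idea behind this pruning step can be illustrated with a small sketch. It is not the actual `_scan_cudf()` implementation: the helper names, the regex-based predicate extraction (which handles only `h0 IN (...)` lists), the `h0=<value>/` path layout, and the sample paths and values are all assumptions made for the example.

```python
import re
from typing import List, Optional, Set


def extract_h0_values(sql: str) -> Optional[Set[str]]:
    """Return the values in an `h0 IN (...)` predicate, or None if absent."""
    match = re.search(r"\bh0\s+IN\s*\(([^)]*)\)", sql, flags=re.IGNORECASE)
    if match is None:
        return None
    return {v.strip().strip("'\"") for v in match.group(1).split(",") if v.strip()}


def prune_files(paths: List[str], sql: str) -> List[str]:
    """Keep only files whose hive partition directory matches the h0 filter."""
    wanted = extract_h0_values(sql)
    if wanted is None:
        return list(paths)  # no h0 predicate: every file must be read
    kept = []
    for path in paths:
        part = re.search(r"(?:^|/)h0=([^/]+)/", path)
        if part and part.group(1) in wanted:
            kept.append(path)
    return kept


# Placeholder paths and partition values, for illustration only.
files = [
    "s3://bucket/carbon/h0=a1/part0.parquet",
    "s3://bucket/carbon/h0=b2/part0.parquet",
    "s3://bucket/carbon/h0=c3/part0.parquet",
]
query = "SELECT * FROM carbon c JOIN iucn i ON c.h8 = i.h8 WHERE c.h0 IN ('a1', 'c3')"
selected = prune_files(files, query)
```

Filtering the file list before any read means files outside the requested partitions are never downloaded or copied to the GPU, which is what keeps the filtered query within VRAM.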
Remaining items
- End-to-end gpu-cudf benchmark: the new kvikio pread pipeline is deployed but not yet benchmarked end-to-end. The S3 read now takes 4.1s (was 26.6s); the open question is whether Q3a on GPU drops from 62.5s to something competitive with CPU (~14.5s). Results will be added here once available.
- Larger GPU (A100 80GB / H100 80GB): the global Q3-Q5 queries (full 9.9 GiB carbon) OOM on the RTX 4000 Ada. A larger GPU would allow testing the full dataset.
- Compute-heavy queries: GROUP BY with many groups, window functions, or queries over already-loaded data would favor GPU. S3-read-heavy analytics favor CPU.
Partition pruning in default gpu mode
The default `gpu` mode (Polars lazy) correctly prunes partitions via DPP:
- Q3a (WHERE h0 IN 28 Americas values): completes in 60s, no OOM
- Q3 (no filter, all 94 partitions): OOMKills the pod