Skip to content

Benchmark results: CPU (DuckDB) outperforms GPU (Polars/cuDF) on S3-backed H3 joins #5

@boettiger-lab-llm-agent

Description

@boettiger-lab-llm-agent

Benchmark summary (gpu mode — Polars Rust S3)

Queries: H3 equi-joins on hive-partitioned parquet from Ceph S3 (internal endpoint).
Hardware: RTX 4000 Ada (20 GB VRAM), NRP Nautilus node, 100G InfiniBand.
GPU engine: Polars SQLContext + cuDF (`QUERY_ENGINE=gpu`, default mode).
CPU engine: DuckDB with `SET s3_allow_recursive_globbing=false; SET preserve_insertion_order=false`.
Runs: 3 per query per server; table shows medians.

Query Data (compressed) GPU median (s) CPU median (s) CPU/GPU ratio
Q1 — IUCN × IUCN (2-table join) ~0.1 GiB 13.9 7.1 0.51×
Q2 — IUCN 3-way join ~0.1 GiB 19.8 12.4 0.63×
Q3a — Americas carbon × IUCN ~3 GiB 62.5 14.5 0.23×
Q4a — Americas carbon × WDPA + GROUP BY ~3 GiB 58.7 32.6 0.56×
Q5a — Americas carbon × 2 IUCN ~3 GiB 53.3 18.5 0.35×

CPU/GPU < 1 means CPU is faster. DuckDB CPU wins across all queries.

Full results: `benchmarks/results-full.csv`.

Root cause

All queries are S3 I/O bound. The GPU data path has an extra hop:

GPU path:  S3 → Polars object_store (CPU) → CPU RAM → PCIe → GPU VRAM → cuDF compute
CPU path:  S3 → DuckDB reader (CPU) → CPU RAM → DuckDB compute

The PCIe transfer adds ~40-50s for ~3 GiB queries. GPU compute savings are smaller than this overhead.

DuckDB also benefits from OS-level S3 cache on repeated runs (explains high CPU variance: Q3a runs 14.4/26.5/14.5s). GPU is more consistent because PCIe transfer dominates regardless of S3 cache state.

Updates — what has since been implemented

1. KvikIO S3 transport ✅ (issue #3, merged in PR #2)

Direct S3 throughput benchmark from inside the cluster pod (carbon Americas, 28 files, 3.22 GiB):

Transport Time Throughput
kvikio pread (KVIKIO_NTHREADS=64, 16 MiB chunks) 4.1s 6.25 Gbps
Polars Rust object_store 26.6s 0.97 Gbps

6.5× faster S3 download. Key findings that were blocking this:

  • `cudf.read_parquet(storage_options=...)` uses PyArrow S3, not kvikio (misleading RAPIDS docs)
  • `kvikio.RemoteFile.read()` is single-threaded; `pread()` activates the thread pool
  • `set_num_threads()` is silently broken in kvikio 25.02 — must use `KVIKIO_NTHREADS` env var

`gpu-cudf` mode now uses `kvikio.RemoteFile.pread()` → `BytesIO` → `cudf.read_parquet()` with `KVIKIO_NTHREADS=64` set in the deployment. End-to-end Q3a benchmark pending (see below).

2. Partition pruning in gpu-cudf mode ✅ (issue #4, merged in PR #2)

`_scan_cudf()` now extracts h0 predicates from SQL and filters the file list before any reads. Confirmed: Q3a completes without OOM; Q3 (no filter, all 94 files, 7.3 GiB) would OOM without pruning.

Remaining items

  1. End-to-end gpu-cudf benchmark — the new kvikio pread pipeline is deployed but not yet benchmarked end-to-end. S3 read is now 4.1s (was 26.6s); the question is whether Q3a GPU drops from 62.5s to competitive with CPU (~14.5s). Results to be added here once available.

  2. Larger GPU (A100 80GB / H100 80GB): global Q3-Q5 (full 9.9 GiB carbon) OOM the RTX 4000 Ada. A larger GPU would allow testing the full dataset.

  3. Compute-heavy queries: GROUP BY with many groups, window functions, or queries over already-loaded data would favor GPU. S3-read-heavy analytics favor CPU.

Partition pruning in default gpu mode

The default `gpu` mode (Polars lazy) correctly prunes partitions via DPP:

  • Q3a (WHERE h0 IN 28 Americas values): completes in 60s, no OOM
  • Q3 (no filter, all 94 partitions): OOMKills the pod

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions