All datasets are hive-partitioned by h0. Always include h0 in every join condition:
JOIN table2 ON table1.hX = table2.hX AND table1.h0 = table2.h0where hX is the finest resolution shared by both datasets (h8, h9, etc. — check the
schema). Omitting AND t1.h0 = t2.h0 forces the engine to scan every partition file on S3
instead of only the matching ones (10-100x slower).
Use regions/hex/** or countries.parquet as the first CTE to establish geographic
scope before joining large thematic datasets (PADUS, carbon, wetlands, species).
WITH scope AS (
SELECT DISTINCT h8, h0
FROM read_parquet('s3://public-overturemaps/regions/hex/**')
WHERE region = 'US-CA'
),
parks AS (
SELECT DISTINCT p.h8, p.h0
FROM scope s
JOIN read_parquet('s3://public-padus/padus-4-1/fee/hex/**') p
ON s.h8 = p.h8 AND s.h0 = p.h0
WHERE p.Des_Tp = 'NP'
)
SELECT SUM(c.carbon)/1e6
FROM parks p
JOIN read_parquet('s3://public-carbon/vulnerable-carbon-2024/hex/**') c
ON p.h8 = c.h8 AND p.h0 = c.h0Do NOT use h3_cell_to_parent() or h3_h3_to_string() — these functions are not
available in the GPU engine. All datasets have pre-computed H3 columns at multiple
resolutions (h0, h8, h9, etc.). For cross-resolution joins, use the coarser shared column.
Note: rook-ceph-rgw-nautiluss3.rook is an internal endpoint only accessible from k8s. Always use it — not the public endpoint — to run queries.
You must read parquet datasets from S3 using read_parquet(). There are no local tables.