I’ve encountered a fallback when filtering on a column using %in% with many values, but only when the data is lazily read from a Parquet file.
Reprex:
suppressMessages({
library(duckplyr)
library(dplyr)
library(arrow)
library(fs)
})
# Create a simple Parquet file
tmp <- dir_create(file_temp())
file <- path(tmp, "test.parquet")
in_memory_duckdb_df <-
tibble(id = 1:1000) |>
compute_parquet(file)
scanned_parquet_df <- read_parquet_duckdb(file, prudence = "stingy")
many_values <- 1:500
in_memory_duckdb_df |> filter(id %in% many_values) # no issues
#> # A duckplyr data frame: 1 variable
#> id
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
#> # ℹ more rows
scanned_parquet_df |> filter(id %in% many_values)
#> Error in `filter()`:
#> ! This operation cannot be carried out by DuckDB, and the input is a
#> stingy duckplyr frame.
#> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and
#> continue with duckplyr.
#> ℹ See `vignette("prudence")` for other options.
#> Caused by error:
#> ! Materialization is disabled, use collect() or as_tibble() to materialize.
scanned_parquet_df |> filter(id %in% many_values[1:10]) |> explain()
#> ┌───────────────────────────┐
#> │ READ_PARQUET │
#> │ ──────────────────── │
#> │ Function: │
#> │ READ_PARQUET │
#> │ │
#> │ Projections: id │
#> │ │
#> │ Filters: │
#> │ COALESCE(((((((((("r_base:│
#> │ :=="(id, 1) OR "r_base::==│
#> │ "(id, 2)) OR "r_base::==" │
#> │ (id, 3)) OR "r_base::==" │
#> │ (id, 4)) OR "r_base::==" │
#> │ (id, 5)) OR "r_base::==" │
#> │ (id, 6)) OR "r_base::==" │
#> │ (id, 7)) OR "r_base::==" │
#> │ (id, 8)) OR "r_base::==" │
#> │ (id, 9)) OR "r_base::==" │
#> │ (id, 10)), false) │
#> │ │
#> │ ~200 Rows │
#> └───────────────────────────┘
Created on 2025-08-05 with reprex v2.1.1
As you can see in the reprex output, when filtering on a smaller number of values, the translation results in a collection of OR == statements which might be the root cause. I'm wondering why simply using the "IN" operator wouldn't do the job better?
I’ve encountered a fallback when filtering on a column using
%in%with many values, but only when the data is lazily read from a Parquet file.Reprex:
suppressMessages({ library(duckplyr) library(dplyr) library(arrow) library(fs) }) # Create a simple Parquet file tmp <- dir_create(file_temp()) file <- path(tmp, "test.parquet") in_memory_duckdb_df <- tibble(id = 1:1000) |> compute_parquet(file) scanned_parquet_df <- read_parquet_duckdb(file, prudence = "stingy") many_values <- 1:500 in_memory_duckdb_df |> filter(id %in% many_values) # no issues #> # A duckplyr data frame: 1 variable #> id #> <int> #> 1 1 #> 2 2 #> 3 3 #> 4 4 #> 5 5 #> 6 6 #> 7 7 #> 8 8 #> 9 9 #> 10 10 #> # ℹ more rows scanned_parquet_df |> filter(id %in% many_values) #> Error in `filter()`: #> ! This operation cannot be carried out by DuckDB, and the input is a #> stingy duckplyr frame. #> ℹ Use `compute(prudence = "lavish")` to materialize to temporary storage and #> continue with duckplyr. #> ℹ See `vignette("prudence")` for other options. #> Caused by error: #> ! Materialization is disabled, use collect() or as_tibble() to materialize. scanned_parquet_df |> filter(id %in% many_values[1:10]) |> explain() #> ┌───────────────────────────┐ #> │ READ_PARQUET │ #> │ ──────────────────── │ #> │ Function: │ #> │ READ_PARQUET │ #> │ │ #> │ Projections: id │ #> │ │ #> │ Filters: │ #> │ COALESCE(((((((((("r_base:│ #> │ :=="(id, 1) OR "r_base::==│ #> │ "(id, 2)) OR "r_base::==" │ #> │ (id, 3)) OR "r_base::==" │ #> │ (id, 4)) OR "r_base::==" │ #> │ (id, 5)) OR "r_base::==" │ #> │ (id, 6)) OR "r_base::==" │ #> │ (id, 7)) OR "r_base::==" │ #> │ (id, 8)) OR "r_base::==" │ #> │ (id, 9)) OR "r_base::==" │ #> │ (id, 10)), false) │ #> │ │ #> │ ~200 Rows │ #> └───────────────────────────┘Created on 2025-08-05 with reprex v2.1.1
As you can see in the reprex output, when filtering on a smaller number of values, the translation results in a collection of OR == statements which might be the root cause. I'm wondering why simply using the "IN" operator wouldn't do the job better?