
Add AnyList<L>: one logical type for any list encoding #8

Merged
emilk merged 8 commits into main from emilk/any-list on Jun 9, 2026
emilk commented Jun 8, 2026

Arrow has five physically different ways to store the same logical thing — a column of lists of L: List, LargeList, ListView, LargeListView, and FixedSizeList. They already share the same Datatype::Value (ListValue<'a, L>) and Owned (Vec<L::Owned>) — only the physical encoding differs.

AnyList<L> unifies them: Column<AnyList<L>> accepts whichever encoding it is handed at parse time and reads them all uniformly, dispatching at runtime over an internal enum of the per-encoding typed representations.

let column = Column::<AnyList<Utf8>>::try_from(array)?;
// `array` may be List / LargeList / ListView / LargeListView / FixedSizeList:
for list in &column {
    for s in list { /* &str */ }
}
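
To make the runtime dispatch concrete, here is a minimal, standard-library-only sketch. `AnyToyList` and its variants are hypothetical stand-ins, not the crate's real types: the sketch models only three of the five layouts, stores plain `i32` items, and ignores nulls.

```rust
/// Toy stand-in for the internal enum (hypothetical; not the crate's type).
/// Every variant stores its items in one flat `values` buffer.
#[derive(Debug)]
pub enum AnyToyList {
    /// `List`-style: row `i` spans `offsets[i]..offsets[i + 1]`.
    Offsets { offsets: Vec<usize>, values: Vec<i32> },
    /// `ListView`-style: row `i` spans `offsets[i]..offsets[i] + sizes[i]`.
    View { offsets: Vec<usize>, sizes: Vec<usize>, values: Vec<i32> },
    /// `FixedSizeList`-style: every row has `size` items, known only at runtime.
    Fixed { size: usize, values: Vec<i32> },
}

impl AnyToyList {
    /// Number of rows (lists) in the column.
    pub fn len(&self) -> usize {
        match self {
            Self::Offsets { offsets, .. } => offsets.len().saturating_sub(1),
            Self::View { offsets, .. } => offsets.len(),
            // A zero-sized fixed list has no recoverable row count in this toy model.
            Self::Fixed { size, values } => {
                if *size == 0 { 0 } else { values.len() / size }
            }
        }
    }

    /// The items of one row, regardless of the physical layout.
    pub fn value(&self, row: usize) -> &[i32] {
        match self {
            Self::Offsets { offsets, values } => &values[offsets[row]..offsets[row + 1]],
            Self::View { offsets, sizes, values } => {
                &values[offsets[row]..offsets[row] + sizes[row]]
            }
            Self::Fixed { size, values } => &values[row * size..(row + 1) * size],
        }
    }

    /// Uniform iteration over all rows.
    pub fn iter(&self) -> impl Iterator<Item = &[i32]> + '_ {
        (0..self.len()).map(move |row| self.value(row))
    }
}
```

Because every variant hands back the same `&[i32]` row type, callers never need to know which layout they were given, which is the property the PR relies on for its shared `Value` type.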

Design

  • AnyTypedList<L> is an enum over the five typed representations. Element access (value/is_null/iterate) delegates to whichever encoding the column holds: it reuses the existing Datatype impls for the four variable-length encodings, and reads the per-row length at runtime for FixedSizeList (whose size is a const generic that the dynamic type can't carry).
  • An enum, not a dyn trait object: no boxing, and the uniform Value type composes (nesting, Option, etc.).
  • matches() accepts all five datatypes with a matching inner L; build/from_values emits the canonical List.
  • Clone is hand-written (deriving would add a spurious L: Clone bound; only L::Typed: Clone is needed) — consistent with TypedList & co.
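
The point about the hand-written Clone can be shown in isolation. This is a sketch with hypothetical names (`Logical`, `Utf8Marker`, `ToyTypedList`), not the crate's code: `#[derive(Clone)]` would require `L: Clone` on the type parameter itself, while the manual impl asks only for `L::Typed: Clone`.

```rust
/// Hypothetical stand-in for the logical-type trait.
pub trait Logical {
    type Typed;
}

/// Marker type that deliberately does NOT implement `Clone`.
pub struct Utf8Marker;

impl Logical for Utf8Marker {
    type Typed = Vec<String>;
}

/// Only stores `L::Typed`; `L` itself is never stored.
pub struct ToyTypedList<L: Logical> {
    pub items: L::Typed,
}

// Hand-written: bounded on the stored data only, not on the marker `L`.
impl<L: Logical> Clone for ToyTypedList<L>
where
    L::Typed: Clone,
{
    fn clone(&self) -> Self {
        Self { items: self.items.clone() }
    }
}
```

With a derived impl, `ToyTypedList<Utf8Marker>` would not be cloneable because `Utf8Marker` is not `Clone`; with the manual impl it is.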

Verification

  • cargo clippy --all-features --all-targets clean, cargo fmt --all applied
  • cargo test --all-features green, incl. a new any_list_columns test: build → canonical List; try_from each of the five encodings read uniformly; non-list rejected; item-nullability enforced regardless of encoding; null rows via Option<AnyList<…>>
  • cargo doc --document-private-items -D warnings clean

🤖 Generated with Claude Code

emilk and others added 2 commits June 8, 2026 14:23
All five arrow list encodings (`List`, `LargeList`, `ListView`,
`LargeListView`, `FixedSizeList`) store the same logical thing — a column of
lists of `L` — and already share the same `Datatype::Value` (`ListValue`) and
`Owned` (`Vec<L::Owned>`). `Column<AnyList<L>>` accepts whichever encoding it is
handed at parse time and reads them all uniformly, dispatching at runtime over
an internal enum of the per-encoding typed reps. Building emits the canonical
`List`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two related changes that let `AnyList` exist honestly:

1. Rename the `Datatype` trait family to avoid the case-only clash with arrow's
   `DataType`: `Datatype`→`LogicalType`, `PrimitiveDatatype`→`PrimitiveType`,
   `RefDatatype`→`RefType`.

2. Split out a `ConcreteType: LogicalType` trait holding `datatype()` and
   `build()` — the two operations that need a *single* concrete arrow datatype.
   `LogicalType` keeps only the read/parse contract (`matches`, `downcast`,
   `value`, …) plus a new `expected_datatype()` description for error messages.

   `AnyList` now implements only `LogicalType` (parse-only): it accepts several
   arrow encodings and has no single datatype to build or report. Building,
   `Default`, `Column::datatype()`, and schema generation are gated on
   `ConcreteType`. `WrongDatatype.expected` becomes a `String` (so the generic
   parse path needn't produce a concrete `DataType`), and `InfallibleBuild`,
   `DictionaryKey`, and `RunEndType` now require `ConcreteType`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
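
The trait split described in the commit above can be sketched as follows. This is a simplified illustration, not the crate's actual definitions: `ToyDataType`, `Int32`, and `schema_field` are hypothetical, and the real traits carry more methods (`downcast`, `value`, `build`, …).

```rust
use std::marker::PhantomData;

/// Toy stand-in for arrow's `DataType` (hypothetical, heavily reduced).
#[derive(Debug, Clone, PartialEq)]
pub enum ToyDataType {
    Int32,
    Utf8,
    List(Box<ToyDataType>),
    LargeList(Box<ToyDataType>),
}

/// Read/parse contract: which physical datatypes can this logical type read?
pub trait LogicalType {
    fn matches(datatype: &ToyDataType) -> bool;
}

/// Extra contract for types that map to exactly one physical datatype.
pub trait ConcreteType: LogicalType {
    fn datatype() -> ToyDataType;
}

pub struct Int32;

impl LogicalType for Int32 {
    fn matches(datatype: &ToyDataType) -> bool {
        *datatype == ToyDataType::Int32
    }
}

impl ConcreteType for Int32 {
    fn datatype() -> ToyDataType {
        ToyDataType::Int32
    }
}

/// Parse-only: accepts several encodings, so it has no single `datatype()`.
pub struct AnyList<L>(PhantomData<L>);

impl<L: LogicalType> LogicalType for AnyList<L> {
    fn matches(datatype: &ToyDataType) -> bool {
        match datatype {
            ToyDataType::List(inner) | ToyDataType::LargeList(inner) => L::matches(inner),
            _ => false,
        }
    }
}

/// Schema generation needs one concrete datatype, so it is gated on `ConcreteType`.
pub fn schema_field<L: ConcreteType>(name: &str) -> (String, ToyDataType) {
    (name.to_owned(), L::datatype())
}
```

In this sketch `schema_field::<AnyList<Int32>>("x")` fails to compile, which mirrors the intent of gating building and schema generation on `ConcreteType`.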
emilk marked this pull request as draft June 8, 2026 15:22
emilk and others added 4 commits June 9, 2026 11:47
Replace `LogicalType::expected_datatype() -> String` with
`supported_datatypes() -> Vec<DataType>`, and `WrongDatatype.expected: String`
with `supported: Vec<DataType>`. Containers build their set recursively from
the inner type's `supported_datatypes()` (exact for concrete inners); `AnyList`
lists its four offset-based encodings (a `FixedSizeList` of any size can't be
enumerated, but `matches` still accepts it). The error message joins them
("Expected Utf8, …" for one, "Expected one of [...]" for several).

This is only used for error messages — datatype acceptance is still decided by
`matches`, which also validates parameters the arrow array type can't encode
(fixed sizes, timestamp timezones).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
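
A rough sketch of the message-joining rule from the commit above. The function name `expected_message` and the exact separator are assumptions; the commit only states that a single datatype is reported directly and several are reported as "one of".

```rust
/// Format the "expected" half of a wrong-datatype error message.
/// (Hypothetical helper; datatypes are passed as pre-rendered names.)
pub fn expected_message(supported: &[&str]) -> String {
    match supported {
        // Exactly one supported datatype: name it directly.
        [one] => format!("Expected {one}"),
        // Zero or several: list them all.
        many => format!("Expected one of [{}]", many.join(", ")),
    }
}
```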
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_datatypes`

`downcast` is now the single validation+downcast hook: it rejects a wrong
datatype (returning `WrongDatatype`), *including* parameters the concrete arrow
array's Rust type doesn't encode — a `FixedSizeBinary`/`FixedSizeList` size and
a `Timestamp` timezone now get an explicit check; everything else is covered by
`downcast_array` (leaf type) plus recursion (nested element types). `try_new`
no longer pre-checks via `matches`.

Removes `LogicalType::matches`, `LogicalType::supported_datatypes`, and the
`datatypes_compatible` helper. `ColumnError`/`ErrorKind::WrongDatatype` now carry
just `actual: DataType` (the expected type is the caller's `L`, known from
context). Custom multi-datatype types (e.g. the `AnyInt` test) and `AnyList`
do their acceptance directly in `downcast`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
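
The idea of `downcast` as the single validation hook, including parameters that the Rust type cannot encode, might look like this in miniature. All names here (`ToyDataType`, `ToyArray`, `TypedFixedList`, `downcast_fixed`) are hypothetical; the sketch checks only a fixed list size against a const generic.

```rust
/// Toy stand-in for arrow's `DataType` (hypothetical, heavily reduced).
#[derive(Debug, Clone, PartialEq)]
pub enum ToyDataType {
    Utf8,
    FixedSizeList(usize),
}

/// Toy dynamically typed array: the size lives in the runtime datatype.
pub struct ToyArray {
    pub datatype: ToyDataType,
    pub values: Vec<i32>,
}

/// The error carries only the actual datatype; the expected one is the caller's type.
#[derive(Debug, PartialEq)]
pub struct WrongDatatype {
    pub actual: ToyDataType,
}

/// Typed view whose row size `N` is part of the Rust type.
pub struct TypedFixedList<'a, const N: usize> {
    pub values: &'a [i32],
}

/// Validate and downcast in one step, checking the size parameter explicitly.
pub fn downcast_fixed<const N: usize>(
    array: &ToyArray,
) -> Result<TypedFixedList<'_, N>, WrongDatatype> {
    match array.datatype {
        ToyDataType::FixedSizeList(size) if size == N => {
            Ok(TypedFixedList { values: &array.values })
        }
        _ => Err(WrongDatatype { actual: array.datatype.clone() }),
    }
}
```

There is no separate `matches` pre-check in this sketch: a caller either gets a typed view or an error naming what it was actually handed.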
`ColumnError`/`ErrorKind::WrongDatatype` carry `expected: String` again, but now
it is produced by the failing `downcast` itself (no `matches`/`supported_datatypes`
needed): `downcast_array` takes a lazy `|| -> String` describing the expected
datatype, and the parameter checks (`FixedSizeBinary`/`FixedSizeList` size,
`Timestamp` timezone) and custom/`AnyList` downcasts supply their own. The error
points at the precise failing level (e.g. inner `Utf8` for a `List<Utf8>`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
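
The lazy `|| -> String` from the commit above can be illustrated with `std::any::Any`. This is an assumption-laden sketch, not the crate's `downcast_array` signature: the toy version takes the actual datatype as a plain string and downcasts a `&dyn Any` instead of an arrow array.

```rust
use std::any::Any;

/// Error with a human-readable expected/actual pair (field types are assumed).
#[derive(Debug, PartialEq)]
pub struct WrongDatatype {
    pub expected: String,
    pub actual: String,
}

/// Downcast `array` to `T`, building the `expected` description only on failure.
pub fn downcast_array<'a, T, F>(
    array: &'a dyn Any,
    actual: &str,
    expected: F,
) -> Result<&'a T, WrongDatatype>
where
    T: Any,
    F: FnOnce() -> String,
{
    array.downcast_ref::<T>().ok_or_else(|| WrongDatatype {
        // The closure runs here, and only here, so the happy path allocates nothing.
        expected: expected(),
        actual: actual.to_owned(),
    })
}
```

Since each nesting level supplies its own closure, the error can describe the level that actually failed rather than the outermost type.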
emilk marked this pull request as ready for review June 9, 2026 11:17
emilk and others added 2 commits June 9, 2026 13:24
Validate the datatype (via `downcast`) before the top-level null check, so a
wrong-typed array that also has nulls reports the datatype mismatch rather than
masking it as `UnexpectedNulls`. Matches the pre-refactor behavior (which ran
`matches` first). Added a test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
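
The ordering fix in the commit above comes down to which check runs first. A minimal sketch, with hypothetical names (`ToyArray`, `try_new_non_null`) and assumed error fields:

```rust
/// Toy error type; the variant names follow the PR, the fields are assumptions.
#[derive(Debug, PartialEq)]
pub enum ColumnError {
    WrongDatatype { actual: String },
    UnexpectedNulls { null_count: usize },
}

/// Toy array: a datatype name plus nullable values.
pub struct ToyArray {
    pub datatype: String,
    pub values: Vec<Option<i32>>,
}

/// Parse a non-nullable `Int32` column.
pub fn try_new_non_null(array: &ToyArray) -> Result<Vec<i32>, ColumnError> {
    // 1. Validate the datatype first, so a mismatch is never masked by nulls.
    if array.datatype != "Int32" {
        return Err(ColumnError::WrongDatatype { actual: array.datatype.clone() });
    }
    // 2. Only then reject unexpected top-level nulls.
    let null_count = array.values.iter().filter(|v| v.is_none()).count();
    if null_count > 0 {
        return Err(ColumnError::UnexpectedNulls { null_count });
    }
    Ok(array.values.iter().flatten().copied().collect())
}
```

Swapping the two checks would make a wrong-typed array with nulls report `UnexpectedNulls`, which is the masking behavior the commit removes.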
It is outside arrow's core data model (a quiver-only logical type with no single
arrow datatype), so it warrants its own short section: accepts any list encoding,
reads uniformly, parse-only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
emilk merged commit 3e2e02c into main on Jun 9, 2026
6 checks passed