Add AnyList<L>: one logical type for any list encoding #8
Merged
All five arrow list encodings (`List`, `LargeList`, `ListView`, `LargeListView`, `FixedSizeList`) store the same logical thing — a column of lists of `L` — and already share the same `Datatype::Value` (`ListValue`) and `Owned` (`Vec<L::Owned>`). `Column<AnyList<L>>` accepts whichever encoding it is handed at parse time and reads them all uniformly, dispatching at runtime over an internal enum of the per-encoding typed reps. Building emits the canonical `List`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two related changes that let `AnyList` exist honestly:

1. Rename the `Datatype` trait family to avoid the case-only clash with arrow's `DataType`: `Datatype`→`LogicalType`, `PrimitiveDatatype`→`PrimitiveType`, `RefDatatype`→`RefType`.
2. Split out a `ConcreteType: LogicalType` trait holding `datatype()` and `build()`, the two operations that need a *single* concrete arrow datatype. `LogicalType` keeps only the read/parse contract (`matches`, `downcast`, `value`, …) plus a new `expected_datatype()` description for error messages.

`AnyList` now implements only `LogicalType` (parse-only): it accepts several arrow encodings and has no single datatype to build or report. Building, `Default`, `Column::datatype()`, and schema generation are gated on `ConcreteType`. `WrongDatatype.expected` becomes a `String` (so the generic parse path needn't produce a concrete `DataType`), and `InfallibleBuild`, `DictionaryKey`, and `RunEndType` now require `ConcreteType`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
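The trait split described in this commit can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the `DataType` enum is a toy stand-in for arrow's, `LogicalType` is reduced to a single simplified `matches` method, and `Int32Type` and `schema_field` are hypothetical names introduced only for the example.

```rust
use std::marker::PhantomData;

// Toy stand-in for arrow's `DataType` (assumption: only the variants the sketch needs).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
    LargeList(Box<DataType>),
}

/// Read/parse contract, simplified here to a single `matches` method.
trait LogicalType {
    fn matches(datatype: &DataType) -> bool;
}

/// Operations that need exactly one concrete datatype.
trait ConcreteType: LogicalType {
    fn datatype() -> DataType;
}

// Hypothetical concrete leaf type.
struct Int32Type;

impl LogicalType for Int32Type {
    fn matches(datatype: &DataType) -> bool {
        *datatype == DataType::Int32
    }
}

impl ConcreteType for Int32Type {
    fn datatype() -> DataType {
        DataType::Int32
    }
}

/// Parse-only: implements `LogicalType` but deliberately not `ConcreteType`.
struct AnyList<L>(PhantomData<L>);

impl<L: LogicalType> LogicalType for AnyList<L> {
    fn matches(datatype: &DataType) -> bool {
        match datatype {
            // Several encodings are accepted as long as the inner type matches.
            DataType::List(inner) | DataType::LargeList(inner) => L::matches(inner),
            _ => false,
        }
    }
}

/// Hypothetical schema helper: only compiles for types with a single datatype.
fn schema_field<L: ConcreteType>(name: &str) -> (String, DataType) {
    (name.to_string(), L::datatype())
}
```

With this shape, `schema_field::<Int32Type>("x")` compiles, while `schema_field::<AnyList<Int32Type>>("x")` is rejected at compile time because `AnyList` has no `ConcreteType` impl.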
Replace `LogicalType::expected_datatype() -> String` with
`supported_datatypes() -> Vec<DataType>`, and `WrongDatatype.expected: String`
with `supported: Vec<DataType>`. Containers build their set recursively from
the inner type's `supported_datatypes()` (exact for concrete inners); `AnyList`
lists its four offset-based encodings (a `FixedSizeList` of any size can't be
enumerated, but `matches` still accepts it). The error message joins them
("Expected Utf8, …" for one, "Expected one of [...]" for several).
This is only used for error messages — datatype acceptance is still decided by
`matches`, which also validates parameters the arrow array type can't encode
(fixed sizes, timestamp timezones).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
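The one-versus-several message joining mentioned in this commit might look like the sketch below. The function name `expected_message` is hypothetical, datatype names are passed as plain strings rather than arrow `DataType` values, and the exact wording of the real message is an assumption.

```rust
/// Builds the "expected" part of a datatype-mismatch message from the
/// names of the supported datatypes (hypothetical helper, not the crate's API).
fn expected_message(supported: &[&str]) -> String {
    match supported {
        // Exactly one supported datatype: name it directly.
        [one] => format!("Expected {one}"),
        // Otherwise list every candidate.
        many => format!("Expected one of [{}]", many.join(", ")),
    }
}
```

For example, `expected_message(&["Utf8"])` names the single datatype, while passing the four offset-based list encodings produces a bracketed list.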
…_datatypes` `downcast` is now the single validation+downcast hook: it rejects a wrong datatype (returning `WrongDatatype`), *including* parameters the concrete arrow array's Rust type doesn't encode — a `FixedSizeBinary`/`FixedSizeList` size and a `Timestamp` timezone now get an explicit check; everything else is covered by `downcast_array` (leaf type) plus recursion (nested element types). `try_new` no longer pre-checks via `matches`. Removes `LogicalType::matches`, `LogicalType::supported_datatypes`, and the `datatypes_compatible` helper. `ColumnError`/`ErrorKind::WrongDatatype` now carry just `actual: DataType` (the expected type is the caller's `L`, known from context). Custom multi-datatype types (e.g. the `AnyInt` test) and `AnyList` do their acceptance directly in `downcast`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
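To illustrate why `downcast` needs explicit parameter checks, here is a rough sketch using a toy `DataType` enum instead of arrow's (the real `Timestamp` also carries a time unit). The function names are hypothetical; the point is only that a size or timezone is not part of the array's Rust type, so it must be compared by hand.

```rust
// Toy stand-in for arrow's `DataType` (assumption: simplified variants).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    FixedSizeBinary(i32),
    Timestamp(Option<String>),
}

/// Error carrying only the actual datatype, as at this commit.
#[derive(Debug, PartialEq)]
struct WrongDatatype {
    actual: DataType,
}

/// Accepts only `FixedSizeBinary` with the size `N` expected by the caller.
fn downcast_fixed_size_binary<const N: i32>(actual: &DataType) -> Result<(), WrongDatatype> {
    match actual {
        DataType::FixedSizeBinary(size) if *size == N => Ok(()),
        other => Err(WrongDatatype { actual: other.clone() }),
    }
}

/// Accepts only timestamps whose timezone is exactly "UTC".
fn downcast_utc_timestamp(actual: &DataType) -> Result<(), WrongDatatype> {
    match actual {
        DataType::Timestamp(Some(tz)) if tz.as_str() == "UTC" => Ok(()),
        other => Err(WrongDatatype { actual: other.clone() }),
    }
}
```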
`ColumnError`/`ErrorKind::WrongDatatype` carry `expected: String` again, but now it is produced by the failing `downcast` itself (no `matches`/`supported_datatypes` needed): `downcast_array` takes a lazy `|| -> String` describing the expected datatype, and the parameter checks (`FixedSizeBinary`/`FixedSizeList` size, `Timestamp` timezone) and custom/`AnyList` downcasts supply their own. The error points at the precise failing level (e.g. inner `Utf8` for a `List<Utf8>`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
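A small sketch of the lazy `expected` description, under the assumption of a toy `DataType` enum and hypothetical helper names (`check_leaf`, `downcast_utf8`, `downcast_list_of_utf8`). The closure only runs on failure, and the error is produced at the nesting level that actually failed.

```rust
// Toy stand-in for arrow's `DataType` (assumption: simplified variants).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<DataType>),
}

#[derive(Debug, PartialEq)]
struct WrongDatatype {
    expected: String,
    actual: DataType,
}

/// Leaf check: the `expected` closure is only called when the check fails.
fn check_leaf(
    actual: &DataType,
    wanted: &DataType,
    expected: impl FnOnce() -> String,
) -> Result<(), WrongDatatype> {
    if actual == wanted {
        Ok(())
    } else {
        Err(WrongDatatype { expected: expected(), actual: actual.clone() })
    }
}

fn downcast_utf8(actual: &DataType) -> Result<(), WrongDatatype> {
    check_leaf(actual, &DataType::Utf8, || "Utf8".to_string())
}

/// Recurses into the element type, so a bad inner type is reported at the inner level.
fn downcast_list_of_utf8(actual: &DataType) -> Result<(), WrongDatatype> {
    match actual {
        DataType::List(inner) => downcast_utf8(inner),
        other => Err(WrongDatatype { expected: "List".to_string(), actual: other.clone() }),
    }
}
```

Handing a `List` of `Int32` to `downcast_list_of_utf8` yields an error whose `expected` is `"Utf8"` and whose `actual` is `Int32`, rather than a mismatch reported for the whole list.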
Validate the datatype (via `downcast`) before the top-level null check, so a wrong-typed array that also has nulls reports the datatype mismatch rather than masking it as `UnexpectedNulls`. Matches the pre-refactor behavior (which ran `matches` first). Added a test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
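The ordering fix can be pictured with a deliberately tiny model. `ToyArray` and this `try_new` are hypothetical stand-ins, not the crate's types; the sketch only shows that checking the datatype first keeps a mismatch from being masked by the null check.

```rust
#[derive(Debug, PartialEq)]
enum ColumnError {
    WrongDatatype,
    UnexpectedNulls,
}

/// Hypothetical array summary: whether its datatype is the expected one,
/// and how many top-level nulls it contains.
struct ToyArray {
    datatype_ok: bool,
    null_count: usize,
}

/// Validates the datatype before looking at nulls.
fn try_new(array: &ToyArray) -> Result<(), ColumnError> {
    if !array.datatype_ok {
        return Err(ColumnError::WrongDatatype);
    }
    if array.null_count > 0 {
        return Err(ColumnError::UnexpectedNulls);
    }
    Ok(())
}
```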
It is outside arrow's core data model (a quiver-only logical type with no single arrow datatype), so it warrants its own short section: accepts any list encoding, reads uniformly, parse-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 9, 2026
Arrow has five physically different ways to store the same logical thing, a column of lists of `L`: `List`, `LargeList`, `ListView`, `LargeListView`, and `FixedSizeList`. They already share the same `Datatype::Value` (`ListValue<'a, L>`) and `Owned` (`Vec<L::Owned>`); only the physical encoding differs.

`AnyList<L>` unifies them: `Column<AnyList<L>>` accepts whichever encoding it is handed at parse time and reads them all uniformly, dispatching at runtime over an internal enum of the per-encoding typed representations.

Design

- `AnyTypedList<L>` is an enum over the five typed reps. Element access (value / is_null / iterate) delegates to whichever encoding is present, reusing the existing `Datatype` impls for the four variable-length encodings, and reading the per-row length at runtime for `FixedSizeList` (whose size is a const generic the dynamic type can't carry).
- An enum rather than a `dyn` trait object: no boxing, and the uniform `Value` type composes (nesting, `Option`, etc.).
- `matches()` accepts all five datatypes with a matching inner `L`; `build` / `from_values` emits the canonical `List`.
- `Clone` is hand-written (deriving would add a spurious `L: Clone` bound; only `L::Typed: Clone` is needed), consistent with `TypedList` & co.

Verification

- `cargo clippy --all-features --all-targets` clean, `cargo fmt --all` applied
- `cargo test --all-features` green, incl. a new `any_list_columns` test: build → canonical `List`; `try_from` each of the five encodings read uniformly; non-list rejected; item-nullability enforced regardless of encoding; null rows via `Option<AnyList<…>>`
- `cargo doc --document-private-items -D warnings` clean

🤖 Generated with Claude Code
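The enum dispatch from the Design notes can be sketched with plain vectors. This is an assumption-laden toy: it is fixed to `i32` elements, ignores nulls, folds the five encodings into three representative layouts, and the variant and field names are invented for the example rather than taken from the crate.

```rust
/// Toy counterpart of an enum over per-encoding representations.
#[derive(Debug, Clone)]
enum AnyTypedList {
    /// Offset-based (like `List`/`LargeList`): row i is values[offsets[i]..offsets[i + 1]].
    Offsets { offsets: Vec<usize>, values: Vec<i32> },
    /// View-based (like `ListView`/`LargeListView`): row i is values[starts[i]..starts[i] + sizes[i]].
    View { starts: Vec<usize>, sizes: Vec<usize>, values: Vec<i32> },
    /// Fixed-size: the per-row length is a runtime field, not a const generic.
    FixedSize { size: usize, values: Vec<i32> },
}

impl AnyTypedList {
    /// Returns row `i` as a slice, whatever the physical layout is.
    fn value(&self, i: usize) -> &[i32] {
        match self {
            Self::Offsets { offsets, values } => &values[offsets[i]..offsets[i + 1]],
            Self::View { starts, sizes, values } => &values[starts[i]..starts[i] + sizes[i]],
            Self::FixedSize { size, values } => &values[i * size..(i + 1) * size],
        }
    }
}
```

All three layouts return the same slice type from `value`, which is what lets callers read rows without knowing which encoding was parsed.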