add_columns should support reading existing Blob v2 columns in UDFs #7168

@yyzhao2025

Problem

Dataset.add_columns(..., read_columns=["blob"]) fails when blob is an existing Blob v2 column.

This affects UDF-based schema evolution where the new column is derived from an existing Blob v2 descriptor column.

Reproduction

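import tempfile
from pathlib import Path

import pyarrow as pa
import lance

# Setup assumed here (not part of the original report): a scratch directory
# for the dataset, plus an external base directory holding one blob file.
tmp = Path(tempfile.mkdtemp())
external_base = tmp / "external"
external_base.mkdir()
external_blob = external_base / "blob.bin"
external_blob.write_bytes(b"external")
uri = str(tmp / "dataset.lance")

# One value per Blob v2 storage kind: inline, packed, dedicated, external.
values = [
    b"inline",
    b"p" * (64 * 1024 + 1024),
    b"d" * (4 * 1024 * 1024 + 1024),
    external_blob.as_uri(),
]

ds = lance.write_dataset(
    pa.table({"id": range(4), "blob": lance.blob_array(values)}),
    uri,
    data_storage_version="2.2",
    initial_bases=[
        lance.DatasetBasePath(external_base.as_uri(), name="external", id=1)
    ],
)

# The UDF derives a new column from the "kind" field of the blob descriptor.
@lance.batch_udf(output_schema=pa.schema([pa.field("blob_kind", pa.int32())]))
def blob_kind(batch):
    return pa.record_batch([batch["blob"].field("kind")], ["blob_kind"])

# Fails: the input scan cannot read the existing Blob v2 column.
ds.add_columns(blob_kind, read_columns=["blob"])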

Error

OSError: Invalid user input: there were more fields in the schema than provided column indices / infos,
rust/lance-encoding/src/decoder.rs:454:13

Expected behavior

add_columns should be able to read an existing Blob v2 column as a descriptor struct when it is listed in read_columns.

The UDF should receive the Blob v2 descriptor batch, and derived columns should be written successfully.

Notes

This is separate from writing new Blob v2 columns through add_columns.

The existing Blob v2 add_columns tests cover writing new Blob v2 values through RecordBatchReader and BatchUDF, including inline, packed, dedicated, and external blobs. This issue is about reading an existing Blob v2 column during the UDF input scan, which those tests do not exercise.
