Projection pushdown to parquet #36315
82b78f4 to af01356
/// un-demanded columns surface as `Datum::Dummy` (which the MFP cannot
/// observe by definition of `demand`).
At one point we had the goal of removing `Datum::Dummy` from Materialize because it causes all kinds of trouble. There's an alternative: rewrite the MFP to project away the columns it doesn't require.
/// columns whose values won't be read downstream. The contained indices refer to
/// the schema's input column ordering and are stored in ascending order with no
/// duplicates so callers can use `binary_search` for membership tests.
pub type ColumnDemand = Arc<Vec<usize>>;
If you want a more compact representation:
- pub type ColumnDemand = Arc<Vec<usize>>;
+ pub type ColumnDemand = Arc<[usize]>;
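A minimal sketch of how the suggested alias could be built and queried, assuming the demand starts out as a `Vec<usize>`. The helper `make_demand` is hypothetical and not part of the PR; `Arc::from` on a `Vec` and `binary_search` on a slice are standard library calls.

```rust
use std::sync::Arc;

/// Suggested alias: the reference count and the indices share one allocation.
pub type ColumnDemand = Arc<[usize]>;

/// Hypothetical helper (not from the PR): sorts and deduplicates the indices
/// so that `binary_search` is valid on the result.
pub fn make_demand(mut indices: Vec<usize>) -> ColumnDemand {
    indices.sort_unstable();
    indices.dedup();
    // Moves the elements into a freshly allocated `Arc<[usize]>`.
    Arc::from(indices)
}

/// Membership test against the sorted slice.
pub fn is_demanded(demand: &ColumnDemand, idx: usize) -> bool {
    demand.binary_search(&idx).is_ok()
}
```

Call sites that only need to read the indices can keep taking `&[usize]`, since `Arc<[usize]>` dereferences to a slice.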
@claude review
Claude finished @DAlperin's task in 5m 12s
PR Review: Projection pushdown to parquet
Good work overall; the design is sound and the comments are thorough. Below are my findings, roughly ordered by severity.
Correctness
1. Filter pushdown audit blobs still get projection applied
PendingPart::Unparsed { blob, row_demand } => {
*self = PendingPart::Parsed {
part: blob.parse_with_demand(row_demand.as_ref()), // ← no audit check
};
The newly added doc comment on
But
2. Migration override condition for
) if val_migration.is_no_op()
&& key_migration.pure_drops_under_source_data_ok().is_some() =>
{
(Some(demand), PartMigration::SameSchema { both: read.clone() })
}
This replaces the migration with
3. No integration test for the
The unit tests in
Type Design
4.
The reviewer comment is correct.
// current: two allocations
pub type ColumnDemand = Arc<Vec<usize>>;
// better: single allocation; slice coercion works via Arc::from(&[..])
pub type ColumnDemand = Arc<[usize]>;
Constructing:
5.
This trait method is added as a hook for demand-aware decoders, but no code in the codebase calls it (only
Design / Plumbing Concerns
6.
The inline comment acknowledges:
7.
The
Minor
8. Metric skipped-bytes on negative compressed sizes
parquet
.projection_skipped_bytes_compressed
.inc_by(compressed.try_into().unwrap_or(0));
Parquet's
9.
This looks like a nice cleanup but seems unrelated to the PR. Fine to include, but a separate commit message entry would keep the history clean.
10.
Summary
The core approach is solid, using
When a source/MV reader's `RelationDesc` is narrower than the schema
that produced the part (typically because `compute_apply_column_demands`
pre-narrowed it), the persist source still decompresses and arrow-decodes
every column chunk from parquet before schema migration drops the
un-demanded columns. The bytes themselves arrive as part of the persist
blob get either way (persist does not range-read column chunks); what's
wasted is the decompression and decoding work.
Push column projection down into the parquet reader so un-demanded
sub-fields of `k_s` (the row struct on the key side of source/MV
shards) are never decompressed or arrow-decoded. The demand is the
set of stable indices in the read `RelationDesc`, so the mask matches
exactly what the downstream decoder expects. The mask handles the
`Result<Row, _>` envelope, where row sub-fields appear at
`k_s/ok/<idx>` rather than directly under `k_s`.
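The leaf selection described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `build_row_projection_mask`: leaves are represented as slash-separated path strings rather than a parquet `SchemaDescriptor`, the function names are hypothetical, and leaves that are not row sub-fields are assumed to always be kept.

```rust
/// Extracts the row sub-field index from a leaf path, handling both the bare
/// layout (`k_s/<idx>/...`) and the `Result<Row, _>` envelope
/// (`k_s/ok/<idx>/...`). Returns `None` for leaves that are not row sub-fields.
fn row_field_index(path: &str) -> Option<usize> {
    let mut parts = path.split('/');
    if parts.next()? != "k_s" {
        return None;
    }
    let first = parts.next()?;
    let idx = if first == "ok" { parts.next()? } else { first };
    idx.parse().ok()
}

/// Returns the positions of the leaves to keep. `demand` must be sorted
/// ascending. Leaves that are not row sub-fields are always kept, which is an
/// assumption of this sketch.
fn select_leaves(leaf_paths: &[&str], demand: &[usize]) -> Vec<usize> {
    leaf_paths
        .iter()
        .enumerate()
        .filter(|(_, path)| match row_field_index(path) {
            Some(idx) => demand.binary_search(&idx).is_ok(),
            None => true,
        })
        .map(|(pos, _)| pos)
        .collect()
}
```

Nested leaves such as `k_s/ok/2/list/element` follow their top-level sub-field, so a demanded column keeps all of its leaves and an un-demanded one drops all of them.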
When the part's schema migration consists purely of dropping columns
under `k_s/ok` (checked via `Migration::pure_drops_under_source_data_ok`),
`parse_internal` overrides it to `SameSchema` because the
post-projection arrays already match the read shape. Migrations that
add fields, alter nullability, recurse into nested types, or do
Map -> List conversions cannot be replaced by projection alone; on
those parts we fall back to no projection and run the migration
normally.
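The fallback rule can be summarised with a small decision function. This is a simplified sketch: the `Migration` enum and `plan_read` below are hypothetical stand-ins for the real `PartMigration` handling in `parse_internal`, and treating an already-identical schema as projectable is an assumption of the sketch.

```rust
/// Hypothetical, simplified stand-in for a part's schema migration.
#[derive(Debug, Clone, PartialEq)]
enum Migration {
    /// Write and read schemas are identical.
    SameSchema,
    /// The migration only drops columns directly under `k_s/ok`.
    PureDrops,
    /// Adds fields, alters nullability, recurses into nested types, and so on.
    Other,
}

/// Returns the demand to push into the parquet reader (if any) together with
/// the migration that still has to run on the decoded arrays.
fn plan_read(
    demand: Option<Vec<usize>>,
    migration: Migration,
) -> (Option<Vec<usize>>, Migration) {
    match (demand, migration) {
        // Projection already yields the read shape, so nothing is left to migrate.
        (Some(d), Migration::SameSchema | Migration::PureDrops) => {
            (Some(d), Migration::SameSchema)
        }
        // Projection cannot replace this migration: decode everything and migrate.
        (_, m) => (None, m),
    }
}
```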
The wins are decompression CPU and peak transient memory: un-demanded
column chunks are never materialized as arrow arrays, so per-part peak
allocation drops by the size of the dropped columns. There are no
network savings -- column chunks are read as part of the persist blob
get regardless; only their decompression and decoding are skipped.
Metrics: mz_persist_parquet_projection_applied_count,
mz_persist_parquet_projection_no_op_count, and the
projection_skipped_bytes_{compressed,uncompressed} counters track
projection outcomes.
af01356 to 3b727e1
0dfae65 to a1bb6cc
a1bb6cc to b159aeb
mtabebe left a comment
Comments from me. Nothing too major, though.
/// Carried alongside read-path operations to enable parquet column projection
/// pushdown: leaves under un-demanded stable indices are skipped before
/// decompression and arrow decode. The contained indices refer to the schema's
/// input column ordering and are stored in ascending order with no duplicates
If sorting is a key property, this isn't enforced or communicated in the type.
I know there is a single place constructing this now, but I think we are losing that information, since callers explicitly call `binary_search`.
Maybe it is overkill, but I would strengthen this type: make the constructor take the sorted input, and move `contains` onto the type so the search is hidden.
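One way the stronger type could look, as a sketch only. The method names are hypothetical, and this version sorts and deduplicates inside the constructor instead of requiring sorted input, which is one of several reasonable choices.

```rust
use std::sync::Arc;

/// Sorted, deduplicated set of demanded column indices. The invariant is
/// established in `new` and cannot be broken from outside the type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ColumnDemand(Arc<[usize]>);

impl ColumnDemand {
    /// Accepts indices in any order; sorting and deduplication happen here.
    pub fn new(mut indices: Vec<usize>) -> Self {
        indices.sort_unstable();
        indices.dedup();
        ColumnDemand(Arc::from(indices))
    }

    /// Membership test; callers never see the underlying binary search.
    pub fn contains(&self, idx: usize) -> bool {
        self.0.binary_search(&idx).is_ok()
    }

    /// Read-only view of the indices in ascending order.
    pub fn as_slice(&self) -> &[usize] {
        &self.0
    }
}
```

With this shape, a caller that builds the demand from `RelationDesc::iter_all` would pay for one redundant sort on already-sorted input, in exchange for not having to document the ordering at every construction site.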
// `RelationDesc::iter_all` walks `metadata` (a `BTreeMap<ColumnIndex, _>`)
// in stable-index order, so collecting straight into a `Vec` preserves the
// sorted-ascending invariant `ColumnDemand` requires for `binary_search`.
let column_demand: Option<ColumnDemand> = if cfg.storage_source_enable_column_projection() {
Clean and helpful comment, I appreciate this 👍
r: impl parquet::file::reader::ChunkReader + 'static,
format_metadata: Option<&ProtoFormatMetadata>,
metrics: &ColumnarMetrics,
row_demand: Option<&[usize]>,
Why did you lose the type here?
builder.with_projection(projection.mask)
}
None => {
// Demand was supplied but every k_s sub-field present in the
/// are the user column index as a string (e.g. `"0"`, `"1"`, ...), matching
/// the layout produced by [`mz_persist_types::columnar::Schema`]
/// implementations on `RelationDesc`.
fn build_row_projection_mask(schema: &SchemaDescriptor, demand: &[usize]) -> Option<RowProjection> {
This function is the bulk of the logic for this change: it decides what is masked and what is not. I'd love to see some lower-level unit tests of it.
I guess there are some tests of `decode_trace_parquet_with_demand`, which calls this two layers deep, but this is sufficiently complex that I think it should be tested directly.
// indices that name the leaves under `k_s/ok` in the blob, which
// `apply_demand` preserves on the read schema, so the post-projection
// arrays always match the read shape regardless of whether the
// underlying part needs schema migration. The codec-only path (no
Do you have tests for this case? It wasn't obvious to me.
None
};

// Decide whether projection pushdown can substitute for the part's
Just restating so I understand: when a schema migration occurs (adding or removing columns), we can't assume the masked columns still refer to the same thing. It is only valid to keep the masking if the projection remains the same.
That makes sense. Would it be useful to add a metric for when we drop the masking in these cases?
It feels like it should spike and then go back to 0 later, right?