Implementing BasicEncryptor with the optimized TypedBuffers#226
Conversation
avalerio-tkd
commented
Mar 8, 2026
- Adding new BasicXOREncryptor implementation (transitional name before renaming to BasicEncryptor).
- Implementing BasicXOREncryptor with typed buffer optimizations.
- Added methods to TypedBuffer to get the write span for an element during SetElement calls
- Added mehtods to TypedBuffer for getters of attributes.
…re renaming to BasicEncryptor). - Implementing BasicXOREncryptor with typed buffer optimizations. - Added methods to TypedBuffer to get the write span for an element during SetElement calls - Added mehtods to TypedBuffer for getters of attributes.
- Updating unittests - Including a temporary XorEncryptorInterface during the migration.
argmarco-tkd
left a comment
There was a problem hiding this comment.
Thank you for this. Overall the code seems correct, but I do have concerns around code complexity (left a few specific comments)
| << " user=" << user_id_ << " key=" << key_id_ | ||
| << " datatype=" << dbps::enum_utils::to_string(datatype_) << std::endl; | ||
|
|
||
| return std::visit([&](const auto& input_buffer) -> std::vector<uint8_t> { |
There was a problem hiding this comment.
can we simplify this code? (e.g. remove the use of lambdas?) BasicEncryptor is supposed to be an example encryptor - this implementation makes legibility a bit hard.
There was a problem hiding this comment.
Great comment. Totally agree on keeping BasicEncryptor as readable as possible.
However, we can't remove this one since it's a needed visit due to TypeBuffer overloading. There are workarounds but all end up doing a visit somewhere, just placed somewhere else. This is in general a known boilerplate pattern for accessing variant types in cpp.
There was a problem hiding this comment.
Discussed offline. Added a TODO note in-place for this and will address it in a followup cleanup.
| encrypted_bytes, kFixedHeaderLength, | ||
| RawBytesFixedSizedCodec{header.element_size}}; | ||
|
|
||
| auto decrypt_fixed_into = [&](auto output) { |
There was a problem hiding this comment.
simillar than for the encrypt. Can we optimize for legibility here?
There was a problem hiding this comment.
Moved it to a separate helper function.
| return output; | ||
| }; | ||
|
|
||
| // TODO: This is leaking Parquet-specific types into the encryptor, which should be agnostic of Parquet. |
There was a problem hiding this comment.
thanks for the call out. why was this not needed before?
There was a problem hiding this comment.
The previous implementation had the same dependency, just that it was harder to read because it was used indirectly by a helper function on parquet_utils, so harder to detect. I didn't realize it either.
I have a possible solution in mind that we can discuss offline. The gist is we can add a type annotation to the output. This can come from the Codec that generates it, could be as simple as a unique byte value. This would be protected by the version check if the Codec code changes.
There was a problem hiding this comment.
Discussed offline. The TODO note capture the pending item. Will leave as-is for this PR and will address in a followup.
| void SetRawElement(size_t position, tcb::span<const uint8_t> raw); | ||
|
|
||
| // Getters for immediately available properties. | ||
| size_t GetSpanSize() const { return elements_span_.size(); } |
There was a problem hiding this comment.
GetSpanSize seems like leakage of impl details - can a simple 'GetSize()` do?
There was a problem hiding this comment.
Renamed to GetRawBufferSize(). GetSize() is ambiguous/too close to other values related to ElementSize.
avalerio-tkd
left a comment
There was a problem hiding this comment.
Thanks for the review and the comments. Will followup to chat offline and make decisions of code complexity VS efficiencies trade-offs, all valid concerns on the review.
| << " user=" << user_id_ << " key=" << key_id_ | ||
| << " datatype=" << dbps::enum_utils::to_string(datatype_) << std::endl; | ||
|
|
||
| return std::visit([&](const auto& input_buffer) -> std::vector<uint8_t> { |
There was a problem hiding this comment.
Great comment. Totally agree on keeping BasicEncryptor as readable as possible.
However, we can't remove this one since it's a needed visit due to TypeBuffer overloading. There are workarounds but all end up doing a visit somewhere, just placed somewhere else. This is in general a known boilerplate pattern for accessing variant types in cpp.
| encrypted_bytes, kFixedHeaderLength, | ||
| RawBytesFixedSizedCodec{header.element_size}}; | ||
|
|
||
| auto decrypt_fixed_into = [&](auto output) { |
There was a problem hiding this comment.
Moved it to a separate helper function.
| return output; | ||
| }; | ||
|
|
||
| // TODO: This is leaking Parquet-specific types into the encryptor, which should be agnostic of Parquet. |
There was a problem hiding this comment.
The previous implementation had the same dependency, just that it was harder to read because it was used indirectly by a helper function on parquet_utils, so harder to detect. I didn't realize it either.
I have a possible solution in mind that we can discuss offline. The gist is we can add a type annotation to the output. This can come from the Codec that generates it, could be as simple as a unique byte value. This would be protected by the version check if the Codec code changes.
| void SetRawElement(size_t position, tcb::span<const uint8_t> raw); | ||
|
|
||
| // Getters for immediately available properties. | ||
| size_t GetSpanSize() const { return elements_span_.size(); } |
There was a problem hiding this comment.
Renamed to GetRawBufferSize(). GetSize() is ambiguous/too close to other values related to ElementSize.
- Added protection to GetWriteSpanForElement to prevent an object copy by misbehaved callers.
|
Updated after offline discussions. @argmarco-tkd could you PTAL? |
argmarco-tkd
left a comment
There was a problem hiding this comment.
Thanks for the discussion and the changes. LGTM, Ship it!
- Added cached current_element_size_ to TypedBuffer iterator.