Update ByteBuffer to use templated codecs typed access #224

Merged
avalerio-tkd merged 11 commits into main from av_typelist_optimizing_059
Mar 6, 2026
Conversation

@avalerio-tkd

Collaborator

No description provided.

@argmarco-tkd argmarco-tkd left a comment

Collaborator

Thank you for this. A bit complex, but necessary. Overall LGTM, but left a few questions.

Comment thread src/processing/typed_buffer_values.h
Comment thread src/processing/typed_buffer_values.h
Comment thread src/processing/typed_buffer.h
Comment thread src/processing/typed_buffer.h
Comment thread src/processing/typed_buffer_codecs.h
Comment on lines +49 to +50
T value;
std::memcpy(&value, read_span.data(), sizeof(T));

Collaborator

This makes a lot of sense for un-encoded types (such as byte arrays or similar). However, how are we handling the case where a type (e.g. INT32) is encoded differently from its in-memory representation (e.g. little-endian encoding)? The code is a bit hard to follow for that case.

Collaborator

(similar for Encode)

Collaborator Author

Thanks for catching that! I'll address endianness in a follow-up PR. This PR was already pretty loaded, so I decided to handle it separately. Added a TODO note for this.

Basically, we will reuse the existing bytes_utils.h methods for this.
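
For reference while the follow-up is pending, here is a minimal sketch of host-independent little-endian decode/encode for fixed-size integers. It assumes the encoded form is little-endian, as in the INT32 example raised above. `DecodeLittleEndian`, `EncodeLittleEndian`, and `RoundTripExample` are hypothetical names for illustration only; they are not the bytes_utils.h methods, whose signatures are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not the bytes_utils.h API): reads sizeof(T) bytes
// stored least-significant byte first and rebuilds the value with shifts,
// so the result does not depend on the host byte order.
template <typename T>
T DecodeLittleEndian(const uint8_t* data) {
  static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>,
                "integer types only");
  using U = std::make_unsigned_t<T>;
  U bits = 0;
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    bits = static_cast<U>(bits | (static_cast<U>(data[i]) << (8 * i)));
  }
  T value;
  std::memcpy(&value, &bits, sizeof(T));  // reinterpret the bits as T
  return value;
}

// Hypothetical helper: writes `value` least-significant byte first.
template <typename T>
void EncodeLittleEndian(T value, uint8_t* out) {
  static_assert(std::is_integral_v<T> && !std::is_same_v<T, bool>,
                "integer types only");
  using U = std::make_unsigned_t<T>;
  U bits;
  std::memcpy(&bits, &value, sizeof(T));
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    out[i] = static_cast<uint8_t>(bits >> (8 * i));
  }
}

// Usage: the wire bytes 78 56 34 12 decode to 0x12345678 on any host.
void RoundTripExample() {
  const uint8_t wire[4] = {0x78, 0x56, 0x34, 0x12};
  assert(DecodeLittleEndian<int32_t>(wire) == 0x12345678);

  uint8_t out[4] = {0, 0, 0, 0};
  EncodeLittleEndian<int32_t>(-2, out);
  assert(DecodeLittleEndian<int32_t>(out) == -2);
}
```

A plain `std::memcpy` into `T`, as in the lines under review, gives the same result only when the host byte order matches the encoded byte order.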

size_t element_size_bytes_;
};

struct StringVariableSizedCodec {

Collaborator

nit: why 'String'? ('string' nomenclature has the association with "human readable stuff")

Collaborator Author

It's called String for legacy reasons in our own code. Since we've been representing it as C++ strings or string_views, the name sounded natural, but it's a valid point. If it's cumbersome, I can rename it to ByteArray, maybe? Though ByteArray is too close to the Parquet-specific name, and this library is agnostic of Parquet, compression, etc.

Collaborator

VariableSizedCodec may be enough (at least this way we avoid perpetuating the "legacy" string nomenclature)

I prefer to move away from 'string' - but also OK to keep as-is.

Collaborator Author

Already discussed, but the name actually reflects the type more closely (string_view), so keeping it as-is. The RawBytes types more closely represent an unformatted variable-sized or fixed-sized byte sequence.

Comment thread src/processing/typed_buffer_codecs.h
Comment thread src/processing/typed_buffer.h

@avalerio-tkd avalerio-tkd left a comment

Collaborator Author

Thanks @argmarco-tkd for the review! Unit tests added. Could you PTAL?

Comment thread src/processing/typed_buffer.h
Comment thread src/processing/typed_buffer.h
Comment thread src/processing/typed_buffer.h
Comment thread src/processing/typed_buffer_codecs.h
Comment thread src/processing/typed_buffer_values.h

@argmarco-tkd argmarco-tkd left a comment

Collaborator

Thanks.
Overall LGTM. Left a comment around the two new types introduced in the latest set of commits (TypedBufferRawBytesFixedSized and TypedBufferRawBytesVariableSized).

Comment thread src/processing/typed_buffer_values.h
Comment thread src/processing/typed_buffer.h
Comment thread src/processing/typed_buffer.h
// STRING VARIABLE-SIZED
// =============================================================================

TEST(TypedBufferValuesTest, StringVariableSized_ReadBack) {

Collaborator

It does not make a difference - but we should add at least one test here where the values are not human-readable 'strings' (this will help humans better understand that 'string' tests are just ByteArrays, and not necessarily human-readable strings.)

Collaborator Author

Added the test, but actually in the opposite direction (as already discussed). Strings are the human-readable ones, so I added some cases with UTF-8 chars.

Comment on lines +45 to +46
using TypedBufferRawBytesFixedSized = ByteBuffer<RawBytesFixedSizedCodec>;
using TypedBufferRawBytesVariableSized = ByteBuffer<RawBytesVariableSizedCodec>;

Collaborator

What is the difference between "String" and "RawBytes" here?

"RawBytes" makes sense for other types (e.g. INTs) where we do not want to 'interpret' or 'cast' the bytes into a type. But for byte-array type of data, I'm not sure what the difference is.

Collaborator Author

Discussed offline, but repeating here for the thread: RawBytes is an unformatted byte sequence. String is a full-fledged string, whose contents are interpreted as text (multi-byte symbols like UTF-8, zero-terminated formats, or others). Concretely, RawBytes returns a span over uint8, while StringBuffer returns an actual C++ string (a string_view).

These types are generic and not yet used in the Parquet library. Most likely we'll continue using RawBytes when reading the specific BYTE_ARRAY type from Parquet. StringBuffer will go unused for Parquet, but it is still handy if we need it.
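
To make the distinction concrete, here is a rough sketch of the two return shapes over the same stored bytes. It is not the code from this PR: `ByteView` is a hypothetical stand-in for a span over uint8, and `AsRawBytes`, `AsString`, and `ViewsExample` are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for a span over uint8 (pointer plus length).
struct ByteView {
  const uint8_t* data;
  std::size_t size;
};

// RawBytes-style access: hands back the bytes with no interpretation.
inline ByteView AsRawBytes(const uint8_t* data, std::size_t size) {
  return ByteView{data, size};
}

// String-style access: the same bytes exposed as character data.
inline std::string_view AsString(const uint8_t* data, std::size_t size) {
  return std::string_view(reinterpret_cast<const char*>(data), size);
}

void ViewsExample() {
  // 'h' followed by the two-byte UTF-8 encoding of U+00E9.
  const uint8_t bytes[3] = {0x68, 0xC3, 0xA9};

  const ByteView raw = AsRawBytes(bytes, 3);
  const std::string_view text = AsString(bytes, 3);

  // Both views alias the same storage and report the same byte length.
  assert(raw.size == 3 && raw.data[1] == 0xC3);
  assert(text.size() == 3);
  assert(text == "h\xC3\xA9");
}
```

Neither view copies the data; the difference is only in the element type the caller sees and in what the caller is expected to do with it.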

@sofia-tekdatum sofia-tekdatum left a comment

Collaborator

This is really excellent work. Thank you for this!!

@avalerio-tkd avalerio-tkd merged commit 0895a45 into main Mar 6, 2026
2 checks passed
@avalerio-tkd avalerio-tkd deleted the av_typelist_optimizing_059 branch March 6, 2026 22:58
3 participants