Update ByteBuffer to use templated codecs typed access by avalerio-tkd · Pull Request #224 · protegrity/DataBatchProtectionService

avalerio-tkd · 2026-03-06T03:46:44Z

No description provided.

argmarco-tkd

Thank you for this. A bit complex, but necessary. Overall LGTM, but left a few questions.

argmarco-tkd · 2026-03-06T16:30:22Z

+        T value;
+        std::memcpy(&value, read_span.data(), sizeof(T));


This makes a lot of sense for un-encoded types (such as byte arrays or similar). However, how are we handling the case where types (e.g. INT32) are encoded differently from their in-memory representation (i.e. little-endian encoding)? (The code is a bit hard to follow for that case.)

(Similar for Encode.)

avalerio-tkd · 2026-03-06T19:58:04Z

Thanks for catching that! I'll address endianness in a follow-up PR. This PR was already pretty loaded, so I decided to do that in a following one. Added a TODO note for this.

Basically, we will reuse the existing bytes_utils.h methods for this.
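To make the concern concrete: a plain `std::memcpy` into `T` yields the right INT32 only when the host byte order matches the encoded byte order. Below is a minimal sketch of a host-independent little-endian decode and encode. The helper names `DecodeInt32LE` and `EncodeInt32LE` are hypothetical; they are not the bytes_utils.h API, whose signatures are not shown in this thread. The sketch also assumes the encoded form is little-endian, as in the question above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from bytes_utils.h): decode a little-endian INT32
// from 4 bytes, independent of the host byte order.
inline std::int32_t DecodeInt32LE(const std::uint8_t* bytes) {
    assert(bytes != nullptr);
    // Shifts operate on values, not on memory layout, so the result is the
    // same on little-endian and big-endian hosts.
    const std::uint32_t raw =
        static_cast<std::uint32_t>(bytes[0]) |
        (static_cast<std::uint32_t>(bytes[1]) << 8) |
        (static_cast<std::uint32_t>(bytes[2]) << 16) |
        (static_cast<std::uint32_t>(bytes[3]) << 24);
    std::int32_t value;
    std::memcpy(&value, &raw, sizeof(value));  // reinterpret the bits as signed
    return value;
}

// Hypothetical helper: write an INT32 as 4 little-endian bytes.
inline void EncodeInt32LE(std::int32_t value, std::uint8_t* out) {
    assert(out != nullptr);
    std::uint32_t raw;
    std::memcpy(&raw, &value, sizeof(raw));
    out[0] = static_cast<std::uint8_t>(raw & 0xFF);
    out[1] = static_cast<std::uint8_t>((raw >> 8) & 0xFF);
    out[2] = static_cast<std::uint8_t>((raw >> 16) & 0xFF);
    out[3] = static_cast<std::uint8_t>((raw >> 24) & 0xFF);
}
```

On a little-endian host this produces the same result as the `memcpy` in the diff; the difference only shows up on big-endian hosts, which is why the issue is easy to miss in tests.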

argmarco-tkd · 2026-03-06T16:32:03Z

+        size_t element_size_bytes_;
+};
+
+struct StringVariableSizedCodec {


nit: why 'String'? ('string' nomenclature has the association with "human readable stuff")

avalerio-tkd · 2026-03-06T20:02:25Z

Called it 'String' for legacy reasons within our own code. Since we've been representing it as C++ strings or string_views, the name sounded natural, but it's a valid point. If it's cumbersome, I can rename it to ByteArray, maybe? Though ByteArray is too close to the Parquet-specific name, and this library is agnostic of Parquet, compression, etc.

argmarco-tkd · 2026-03-06T20:28:56Z

VariableSizedCodec may be enough (this way we at least avoid perpetuating the "legacy" string nomenclature).

I prefer to move away from 'string', but it's also OK to keep as-is.

Already discussed, but the name actually reflects the type more closely (string_view), so I'm keeping it as-is. RawBytes more closely represents an unformatted variable-sized or fixed-sized byte sequence.

avalerio-tkd

Thanks @argmarco-tkd for the review! Unittests added. Could you PTAL?

argmarco-tkd

Thanks.
Overall LGTM. Left a comment around the two new types introduced in the latest set of commits (TypedBufferRawBytesFixedSized and TypedBufferRawBytesVariableSized).

argmarco-tkd · 2026-03-06T21:06:59Z

+// STRING VARIABLE-SIZED
+// =============================================================================
+
+TEST(TypedBufferValuesTest, StringVariableSized_ReadBack) {


It does not make a difference - but we should add at least one test here where the values are not human-readable 'strings' (this will help humans better understand that 'string' tests are just ByteArrays, and not necessarily human-readable strings.)

Added the test, but actually to the contrary (as already discussed): strings are the human-readable ones, so I added some cases with UTF-8 characters.
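As an illustration of why UTF-8 cases are worth testing: a `std::string_view` measures its length in bytes, not in characters, so multi-byte symbols make the two counts diverge. The sketch below is not taken from the PR's tests; `CountUtf8CodePoints` and `Utf8Example` are hypothetical names, and the counter assumes its input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: count code points in a string assumed to be valid
// UTF-8 by skipping continuation bytes (those of the form 10xxxxxx).
inline std::size_t CountUtf8CodePoints(std::string_view s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or single-byte (ASCII) character
        }
    }
    return count;
}

// "café": the 'é' is encoded as the two bytes 0xC3 0xA9.
inline void Utf8Example() {
    const std::string_view cafe = "caf\xC3\xA9";
    assert(cafe.size() == 5);                // 5 bytes stored in the buffer
    assert(CountUtf8CodePoints(cafe) == 4);  // 4 human-readable characters
}
```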

argmarco-tkd · 2026-03-06T21:10:52Z

+using TypedBufferRawBytesFixedSized = ByteBuffer<RawBytesFixedSizedCodec>;
+using TypedBufferRawBytesVariableSized = ByteBuffer<RawBytesVariableSizedCodec>;


What is the difference between "String" and "RawBytes" here?

"RawBytes" makes sense for other types (e.g. INTs) where we do not want to 'interpret' or 'cast' the bytes into a type. But for byte-array type of data, I'm not sure what the difference is.

Discussed offline, but restating it for the thread: RawBytes is an unformatted byte sequence. String is a fully fledged string, which interprets multi-byte symbols such as UTF-8, zero-terminated formats, or others. Concretely, RawBytes returns a span over uint8, whereas StringBuffer returns an actual C++ string (a string_view).

These types are generic and not yet used in the Parquet library. Most likely we'll continue using RawBytes when reading the specific BYTE_ARRAY from Parquet. StringBuffer will go unused for Parquet, but it is still handy if we need it.
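The distinction described above can be sketched as two codecs that read the same bytes but expose different value types. This is only an illustration under stated assumptions: `ByteView`, `RawBytesCodecSketch`, `StringCodecSketch`, and `BufferSketch` are hypothetical names, not the PR's actual `RawBytesVariableSizedCodec`, `StringVariableSizedCodec`, or `ByteBuffer`. The pointer/size `ByteView` stands in for the span over uint8 mentioned in the reply so that the sketch compiles as C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for a span over uint8 (pointer plus length).
struct ByteView {
    const std::uint8_t* data;
    std::size_t size;
};

// Hypothetical codec: hands back the bytes uninterpreted.
struct RawBytesCodecSketch {
    using value_type = ByteView;
    static value_type Decode(const std::uint8_t* data, std::size_t size) {
        return ByteView{data, size};
    }
};

// Hypothetical codec: exposes the same bytes as character data.
struct StringCodecSketch {
    using value_type = std::string_view;
    static value_type Decode(const std::uint8_t* data, std::size_t size) {
        return std::string_view(reinterpret_cast<const char*>(data), size);
    }
};

// Hypothetical buffer over a single element; the codec decides the typed view.
template <typename Codec>
class BufferSketch {
public:
    BufferSketch(const std::uint8_t* data, std::size_t size)
        : data_(data), size_(size) {
        assert(data != nullptr || size == 0);
    }

    // The return type is chosen at compile time by the codec.
    typename Codec::value_type Get() const {
        return Codec::Decode(data_, size_);
    }

private:
    const std::uint8_t* data_;
    std::size_t size_;
};
```

With this shape, `BufferSketch<StringCodecSketch>` over the bytes `0x68 0x69` yields a `string_view` equal to `"hi"`, while `BufferSketch<RawBytesCodecSketch>` over the same bytes yields a two-byte view with no character interpretation attached.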

sofia-tekdatum

This is really excellent work. Thank you for this!!

avalerio-tkd added 7 commits March 5, 2026 21:43

- Updating ByteBuffer to use type-based codecs for getting and setting elements with type access.

7766921

- Fixing bad copy-paste error in ByteBuffer.h

1ab79b0

- Renaming byte_buffer.cpp to typed_buffer.h

d85e6ad

- Renaming byte_buffer.cpp to typed_buffer.h (again)

9cd45a0

- Reorganizing typed buffer code

003fa62

- Reorganizing typed buffer code (adding missing files)

97f1d85

- Compilation fixes in typed buffer code

8bfebae

avalerio-tkd requested review from argmarco-tkd and sofia-tekdatum March 6, 2026 13:27

argmarco-tkd reviewed Mar 6, 2026

avalerio-tkd added 3 commits March 6, 2026 13:11

- Updating byte_buffer unittests for the templated version.

3da8fa5

- Added unittests for the type-specific byte_buffer implementations.

a887e5d

- Minor code cleanup and inline comments.

82530a4

avalerio-tkd commented Mar 6, 2026

argmarco-tkd reviewed Mar 6, 2026

argmarco-tkd approved these changes Mar 6, 2026

sofia-tekdatum approved these changes Mar 6, 2026

- Adding UTF-8 tests for string variable-sized buffers

66de5de

avalerio-tkd merged commit 0895a45 into main Mar 6, 2026
2 checks passed

avalerio-tkd deleted the av_typelist_optimizing_059 branch March 6, 2026 22:58

avalerio-tkd mentioned this pull request Mar 19, 2026

Optimize memory buffers on DBPS EncryptionSequencer libraries #218

Closed
