Update ByteBuffer to use templated codecs typed access#224
Conversation
…g elements with type access.
argmarco-tkd
left a comment
There was a problem hiding this comment.
Thank you for this. A bit complex, but necessary. Overall LGTM, but left a few questions.
| T value; | ||
| std::memcpy(&value, read_span.data(), sizeof(T)); |
There was a problem hiding this comment.
this makes a lot of sense for un-encoded types (such as byte arrays or similar). However, how are we handling the case where types (e.g. INT32) are encoded differently than their in-memory representation (i.e. little endian encoding). (code is a bit hard to follow for that case).
There was a problem hiding this comment.
Thanks for catching that! I'll address endianness in a followup PR. The PR was already pretty loaded, so decided to do that in a following one. Added a TODO note for this.
Basically we will reuse the bytes_utils.h methods already for this.
| size_t element_size_bytes_; | ||
| }; | ||
|
|
||
| struct StringVariableSizedCodec { |
There was a problem hiding this comment.
nit: why 'String'? ('string' nomenclature has the association with "human readable stuff")
There was a problem hiding this comment.
Called string for legacy with our own code. Since we've been representing it as cpp strings or string_views sounded natural, but it's a valid point. If it's cumbersome, I can rename to ByteArray maybe? Though ByteArray is too close to Parquet specific name, and this library is agnostic of Parquet/compressions/etc..
There was a problem hiding this comment.
VariableSizedCodec may be enough (at least this way we avoid perpetuating the "legacy" string nomenclature)
I prefer to move away from 'string' - but also OK to keep as-is.
There was a problem hiding this comment.
Already discussed, but the name actually reflects the type more closely (string_view), so keeping it as-is. The RawBytes represent more closely the unformatted variable or fixed byte sequence.
avalerio-tkd
left a comment
There was a problem hiding this comment.
Thanks @argmarco-tkd for the review! Unittests added. Could you PTAL?
| T value; | ||
| std::memcpy(&value, read_span.data(), sizeof(T)); |
There was a problem hiding this comment.
Thanks for catching that! I'll address endianness in a followup PR. The PR was already pretty loaded, so decided to do that in a following one. Added a TODO note for this.
Basically we will reuse the bytes_utils.h methods already for this.
| size_t element_size_bytes_; | ||
| }; | ||
|
|
||
| struct StringVariableSizedCodec { |
There was a problem hiding this comment.
Called string for legacy with our own code. Since we've been representing it as cpp strings or string_views sounded natural, but it's a valid point. If it's cumbersome, I can rename to ByteArray maybe? Though ByteArray is too close to Parquet specific name, and this library is agnostic of Parquet/compressions/etc..
argmarco-tkd
left a comment
There was a problem hiding this comment.
Thanks.
Overall LGTM. Left a comment around the two new types introduced in the latest set of commits (TypedBufferRawBytesFixedSized and TypedBufferRawBytesVariableSized).
| size_t element_size_bytes_; | ||
| }; | ||
|
|
||
| struct StringVariableSizedCodec { |
There was a problem hiding this comment.
VariableSizedCodec may be enough (at least this way we avoid perpetuating the "legacy" string nomenclature)
I prefer to move away from 'string' - but also OK to keep as-is.
| // STRING VARIABLE-SIZED | ||
| // ============================================================================= | ||
|
|
||
| TEST(TypedBufferValuesTest, StringVariableSized_ReadBack) { |
There was a problem hiding this comment.
It does not make a difference - but we should add at least one test here where the values are not human-readable 'strings' (this will help humans better understand that 'string' tests are just ByteArrays, and not necessarily human-readable strings.)
There was a problem hiding this comment.
Added the test, but actually to the contrary (as already discussed). Strings are the human-readable ones, so added some cases with utf-8 chars.
| using TypedBufferRawBytesFixedSized = ByteBuffer<RawBytesFixedSizedCodec>; | ||
| using TypedBufferRawBytesVariableSized = ByteBuffer<RawBytesVariableSizedCodec>; |
There was a problem hiding this comment.
What is the difference between "String" and "RawBytes" here?
"RawBytes" makes sense for other types (e.g. INTs) where do not want to 'interpret' or 'cast' the bytes into a type. But for byte-array type of data, I'm not sure what the difference is.
There was a problem hiding this comment.
Discussed offline, but for the thread. RawBytes is an unformatted byte sequence. String is a fully fledged string, which interprets multi-byte symbols like utf-8 or zero-terminated formatted or other. Concretely RawBytes return a span over uint8, but StringBuffer returns an actual cpp string (a string_view).
These are types are generic, not used yet on Parquet library. Most likely we'll continue using RawBytes when reading the specific BYTE_ARRAY from Parquet. StringBuffer will go unused for Parquet, but it is still handy if we need it.
sofia-tekdatum
left a comment
There was a problem hiding this comment.
This is really excellent work. Thank you for this!!
No description provided.