SHAPE5 is a machine-readable specification of the HDF5 file format using GNU poke pickles.
The goal is to describe HDF5 on-disk structures as executable binary format definitions that can be loaded in GNU poke to inspect, validate, and reason about HDF5 files.
This repository is a work in progress. The current pickles focus on core HDF5 metadata structures, including the superblock, object headers, and related messages.
pickles/common.pk: shared helpers and common definitions
pickles/superblock.pk: HDF5 superblock definitions
pickles/ohdr.pk: object header definitions
pickles/messages.pk: object header message definitions
pickles/construct.pk: helpers for constructing version 2 metadata in memory
pickles/lookup3.pk: implementation of the lookup3 hash function used for checksums
This short tutorial shows how to explore the sample HDF5 file file.h5 from the poke REPL using the pickles in this repository. It assumes GNU poke is installed and that you start from the repository root.
Start poke with the repository pickles/ directory on the load path:
cd <THIS DIRECTORY>
POKE_LOAD_PATH=$PWD/pickles poke file.h5
At the (poke) prompt, load the pickles needed for the superblock and object headers:
load common
load superblock
load ohdr
load lookup3
These commands do not print anything on success; poke simply returns to the prompt.
The HDF5 superblock begins at byte offset 0:
var sb = superblock @ 0#B
sb.super_vers
var root_addr = bytes_to_off (sb.super.v2_v3.root_obj_addr_raw)
root_addr
Expected output:
(poke) sb.super_vers
2UB
(poke) root_addr
48UL#B
This tells us that file.h5 uses a version 2 superblock and that the root object header starts at byte offset 48.
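As a cross-check outside poke, the same fields can be read with a few lines of Python. This is a minimal sketch, assuming the version 2 superblock layout from the HDF5 file format specification with 8-byte offsets. The helper name parse_superblock_v2 and the dictionary keys are hypothetical and not part of this repository.

```python
import struct

# The 8-byte signature that starts every HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def parse_superblock_v2(buf: bytes) -> dict:
    """Decode a version 2 superblock that uses 8-byte offsets (hypothetical helper)."""
    if buf[:8] != HDF5_SIGNATURE:
        raise ValueError("missing HDF5 signature")
    # Four 1-byte fields follow the signature.
    version, sizeof_offsets, sizeof_lengths, flags = struct.unpack_from("<4B", buf, 8)
    if version != 2 or sizeof_offsets != 8:
        raise ValueError("this sketch only handles version 2 with 8-byte offsets")
    # Four little-endian 8-byte addresses, then a 4-byte checksum.
    base, ext, eof, root = struct.unpack_from("<4Q", buf, 12)
    (checksum,) = struct.unpack_from("<I", buf, 44)
    return {
        "version": version,
        "sizeof_offsets": sizeof_offsets,
        "sizeof_lengths": sizeof_lengths,
        "flags": flags,
        "base_addr": base,
        "ext_addr": ext,
        "eof_addr": eof,
        "root_obj_addr": root,
        "checksum": checksum,
    }
```

Calling parse_superblock_v2 on the first 48 bytes of file.h5 should report version 2 and a root object address of 48 if the assumptions above hold for that file.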
var root = ohdr @ root_addr
root
Expected output snippet:
ohdr {sig_peek=[79UB,72UB,68UB,82UB],_ohdr=struct {v2=struct {signature=[79UB,72UB,68UB,82UB],version=2UB,flags=32UB,timestamps=Timestamps {access=1773447782U,modification=1773447782U,change=1773447782U,birth=1773447782U},chunk0_size=[120UB],_msg_chunk=struct {msg_chunk=[2UB,18UB,0UB,0UB,0UB,0UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB,10UB,2UB,0UB,1UB,0UB,0UB,6UB,26UB,0UB,0UB,1UB,0UB,15UB,68UB,105UB,114UB,101UB,99UB,116UB,67UB,104UB,117UB,110UB,107UB,68UB,97UB,116UB,97UB,195UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,58UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB]},chksum=[7UB,68UB,33UB,252UB]}}}
We are looking at a version 2 object header. Unlike a version 1 object header, it ends with a checksum. You can verify the checksum with the lookup3_hashlittle function from lookup3.pk. First, read the stored checksum bytes as a single 32-bit value:
lookup3_u32_le(root._ohdr.v2.chksum)
Expected output:
4230038535U
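The four chksum bytes in the dump and the value printed here are related by plain little-endian decoding. The following Python sketch shows the arithmetic, assuming that lookup3_u32_le does nothing more than this, which is consistent with the numbers in this tutorial.

```python
# chksum bytes copied from the root object header dump
stored = bytes([7, 68, 33, 252])

# Little-endian: the first byte is the least significant one.
value = int.from_bytes(stored, "little")
by_hand = 7 + (68 << 8) + (33 << 16) + (252 << 24)

print(value)    # 4230038535
print(by_hand)  # 4230038535
```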
Let's calculate the checksum ourselves to see how it works. The checksum is computed over the entire object header (including the prefix) except for the 4-byte checksum field itself, which is located at the end of the header.
lookup3_hashlittle(byte[root'size as offset<uint<64>,B> - 4UL#B] @ root_addr, 0)
Expected output:
4230038535U
Phew! This confirms that the checksum is correct and that we understand how to compute it.
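For readers who want to reproduce the number outside poke, here is a Python sketch of the hash. It assumes that lookup3_hashlittle in lookup3.pk follows Bob Jenkins' reference hashlittle from lookup3.c; the Python function below is an independent illustration, not a translation of the pickle.

```python
MASK32 = 0xFFFFFFFF


def _rot(x: int, k: int) -> int:
    """Rotate a 32-bit value left by k bits."""
    return ((x << k) | (x >> (32 - k))) & MASK32


def _words(block: bytes):
    """Split a 12-byte block into three little-endian 32-bit words."""
    return tuple(int.from_bytes(block[i:i + 4], "little") for i in (0, 4, 8))


def hashlittle(data: bytes, initval: int = 0) -> int:
    """32-bit lookup3 hash, written after the reference hashlittle (sketch)."""
    a = b = c = (0xDEADBEEF + len(data) + initval) & MASK32
    pos = 0
    # Mix full 12-byte blocks; the last 1..12 bytes are kept for the final step.
    while len(data) - pos > 12:
        ka, kb, kc = _words(data[pos:pos + 12])
        a = (a + ka) & MASK32; b = (b + kb) & MASK32; c = (c + kc) & MASK32
        a = (a - c) & MASK32; a ^= _rot(c, 4);  c = (c + b) & MASK32
        b = (b - a) & MASK32; b ^= _rot(a, 6);  a = (a + c) & MASK32
        c = (c - b) & MASK32; c ^= _rot(b, 8);  b = (b + a) & MASK32
        a = (a - c) & MASK32; a ^= _rot(c, 16); c = (c + b) & MASK32
        b = (b - a) & MASK32; b ^= _rot(a, 19); a = (a + c) & MASK32
        c = (c - b) & MASK32; c ^= _rot(b, 4);  b = (b + a) & MASK32
        pos += 12
    tail = data[pos:]
    if not tail:
        return c  # empty input: the reference skips the final mixing
    # Zero-padding the tail is equivalent to adding only the bytes present.
    ka, kb, kc = _words(tail + bytes(12 - len(tail)))
    a = (a + ka) & MASK32; b = (b + kb) & MASK32; c = (c + kc) & MASK32
    # Final mixing of the three state words.
    c ^= b; c = (c - _rot(b, 14)) & MASK32
    a ^= c; a = (a - _rot(c, 11)) & MASK32
    b ^= a; b = (b - _rot(a, 25)) & MASK32
    c ^= b; c = (c - _rot(b, 16)) & MASK32
    a ^= c; a = (a - _rot(c, 4)) & MASK32
    b ^= a; b = (b - _rot(a, 14)) & MASK32
    c ^= b; c = (c - _rot(b, 24)) & MASK32
    return c
```

To mirror the poke expression, read the header bytes starting at offset 48 from file.h5, drop the last 4 bytes, and pass the rest to hashlittle with an initial value of 0. If the assumption about lookup3.pk holds, the result should equal the stored checksum.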
Print the root object header's decoded messages:
root.get_messages ()
Expected output snippet:
Message 0...
msg_prefix {
v2_msg_prefix=struct {
msg_type=2UB,
msg_size=18UH,
msg_flags=0UB
}
}
H5O_msg_linfo {
version=0UB,
flags=0UB,
fheap_addr_raw=[255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB],
name_bt2_addr_raw=[255UB,255UB,255UB,255UB,255UB,255UB,255UB,255UB]
}
Message 2...
msg_prefix {
v2_msg_prefix=struct {
msg_type=6UB,
msg_size=26UH,
msg_flags=0UB
}
}
H5O_msg_link {
version=1UB,
flags=0UB,
lnk_name=[68UB,105UB,114UB,101UB,99UB,116UB,67UB,104UB,117UB,110UB,107UB,68UB,97UB,116UB,97UB],
ohdr_addr_raw=[195UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB]
}
The interesting part here is the link message: the byte array in lnk_name is the ASCII string DirectChunkData, and ohdr_addr_raw points to the child object header at byte offset 195.
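The decoding described above can be sketched in Python. The sketch assumes the 4-byte version 2 message prefix seen in the output (type, 2-byte little-endian size, flags, no creation-order field) and a link message whose flags byte is 0, so the name length is stored in one byte and the object header address follows the name directly. Both function names are hypothetical.

```python
def walk_v2_messages(chunk: bytes):
    """Yield (msg_type, msg_flags, body) for each message in a v2 header chunk."""
    pos = 0
    while pos + 4 <= len(chunk):
        msg_type = chunk[pos]
        size = int.from_bytes(chunk[pos + 1:pos + 3], "little")
        msg_flags = chunk[pos + 3]
        yield msg_type, msg_flags, chunk[pos + 4:pos + 4 + size]
        pos += 4 + size


def decode_simple_link(body: bytes):
    """Decode a link message body with flags == 0 (hard link, 1-byte name length)."""
    version, flags, name_len = body[0], body[1], body[2]
    if version != 1 or flags != 0:
        raise ValueError("this sketch only handles version 1 links with flags == 0")
    name = body[3:3 + name_len].decode("ascii")
    addr = int.from_bytes(body[3 + name_len:3 + name_len + 8], "little")
    return name, addr


# Prefix, lnk_name, and ohdr_addr_raw values copied from the output above.
lnk_name = bytes([68, 105, 114, 101, 99, 116, 67, 104, 117, 110, 107, 68, 97, 116, 97])
ohdr_addr_raw = bytes([195, 0, 0, 0, 0, 0, 0, 0])
link_msg = bytes([6, 26, 0, 0, 1, 0, len(lnk_name)]) + lnk_name + ohdr_addr_raw

for msg_type, msg_flags, body in walk_v2_messages(link_msg):
    print(msg_type, decode_simple_link(body))  # 6 ('DirectChunkData', 195)
```

The body length works out to 1 + 1 + 1 + 15 + 8 = 26 bytes, which matches msg_size in the prefix.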
Now map that child object header and decode its messages:
var dset = ohdr @ 195#B
dset.get_messages ()
Expected output snippet:
Message 0...
H5O_msg_sdspace {
version=2UB,
space=struct {
v2=struct {
ndims=2UB,
flags=1UB,
space_type=1UB,
dim_size=[8UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB],
max=[8UB,0UB,0UB,0UB,0UB,0UB,0UB,0UB]
}
}
}
Message 1...
H5O_msg_dtype {
hdr=dtype_hdr {
flags=2064U,
elm_size=4U
},
types=struct {
fixed_point=struct {
bit_offset=0UH,
bit_precision=32UH
}
}
}
Message 4...
H5O_msg_layout {
version=3UB,
layout=struct {
v3=struct {
layout_class=2UB,
properties=struct {
chunked=struct {
ndims=3UB,
idx_addr_raw=[223UB,1UB,0UB,0UB,0UB,0UB,0UB,0UB],
dim_size=[4U,4U,4U]
}
}
}
}
}
This shows the sort of machine-readable structure the pickles expose: dataspace, datatype, filter pipeline, and layout information are all decoded directly from the file.
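A few of these fields can be combined by hand. The Python sketch below assumes the convention from the HDF5 file format specification for version 3 chunked layouts, where the layout dimensionality is the dataset rank plus one and the last entry is the element size in bytes. That reading is consistent with ndims=2 in the dataspace, ndims=3 in the layout, and elm_size=4 in the datatype.

```python
# Values copied from the decoded layout and datatype messages.
idx_addr_raw = bytes([223, 1, 0, 0, 0, 0, 0, 0])
dim_size = [4, 4, 4]
elm_size = 4

# Little-endian address of the chunk index: 223 + 1 * 256.
idx_addr = int.from_bytes(idx_addr_raw, "little")

# Split the layout dimensions into the chunk shape and the element size.
chunk_dims, chunk_elm_size = dim_size[:-1], dim_size[-1]

# Bytes per chunk: product of the chunk shape times the element size.
chunk_bytes = chunk_elm_size
for d in chunk_dims:
    chunk_bytes *= d

print(idx_addr, chunk_dims, chunk_bytes)  # 479 [4, 4] 64
```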
You can also ask poke what each pickle defines:
.info type superblock
.info type ohdr
This is useful when extending the pickles or when you want to discover methods such as get_messages () directly from the REPL.
Poke maps are writable. To avoid modifying the sample file in the repository, make a copy first:
cp file.h5 file-edit.h5
POKE_LOAD_PATH=$PWD/pickles poke file-edit.h5
Then edit a scalar field through the mapped object header:
load common
load ohdr
var root = ohdr @ 48#B
root._ohdr.v2.timestamps.birth
root._ohdr.v2.timestamps.birth = 0U
root._ohdr.v2.timestamps.birth
Expected output:
(poke) root._ohdr.v2.timestamps.birth
1773447782U
(poke) root._ohdr.v2.timestamps.birth = 0U
(poke) root._ohdr.v2.timestamps.birth
0U
This demonstrates byte-level write-through via the mapped pickle types. It does not automatically update higher-level HDF5 consistency metadata, so for real edits you may also need to recompute dependent fields such as checksums.
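The recomputation step can be sketched generically in Python. The helper below is hypothetical: it rewrites the trailing 4-byte little-endian checksum of a structure using whatever checksum function it is given. The demo uses a toy checksum only so that the result is easy to follow; a real edit would pass a lookup3 implementation instead.

```python
def restamp(image: bytearray, start: int, size: int, checksum) -> int:
    """Recompute the trailing 4-byte checksum of the structure at [start, start + size)."""
    end = start + size
    # The checksum covers everything in the structure except its last 4 bytes.
    value = checksum(bytes(image[start:end - 4])) & 0xFFFFFFFF
    image[end - 4:end] = value.to_bytes(4, "little")
    return value


def toy_checksum(data: bytes) -> int:
    """Stand-in for illustration only. This is NOT lookup3."""
    return sum(data)


image = bytearray(range(16))
restamp(image, start=4, size=8, checksum=toy_checksum)
print(list(image[8:12]))  # [22, 0, 0, 0]
```

For the root object header of file-edit.h5, start would be 48. Judging by the dump shown earlier (signature, version, flags, four 4-byte timestamps, a 1-byte chunk size, 120 bytes of messages, and the checksum), size would be 147, but treat that as an assumption to confirm with root'size in poke.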
We can also build a minimal HDF5 file from scratch: a version 2 superblock followed by a version 2 root object header for the root group. This time we construct the metadata in a memory-backed IOS first, and only save it to disk at the end.
Start poke from the repository root without opening a file yet:
cd <THIS DIRECTORY>
POKE_LOAD_PATH=$PWD/pickles poke
At the (poke) prompt, load the helper pickle and create a fresh memory IOS:
load construct
load lookup3
.mem image
First construct the version 2 superblock value. The root object header will start at offset 48#B, and the final image size will be 179#B:
fun undef_addr = uint<8>[8]: { return uint<8>[8] (255); }
var sb = superblock_v2 { sizeof_offsets = 8UB, sizeof_lengths = 8UB, ext_addr_raw = undef_addr, eof_addr_raw = u64_to_bytes_le (179UL, 8), root_obj_addr_raw = u64_to_bytes_le (48UL, 8) }
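The numbers 48 and 179 follow from the sizes of the two structures. The Python sketch below spells out the arithmetic, assuming 8-byte offsets, an object header with flags = 0 (so no timestamps are stored), and a 1-byte chunk0_size field. The variable names are for illustration only.

```python
OFFSET_SIZE = 8  # sizeof_offsets in the superblock

# signature (8) + version, sizeof_offsets, sizeof_lengths, flags (4)
# + base, extension, end-of-file, root addresses + checksum (4)
superblock_size = 8 + 4 + 4 * OFFSET_SIZE + 4

# "OHDR" signature (4) + version (1) + flags (1) + chunk0_size (1)
ohdr_prefix = 4 + 1 + 1 + 1
chunk0_size = 120
ohdr_size = ohdr_prefix + chunk0_size + 4  # trailing checksum

root_addr = superblock_size
eof = root_addr + ohdr_size

print(superblock_size, ohdr_size, eof)     # 48 131 179
print(superblock_size - 4, ohdr_size - 4)  # 44 127
```

The last line gives the number of bytes covered by each checksum, since the 4-byte checksum field is excluded from its own input.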
Now stage the root-group messages in the memory IOS at offset 1024#B. That offset is arbitrary; we just use it as scratch space while building the object header chunk. The memory IOS starts zero-filled, so the 88 data bytes of the NIL message do not need any explicit initialization.
msg_prefix_v2 @ 1024#B = msg_prefix_v2 { msg_type = 2UB, msg_size = 18UH, msg_flags = 0UB }
H5O_msg_linfo @ 1028#B = H5O_msg_linfo { version = 0UB, flags = 0UB, fheap_addr_raw = undef_addr, name_bt2_addr_raw = undef_addr }
msg_prefix_v2 @ 1046#B = msg_prefix_v2 { msg_type = 10UB, msg_size = 2UH, msg_flags = 1UB }
H5O_msg_ginfo @ 1050#B = H5O_msg_ginfo { version = 0UB, flags = 0UB }
msg_prefix_v2 @ 1052#B = msg_prefix_v2 { msg_type = 0UB, msg_size = 88UH, msg_flags = 0UB }
var root = ohdr_v2 { flags = 0UB, chunk0_size = [120UB], msg_chunk = byte[120] @ 1024#B }
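The scratch offsets and the NIL message size used above can be derived rather than counted by hand. This Python sketch assumes a 4-byte message prefix and the body sizes given in the prefixes (18 for the link info message, 2 for the group info message); the names are hypothetical.

```python
PREFIX = 4          # bytes in a version 2 message prefix
SCRATCH = 1024      # arbitrary staging offset in the memory IOS
CHUNK0_SIZE = 120   # total size of the message chunk

bodies = [("linfo", 18), ("ginfo", 2)]

pos = SCRATCH
layout = []
for name, body_size in bodies:
    # Record where the prefix and the body of each message start.
    layout.append((name, pos, pos + PREFIX))
    pos += PREFIX + body_size

# The NIL message fills whatever is left of the chunk after its own prefix.
nil_prefix = pos
nil_size = SCRATCH + CHUNK0_SIZE - (nil_prefix + PREFIX)

print(layout)                # [('linfo', 1024, 1028), ('ginfo', 1046, 1050)]
print(nil_prefix, nil_size)  # 1052 88
```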
Serialize the typed values into the first 179 bytes of the memory IOS, compute the checksums, and save the result to disk:
superblock_v2 @ 0#B = sb
var sb_map = superblock_v2 @ 0#B
sb_map.chksum = lookup3_hashlittle(byte[44] @ 0#B, 0)
ohdr_v2 @ 48#B = root
var root_map = ohdr_v2 @ 48#B
root_map.chksum = lookup3_hashlittle(byte[127] @ 48#B, 0)
save :file "empty.h5" :size 179#B
Finally, map the image back using the parser pickles and verify it:
var sb2 = superblock @ 0#B
var root2 = ohdr @ 48#B
sb2.super_vers
bytes_to_off (sb2.super.v2_v3.root_obj_addr_raw)
lookup3_hashlittle(byte[44] @ 0#B, 0)
lookup3_u32_le(root2._ohdr.v2.chksum)
lookup3_hashlittle(byte[root2'size as offset<uint<64>,B> - 4UL#B] @ 48#B, 0)
root2.get_messages ()
Expected output snippet:
(poke) sb2.super_vers
2UB
(poke) bytes_to_off (sb2.super.v2_v3.root_obj_addr_raw)
48UL#B
(poke) lookup3_hashlittle(byte[44] @ 0#B, 0)
673867655U
(poke) lookup3_u32_le(root2._ohdr.v2.chksum)
2898835909U
(poke) lookup3_hashlittle(byte[root2'size as offset<uint<64>,B> - 4UL#B] @ 48#B, 0)
2898835909U
(poke) root2.get_messages ()
Message 0...
H5O_msg_linfo { ... }
Message 1...
H5O_msg_ginfo {
version=0UB,
flags=0UB
}
Message 2...
H5O_msg_nil {
}
At this point empty.h5 is a valid HDF5 file containing only the root group. For an external check, h5dump -pBH empty.h5 reports SUPERBLOCK_VERSION 2 and GROUP "/" {}.
This material is based upon work supported by the U.S. National Science Foundation under Federal Award No. 2534078. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.