This repository was archived by the owner on May 1, 2026. It is now read-only.

CompressedSet: deferred from v0.1, scope + reopening criteria #7

Description

@justinjoy

Status

CompressedSet from upstream segmentio/ksuid (set.go, 343 LOC, 17 functions) is deliberately not ported in libksuid v0.1. This is a scope decision, not a technical limitation -- the port is straightforward but the trade-off did not justify the surface in the first release.

What CompressedSet is

Upstream Go feature for packing many KSUIDs into a small byte slice. Sorted KSUIDs share leading bytes (timestamp + high payload), and consecutive payload values often differ by 1; the format exploits that with a 2-bit tag byte plus a varint delta:

Tag                  Meaning                               Approximate cost
rawKSUID (0b00)      first / restart entry                 20 bytes
timeDelta (0b01)     new timestamp + raw payload           1 + N + 16 bytes
payloadDelta (0b10)  same timestamp, payload += delta128   1 + N bytes (typically 2-5)
payloadRange (0b11)  same timestamp, M consecutive +1s     1 + N bytes for M ids

Source: /home/joykim/git/semantic-reasoning/ksuid/set.go:67-244. The varint helpers (varintLength32/64/128, appendVarint32/64/128, varint32/64/128) account for ~140 of the 343 lines.
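
To make the tag layout concrete, here is a minimal sketch of splitting a header byte into its tag and varint length. It assumes the 2-bit tag sits in the top two bits of the byte and the low six bits carry the varint length N; that layout and every identifier below (KSUID_SET_TAG_*, set_tag, set_varint_len) are illustrative assumptions, not existing libksuid API, and should be checked against upstream set.go before anything relies on them.

```c
#include <stdint.h>

/* ASSUMPTION: tag in the top two bits, varint length in the low six.
 * All names are hypothetical; nothing here exists in libksuid today. */
enum {
    KSUID_SET_TAG_RAW           = 0x00, /* 0b00 << 6 */
    KSUID_SET_TAG_TIME_DELTA    = 0x40, /* 0b01 << 6 */
    KSUID_SET_TAG_PAYLOAD_DELTA = 0x80, /* 0b10 << 6 */
    KSUID_SET_TAG_PAYLOAD_RANGE = 0xC0, /* 0b11 << 6 */
    KSUID_SET_TAG_MASK          = 0xC0,
    KSUID_SET_LEN_MASK          = 0x3F
};

/* Extract the 2-bit tag, still positioned in the top two bits. */
static inline uint8_t set_tag(uint8_t header)
{
    return (uint8_t)(header & KSUID_SET_TAG_MASK);
}

/* Extract the byte count of the varint that follows the header. */
static inline uint8_t set_varint_len(uint8_t header)
{
    return (uint8_t)(header & KSUID_SET_LEN_MASK);
}

/* Human-readable tag name, handy in test failure messages. */
static const char *set_tag_name(uint8_t header)
{
    switch (set_tag(header)) {
    case KSUID_SET_TAG_RAW:           return "rawKSUID";
    case KSUID_SET_TAG_TIME_DELTA:    return "timeDelta";
    case KSUID_SET_TAG_PAYLOAD_DELTA: return "payloadDelta";
    default:                          return "payloadRange";
    }
}
```

Under this assumed layout, a header byte of 0x83 splits into the payloadDelta tag followed by a 3-byte varint.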

Realistic compression on time-clustered streams is roughly 3-5 bytes per KSUID after the first, vs. 20 bytes raw -- a 4x to nearly 7x reduction (20/5 to 20/3) that matters when you are persisting millions of IDs to disk or sending them over the network.

Why v0.1 left it out

[Phase 1 Critic review] explicitly recommended cutting it:

"set.go packs varint deltas, time deltas, and range/single discriminators. Any off-by-one in varint reading or wrong byte-tag enum (rawKSUID, timeDelta, payloadDelta, ...) silently produces garbage KSUIDs on iteration. Half-baked port = wire incompatibility nobody notices for months."

The decision rests on three things, in order:

  1. Silent-corruption risk. A wrong tag mask, a one-byte varint length error, or an off-by-one in the range-length scan all produce decoded KSUIDs that look fine until they collide or sort wrong. Detecting that in CI requires a corpus of upstream-Go-encoded blobs and a differential test; we do not have that infrastructure yet.
  2. Niche utility. The 80% of users who call ksuid_new / ksuid_parse / ksuid_format do not care. CompressedSet only earns its keep at scale (>=10^4 IDs in one place, e.g. a database snapshot or queue dump).
  3. Footprint. ~250 LOC across encode + decode + iterator + varint helpers, plus the tag/varint differential tests. That is a sizeable addition relative to libksuid's current ~18 KB stripped binary.
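
As an illustration of the first point, the sketch below shows how a one-byte varint length error yields a value that is wrong but not visibly invalid. It assumes deltas are stored as minimal-length big-endian integers; read_varint64 is a hypothetical helper written for this example, not a libksuid function.

```c
#include <stddef.h>
#include <stdint.h>

/* Read an n-byte big-endian unsigned integer, n in 1..8.
 * ASSUMPTION: big-endian, minimal-length encoding; hypothetical helper. */
static uint64_t read_varint64(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i]; /* shift in one byte at a time, MSB first */
    return v;
}
```

Given the bytes 01 02 03, reading with the correct length 3 gives 0x010203, while reading with length 2 gives 0x0102. Both are valid deltas, so the decoder keeps producing well-formed KSUIDs and nothing fails until the results are compared against a known-good encoding.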

Reopening criteria

Land CompressedSet as soon as any one of the following happens:

  • a concrete downstream consumer requests it on this issue tracker;
  • libksuid grows a benchmark / fixtures corpus that captures Go-generated Compress outputs and we can pin C decode to byte-for-byte parity (the corpus is the missing piece, not the algorithm);
  • libksuid proposes wire compatibility with another ksuid implementation that uses the same packed format.

Until then the answer is "use upstream Go ksuid for compressed-set workloads, or pack/unpack at the application layer."

Implementation sketch (when we do port)

  1. New TU libksuid/set.c + private header libksuid/set.h (or public if exposed).
  2. Public surface: opaque ksuid_set_t builder + iterator, plus ksuid_set_compress(const ksuid_t *ids, size_t n, uint8_t **out, size_t *out_len) and matching iterator (ksuid_set_iter_init, ksuid_set_iter_next).
  3. Reuse libksuid/uint128.h (currently removed -- restore alongside) for the 128-bit payload-delta arithmetic.
  4. Tests:
    • varint round-trip for every byte length 1..16;
    • tag enum exhaustive: every (current_state, next_state) pair drives the right encode tag;
    • differential corpus: a Go program emits 10^4 KSUIDs with varied (timestamp, payload-delta) clusters, and libksuid decodes them with byte-for-byte parity;
    • 65 KSUID payloadRange boundary (just-fits vs just-overflows the varint length).
  5. Meson option -Dcompressed_set=true (default false) until the differential corpus is in CI; flip default after one stable release.
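
The varint round-trip test from step 4 could look roughly like the sketch below. It assumes a minimal-length big-endian encoding with at least one byte emitted for zero; the u128 struct is a stand-in for whatever libksuid/uint128.h provides once restored, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the type in libksuid/uint128.h (ASSUMPTION). */
typedef struct { uint64_t hi, lo; } u128;

/* Serialize v as 16 big-endian bytes. */
static void store_be128(uint8_t out[16], u128 v)
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)(v.hi >> (56 - 8 * i));
        out[8 + i] = (uint8_t)(v.lo >> (56 - 8 * i));
    }
}

/* Write v without leading zero bytes; returns the byte count (1..16). */
static size_t put_varint128(uint8_t *dst, u128 v)
{
    uint8_t be[16];
    size_t skip = 0;
    store_be128(be, v);
    while (skip < 15 && be[skip] == 0)
        skip++;                       /* always keep at least one byte */
    memcpy(dst, be + skip, 16 - skip);
    return 16 - skip;
}

/* Read an n-byte big-endian value, n in 1..16. */
static u128 get_varint128(const uint8_t *src, size_t n)
{
    u128 v = {0, 0};
    for (size_t i = 0; i < n; i++) {
        v.hi = (v.hi << 8) | (v.lo >> 56); /* carry top byte of lo into hi */
        v.lo = (v.lo << 8) | src[i];
    }
    return v;
}

/* Returns 1 if the smallest value needing exactly len bytes round-trips
 * with the expected length for every len in 1..16, else 0. */
static int varint128_roundtrip_ok(void)
{
    for (size_t len = 1; len <= 16; len++) {
        unsigned shift = 8u * (unsigned)(len - 1);
        u128 v = {0, 0};
        uint8_t buf[16];

        if (shift < 64)
            v.lo = (uint64_t)1 << shift;
        else
            v.hi = (uint64_t)1 << (shift - 64);

        size_t n = put_varint128(buf, v);
        u128 back = get_varint128(buf, n);
        if (n != len || back.hi != v.hi || back.lo != v.lo)
            return 0;
    }
    return 1;
}
```

This only probes the lower boundary of each length. A fuller test would also cover the largest value per length and zero, and it would not replace the differential corpus, since a self-consistent encoder and decoder can still disagree with upstream.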

Affected files / references

  • Upstream: set.go:1-343 (entire file)
  • Architect plan (Phase 1, persona analysis): proposed libksuid/set.c as a separate TU
  • Critic plan (Phase 1): "Cut entirely from v1. Header ksuid_set.h stub returning KSUID_ERR_UNSUPPORTED."
  • Related: KSUID_ERR_UNSUPPORTED in libksuid/ksuid.h is not yet defined -- if a stub is desired before a full port, it should land in a small follow-up commit.
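
If the stub route is taken first, it could be as small as the sketch below. The numeric value of KSUID_ERR_UNSUPPORTED, the int return type, and the ksuid_t stand-in are all assumptions made so the example compiles on its own; the real definitions would come from libksuid/ksuid.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in so the sketch is self-contained (ASSUMPTION: 20-byte struct). */
typedef struct { uint8_t b[20]; } ksuid_t;

/* Hypothetical value: this error code is not defined in ksuid.h yet. */
#define KSUID_ERR_UNSUPPORTED (-100)

/* Stub with the signature from the implementation sketch. It never
 * allocates; it clears the outputs so callers cannot read stale pointers. */
int ksuid_set_compress(const ksuid_t *ids, size_t n,
                       uint8_t **out, size_t *out_len)
{
    (void)ids;
    (void)n;
    if (out)
        *out = NULL;
    if (out_len)
        *out_len = 0;
    return KSUID_ERR_UNSUPPORTED;
}
```

Clearing the outputs is a design choice rather than a requirement: it keeps a caller that ignores the return code from freeing or reading an uninitialized pointer.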

Labels (if/when configured)

enhancement, wire-format, feature-parity, not-blocking-v1
