Implement GEFF v1 spec compatibility #5

Merged
ksugar merged 27 commits into main from v1-spec-compat-impl on Jun 23, 2026

ksugar commented on Jun 22, 2026

Collaborator

Summary

  • Implements full GEFF v1 spec compliance, including node_props_metadata /
    edge_props_metadata, varlength properties, and the updated polygon storage path.
  • Adds PropMetadata and VarlengthProperty classes to support the v1 data model.
  • Migrates polygon storage from the legacy serialized_props/polygon/ layout to
    nodes/props/polygon/ as a varlength property; backward-compatible read fallback is
    retained for files written by earlier versions.
  • Replaces the fixed DEFAULT_CHUNK_SIZE = 1000 with computeFirstDimChunk(), which
    targets ~8 MiB per chunk (power-of-two on the first dimension), matching the Python
    reference implementation.
  • Writes Zarr arrays in little-endian byte order so Python / pandas can read the output
    on little-endian systems without a "Big-endian buffer not supported" error; also
    decompresses Blosc chunks before byte-swapping when compression is active.
  • Falls back to RawCompression when the native c-blosc library is absent.
  • Adds arbitrary props support (getProp / setProp / getProps) and varlength props
    support (getVarlengthProperty / setVarlengthProperty / getVarlengthProperties)
    to both GeffNode and GeffEdge; props are round-tripped through Zarr automatically.
  • Updates README to clarify Zarr Format 2-only support, corrects CITATION.cff metadata,
    and moves internal planning docs to doc/.
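
As a rough illustration of the chunk-size rule described in the summary, the sketch below picks a first-dimension chunk length that is a power of two and keeps one chunk at or under 8 MiB. The class and method names are hypothetical, and rounding down to a power of two is an assumption; the actual computeFirstDimChunk() in this branch and the Python reference implementation may differ in details such as clamping to the array length.

```java
final class ChunkSizeSketch {
    // Target size of a single chunk in bytes (8 MiB, as stated in the summary).
    static final long TARGET_CHUNK_BYTES = 8L * 1024 * 1024;

    /**
     * Hypothetical helper: returns a power-of-two chunk length for the first
     * dimension so that chunkLength * (bytes per row) <= TARGET_CHUNK_BYTES.
     * Rounding down is an assumption made for this sketch.
     */
    static int firstDimChunk(long[] shape, int bytesPerElement) {
        // Bytes occupied by one "row", i.e. one index along the first dimension.
        long rowBytes = bytesPerElement;
        for (int d = 1; d < shape.length; d++) {
            rowBytes *= Math.max(1L, shape[d]);
        }
        // Number of whole rows that fit in the target, never less than one.
        long rows = Math.max(1L, TARGET_CHUNK_BYTES / rowBytes);
        // Largest power of two that does not exceed that row count.
        return (int) Long.highestOneBit(rows);
    }
}
```

For an N×3 float64 array, one row is 24 bytes, 349525 whole rows fit in 8 MiB, and the sketch returns 262144 (2^18).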

Validation

  • All 42 unit tests pass (mvn test).
  • Cross-language round-trip tests pass (cd cross-language-tests && uv run run_tests.py).
  • Files written by this branch are readable by the Python geff reference
    implementation (geff.is_geff_dataset() returns True).
  • Files written by the Python reference implementation are readable by this branch
    (nodes, edges, polygon, and varlength props are restored correctly).
  • Backward-compatible reads succeed for files with the legacy serialized_props/polygon/ layout.

cmalinmayor and others added 27 commits December 3, 2025 22:08
- GeffNode: assign DEFAULT_COVARIANCE_3D (not 2D) to covariance3d in readFromN5
- GeffNode: fix off-by-one in polygon slice boundary check (< -> <=)
- GeffUtils.readVarlengthProperty: cast missing array to byte[] instead of
  boolean[], since it is stored as UINT8; convert to boolean[] explicitly
- GeffUtils.readVarlengthProperty: pass property name (from PropMetadata
  identifier or last path segment) instead of the full propPath to
  VarlengthProperty constructor
- GeffUtils.writeOffsetsArray: fix column-major stride so the flat layout
  matches what FlattenedInts.at() expects (j + numColumns*i, not i + numNodes*j)
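
The stride fix in the last bullet can be illustrated with a small sketch. The class below is hypothetical and is not the project's FlattenedInts; it only shows a flat layout in which element (i, j) is stored at j + numColumns*i, the index the commit message says the reader expects.

```java
final class FlatOffsetsSketch {
    /** Flattens a numRows x numColumns table so that (i, j) lands at j + numColumns * i. */
    static long[] flatten(long[][] table, int numColumns) {
        long[] flat = new long[table.length * numColumns];
        for (int i = 0; i < table.length; i++) {
            for (int j = 0; j < numColumns; j++) {
                flat[j + numColumns * i] = table[i][j];
            }
        }
        return flat;
    }

    /** Reads element (i, j) back using the same index formula as flatten(). */
    static long at(long[] flat, int numColumns, int i, int j) {
        return flat[j + numColumns * i];
    }
}
```

For the table {{0, 3}, {3, 5}, {5, 9}} this gives the flat array [0, 3, 3, 5, 5, 9]. Writing with i + numNodes*j instead would give [0, 3, 5, 3, 5, 9], so a reader using j + numColumns*i would fetch 5 for element (1, 0) where 3 is expected.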
readFromN5 read covariance2d and covariance3d from disk but discarded the
values, always falling back to defaults. Use the read FlattenedDoubles
instead.

Add testCovariance2dRoundTrip and testCovariance3dRoundTrip to cover
write → read for both fields with non-default values.

Creates a GEFF with Python, injects covariance2d (N×4) and covariance3d
(N×6) arrays plus their node_props_metadata entries, runs the Java
round-trip, and verifies the values are preserved within floating-point
tolerance.

- writeOffsetsArray: use UINT64 (was INT64) for varlength values array
- writeMissingArray: patch .zarray dtype to |b1 (bool) after writing UINT8
- writeVarlengthProperty: add declaredDtype param so data array uses the
  dtype declared in metadata (e.g. uint64 stays uint64, not int64)
- GeffNode.writeToN5: remove unsupported props (string dtype) from metadata
  after writing nodes so validate_structure does not find missing prop groups
- RoundTripGeff: write metadata after nodes/edges so all metadata
  modifications (removals, varlength additions) are captured

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
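
The .zarray dtype patch mentioned for writeMissingArray could look roughly like the sketch below. The helper name is hypothetical, and the regex-based rewrite is used only to keep the example dependency-free; the actual code in this branch may well use a JSON library instead.

```java
final class ZarrayDtypeSketch {
    /**
     * Hypothetical helper: returns the .zarray JSON text with its "dtype"
     * value replaced, e.g. "|u1" (uint8) -> "|b1" (bool). Assumes the JSON
     * contains exactly one top-level "dtype" string entry.
     */
    static String withDtype(String zarrayJson, String newDtype) {
        String replacement = "\"dtype\": \"" + newDtype + "\"";
        return zarrayJson.replaceFirst(
                "\"dtype\"\\s*:\\s*\"[^\"]*\"",
                java.util.regex.Matcher.quoteReplacement(replacement));
    }
}
```

Only the metadata changes here; the chunk bytes written as UINT8 are left untouched, since both dtypes occupy one byte per element.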
…tion

When GeffNode/GeffEdge write all properties (because no nodePropsMetadata /
edgePropsMetadata is provided), the metadata fields were never populated,
causing Python Pydantic validation to fail since node_props_metadata and
edge_props_metadata are required fields.

- GeffNode.writeToN5: register standard node props (t, x, y, z, color,
  track_id, radius, covariance2d, covariance3d) in metadata when writeAllProps
- GeffEdge.writeToN5: register distance and score in metadata when writeAllProps
- GeffMetadata.writeToN5: always write both metadata maps (defaults to {})
  so the required Pydantic fields are always present in .zattrs
…tibility

N5ZarrWriter serializes numeric arrays in big-endian format (">i4", ">f8"),
which causes a "Big-endian buffer not supported on little-endian compiler"
error in pandas (e.g. drop_duplicates on edge ids).

GeffUtils.patchZarrLittleEndian: after writing, walks all .zarray files under
the given group path, byte-swaps chunk file data in place, and updates the
dtype from ">" to "<". Only processes uncompressed (null compressor) arrays
since byte-swapping compressed data requires decompression first.

Also create edges/props group unconditionally so the zarr structure is valid
even when there are no edge properties (Python geff always writes this group).

All 42 Java tests and all 5 cross-language round-trip tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
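
A minimal sketch of the two operations described for patchZarrLittleEndian, assuming an uncompressed chunk held in memory as a byte array. The names are hypothetical and this is not the code from the branch; it only shows reversing each fixed-width element and flipping the dtype prefix from ">" to "<".

```java
final class ByteSwapSketch {
    /** Reverses the byte order of every elementSize-wide element in place. */
    static void swapInPlace(byte[] chunk, int elementSize) {
        if (elementSize <= 0 || chunk.length % elementSize != 0) {
            throw new IllegalArgumentException("chunk length must be a multiple of elementSize");
        }
        for (int start = 0; start < chunk.length; start += elementSize) {
            // Swap bytes from both ends of the element towards its middle.
            for (int lo = start, hi = start + elementSize - 1; lo < hi; lo++, hi--) {
                byte tmp = chunk[lo];
                chunk[lo] = chunk[hi];
                chunk[hi] = tmp;
            }
        }
    }

    /** Rewrites a big-endian Zarr dtype such as ">i4" to "<i4"; other dtypes pass through. */
    static String toLittleEndianDtype(String dtype) {
        return dtype.startsWith(">") ? "<" + dtype.substring(1) : dtype;
    }
}
```

Single-byte dtypes such as "|u1" have no byte order, so they are returned unchanged and their chunks need no swapping.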
N5ZarrWriter's createDataset() is a no-op when the dataset already
exists, so the blosc compressor entry in .zarray was never cleared.
Subsequent chunk writes therefore still used blosc, and byte-swapping
blosc-compressed bytes produced garbage data.

Fix uses a three-step approach for compressed arrays:
1. Read decompressed data via a fresh N5ZarrWriter (blosc still active)
2. Directly patch .zarray to set "compressor": null before any new write
3. Open a second fresh writer (which now sees compressor:null) to write
   the raw big-endian chunks, then byte-swap to little-endian as before

This unblocks the "Big-endian buffer not supported" pandas error when
intracktive reads Java-exported GEFF files in production environments
where blosc compression is available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
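
The ordering constraint behind this fix (decompress first, then byte-swap the raw bytes) can be shown in isolation. The sketch below uses the JDK's Deflater/Inflater purely as a stand-in for Blosc, which is not part of the JDK, and all names are hypothetical; it does not use N5ZarrWriter or reproduce the three steps above.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DecompressThenSwapSketch {
    /** Compresses raw bytes with DEFLATE (stand-in for the Blosc compressor). */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses first, then reverses each elementSize-wide element. */
    static byte[] inflateThenSwap(byte[] compressed, int elementSize) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        byte[] raw = out.toByteArray();
        // Swapping is only meaningful on the decompressed element bytes.
        for (int start = 0; start + elementSize <= raw.length; start += elementSize) {
            for (int lo = start, hi = start + elementSize - 1; lo < hi; lo++, hi--) {
                byte tmp = raw[lo];
                raw[lo] = raw[hi];
                raw[hi] = tmp;
            }
        }
        return raw;
    }
}
```

Applying the swap loop directly to the compressed bytes would reorder the compressor's own stream rather than the array elements, which is the garbage-data failure described above.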
ksugar requested a review from tinevez on June 23, 2026 07:44
ksugar merged commit 76e3f99 into main on Jun 23, 2026
1 check passed
3 participants