Project-specific guidance for AI assistants and contributors. This file
adds to the global development guidelines (TDD, Tidy First,
Conventional Commits, uv/just/ruff/pyright/semgrep, .yaml
extension, ASCII-only diagrams). It does not repeat them — read those
first, then the rules here win on any project-specific point.
Authoritative documents, in precedence order:
- docs/spec.md: the specification. Behaviour, contracts, invariants.
- docs/intent.md: implementation brief (milestones, tech choices, quality bar).
- global guidelines: coding standards, commit discipline.
If code and docs/spec.md disagree, the spec wins; propose a spec PR
before changing behaviour.
A Python library giving a neutral, lossless Internal Representation (IR) for image-based table-recognition datasets, plus a registry of codecs that translate between the IR and public dataset formats (PubTabNet, FinTabNet, OTSL, TableFormer, ...). The headline constraint: the core has zero third-party runtime dependencies.
- Zero-dependency core. src/tablecodec/{ir,_invariants,validate,io,loss}.py and src/tablecodec/codecs/{_base,__init__,_htmltable,pubtabnet,otsl,fintabnet,tableformer,...}.py import stdlib only. This is enforced by the semgrep rule .semgrep/rules/core-deps/tablecodec-no-third-party-imports-in-core.yaml. When you add a core module, add it to that rule's paths.include list.
- Optional features are extras. Two modules are permitted third-party imports and are excluded from the semgrep core list: cli.py (click, [cli] extra) and teds.py (apted/lxml, [teds] extra; the TEDS metric, ADR 0011). Neither is imported by tablecodec/__init__, so import tablecodec must work on a bare interpreter (the pip install -e . CI job guards this). (loss.py is stdlib-only: static, data-free analysis over codec lossy_* declarations. It therefore stays IN the core list and is enforced.) The tablecodec-docling bridge (apted/lxml + docling-core) is a separate package under packages/, not part of this core package at all (ADR 0013).
- Streaming, not slurping. read yields lazily; never f.read()/f.readlines() a whole dataset. The semgrep rule .semgrep/rules/streaming/tablecodec-no-full-file-read.yaml enforces this in io.py and codecs/. SPEC §10 requires constant memory.
- IR is immutable. BBox, GridCell, TableSample are @dataclass(frozen=True, slots=True), hashable, and safe to send across process boundaries.
just ci = lint type test semgrep semgrep-test docs-check. Everything must
be green before commit. just check is an alias for just ci (the prek
pre-push hook runs it). Specifically:
- just lint: ruff check, ruff format --check (config in pyproject.toml), and markdownlint-cli2 over tracked markdown (.markdownlint-cli2.yaml ignores the generated tables; docs/adr/ is excluded as immutable).
- just type: pyright strict (pyproject.toml [tool.pyright]). Zero errors.
- just test: pytest. Benchmarks are marked benchmark and excluded by default; run with just bench.
- just semgrep: scan src/ with the core meta-rules (.semgrep/rules/).
- just semgrep-test: semgrep test .semgrep/rules/, which verifies each rule against its co-located fixture (rule correctness, never against real code).
- just docs-check: regenerates docs/format_support.md and docs/loss_matrix.md and fails if they differ from what's committed.
Run just fmt to auto-fix (ruff + markdownlint --fix), just docs to
regenerate the tables. Git hooks are prek-managed: just hooks installs the
pre-commit + pre-push stages from .pre-commit-config.yaml.
just ci covers the core package only (and runs test/type with
--extra teds so the optional TEDS tests run rather than skip). The in-repo
tablecodec-docling bridge (packages/, ADR 0013) has its own gate
just docling-ci; just ci-all runs both. Touching packages/ → run
just docling-ci (or just ci-all).
Most new work is "add codec X". The established recipe:
- If X is an HTML-token format (PubTabNet-like), reuse codecs/_htmltable.py: parse_html_table(payload, *, id_field=, drop_bbox=), serialize_html_table(sample, *, id_field=, include_bbox=), sniff_html_table(source, *, require_no_bbox=, require_all_bbox=, require_field=). Do not copy the parser; extend the knobs.
- If X is a different token language (like OTSL), write a fresh parser, but derive the algorithm from the source paper/spec. Never copy upstream reference code verbatim (intent.md §6; e.g. IBM's otsl.py is off-limits as a copy source).
- Codec class is @dataclass(frozen=True, slots=True) implementing the Codec Protocol (codecs/_base.py): name, spec_version, media_type (declared as @property in the Protocol so frozen dataclass attrs satisfy it), read, write, lossy_read, lossy_write, and a sniff delegate for codecs.detect.
- lossy_read/lossy_write MUST be honest: a round-trip test and the analyze_loss matrix depend on them. Auxiliary fields whose loss is "structure-preserving" are exactly {bbox, role, extras}; losing anything else is "lossy".
- TDD: write tests/codecs/test_X.py first (identity, read variants, round-trip, lossy declarations, sniff discrimination), then implement. Add minimal synthetic fixtures under tests/fixtures/X/; no borrowed upstream data.
- Register the codec in both doc generators (scripts/gen_format_support.py, scripts/gen_loss_matrix.py) and run just docs.
- Add the new core module path to the paths.include list of .semgrep/rules/core-deps/tablecodec-no-third-party-imports-in-core.yaml.
- Patch-bump the version within 0.0.x (one codec ≈ one patch bump: pyproject.toml + src/tablecodec/__init__.py), add a CHANGELOG [0.0.N] section, update the compare/tag links.
- Registry isolation in tests. Any test that registers codecs must bookend with codecs._snapshot()/codecs._restore(saved) (see the fixture in tests/codecs/test_registry.py). Otherwise it leaks into sibling tests.
- tests/codecs/ has no __init__.py. Adding one makes pytest import it as a package named codecs, shadowing the stdlib codecs module and breaking collection. Keep it absent.
- TableSample.__hash__ excludes extras on purpose (Mapping is not hashable). __eq__ still considers it. Don't "fix" this.
- Round-trip safety is tested via copy.deepcopy, not by importing the stdlib pickle module into the test tree (keeps the supply-chain surface clean). deepcopy exercises the same __reduce_ex__ protocol the IR must support.
- pyright strict is picky. Common fixes: type field(default_factory=...) with a named helper (not a bare dict/list); narrow json.loads results with cast("dict[str, Any]", x); give inline test payloads an explicit dict[str, Any] annotation.
- Lint complexity caps. ruff enforces mccabe (C901) and PLR0911/PLR0913. When a function trips them, extract a helper rather than suppressing.
- CI installs [dev,cli]. pyright must resolve click to type-check cli.py; the matrix install includes the cli extra. The separate pip install -e . job verifies the core still installs with no extras.
conformance/ holds an in-repo copy of the SPEC §11 corpus (manifest +
JSON Schema + samples + hand-authored expected-IR). This is a
temporary deviation from SPEC §11 (which mandates a separate
vendor-neutral repo) recorded in
docs/adr/0001-conformance-suite-in-repo-temporarily.md; it must be
extracted before v1.0. Expected-IR files are authored independently of
the codecs so the suite catches read-path regressions.
- Staying in 0.0.x for now (no public PyPI release yet). Each codec is a patch bump.
- The release workflow lives at .github/workflows/release.yaml and fires on a v* tag, publishing via PyPI Trusted Publishing (OIDC). It is inert until the PyPI-side setup is done; the runbook is in the gitignored private/PYPI_RELEASE_STEPS.md.
- docs/spec.md: spec (source of truth; current contract only).
- docs/intent.md: brief and the single home for all future/roadmap work (§8 "Future work"); spec §17 and handover point here, not the reverse.
- docs/handover.md: current session state (read for "where are we"); it references intent §8 for the roadmap rather than duplicating it.
- docs/adr/: decision history.
- src/tablecodec/: the library (see "Non-negotiable invariants"). teds.py is the core-external TEDS metric ([teds], ADR 0011).
- packages/tablecodec-docling/: the docling bridge codec, an in-repo monorepo member with its own pyproject/src/tests and its own version (temporary; extract before publish, ADR 0013). Run it via just docling-ci.
- tests/: test_*.py at root, codec tests in tests/codecs/, fixtures in tests/fixtures/<codec>/, hypothesis strategies in tests/strategies.py, benchmarks in tests/benchmarks/.
- scripts/gen_*.py: doc generators wired into just docs.