Status: Draft — spec document v0.1.0 (this is the specification's own
version; the tablecodec package is independently versioned in the 0.0.x
series — see the package metadata / CHANGELOG for its current version).
Last updated: 2026-05-29
License of this document: CC BY 4.0
tablecodec is a Python library that provides a neutral Internal Representation (IR) for image-based table-recognition datasets and a registry-based Codec layer that translates between this IR and the fragmented landscape of public dataset formats (PubTabNet, FinTabNet, TableBank, PubTables-1M, OTSL, TableFormer Format, DocTags-tables, and others).
The library has a stdlib-only core. Heavier features (TEDS evaluation, HTML parsing, CLI) are opt-in extras.
As of mid-2026, table-structure datasets are published in incompatible formats:
- HTML-token formats: PubTabNet 1.x / 2.0, FinTabNet (original), TableBank, SciTSR.
- Sequence-language formats: OTSL (IBM, ICDAR 2023), DocTags (IBM Granite-Docling, January 2026).
- Object-detection formats: PubTables-1M (Microsoft).
- Augmented derivatives: TableFormer Format (IBM internal), FinTabNet_OTSL, MUSTARD (SPRINT, March 2025).
Every major training pipeline (PaddleOCR, Docling, MTL-TabNet, UniTable, SPRINT) reinvents preprocessing scripts. There is no shared abstraction. docling-core (the closest existing library) is document-wide, Pydantic-bound, and ships heavy dependencies, making it unsuitable for dataset-only workflows, large-scale streaming validation, or environments where install footprint matters.
tablecodec fills exactly this gap, and only this gap.
- Provide a lossless, neutral IR that can represent every cited format without privileging any.
- Provide a Codec registry that allows third parties to add new formats without modifying the core.
- Stream-friendly I/O: handle JSONL files with 500k+ samples without loading all into memory.
- Self-declared loss analysis: every codec must state what information it loses on read / write.
- Conformance test corpus published as a separate, vendor-neutral artifact.
- Stable API once 1.0 is reached. Breaking changes require a major version bump.
- Zero third-party dependencies in the core package.
- Model inference, training, or any GPU code.
- Image manipulation (no Pillow, OpenCV, numpy in core).
- Full document representation. `tablecodec` only handles tables, not entire pages or documents. `docling-core` is the right tool for whole-document workflows.
- Dataset download or hosting. Hugging Face Datasets and the official repositories serve that role.
- An opinion on which format is "best." All supported formats are first-class.
| Term | Definition |
|---|---|
| IR | The internal representation defined in §5. |
| Codec | A reader + writer pair for one external format. |
| Sample | One annotated table image with its structural metadata. |
| Profile | A named bundle of validation rules tied to a specific format version (e.g. pubtabnet-2.0.0). |
| Conformance | Verifiable behavioral compliance with this specification, established by passing the published test corpus. |
The IR is intentionally a 2D grid model, not HTML, not OTSL, and not DocTags. The grid model:
- Is the smallest representation that can hold every cited format losslessly.
- Makes structural invariants directly checkable (coverage, span consistency).
- Maps cleanly to and from every token-language format published to date.
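As an illustration of how a token-language format can map onto a grid, here is a minimal sketch that turns an OTSL token sequence into (row, col, rowspan, colspan) spans. It assumes the token names used by Docling's OTSL tooling (`fcel`, `ecel`, `lcel`, `ucel`, `xcel`, `nl`) and a rectangular table; `Span` and `otsl_to_spans` are hypothetical names, not part of the tablecodec API.

```python
from typing import NamedTuple


class Span(NamedTuple):
    row: int
    col: int
    rowspan: int
    colspan: int


def otsl_to_spans(tokens: list[str]) -> list[Span]:
    """Convert a flat OTSL token sequence into grid spans (sketch)."""
    # Split the flat sequence into rows at each "nl" token.
    rows: list[list[str]] = [[]]
    for tok in tokens:
        if tok == "nl":
            rows.append([])
        else:
            rows[-1].append(tok)
    rows = [r for r in rows if r]  # drop the empty tail after the last "nl"

    # OTSL assumes a rectangular token matrix; reject ragged input.
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have different lengths")

    spans: list[Span] = []
    for r, row in enumerate(rows):
        for c, tok in enumerate(row):
            if tok not in ("fcel", "ecel"):
                continue  # lcel / ucel / xcel belong to an earlier origin cell
            # Each "lcel" to the right extends the column span.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "lcel":
                colspan += 1
            # Each "ucel" below, in the same column, extends the row span.
            rowspan = 1
            while r + rowspan < len(rows) and rows[r + rowspan][c] == "ucel":
                rowspan += 1
            spans.append(Span(r, c, rowspan, colspan))
    return spans
```

The sketch only recovers structure; cell content and bounding boxes would be attached in a separate step.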
All types are immutable, hashable, and defined using dataclasses (frozen=True, slots=True). No Pydantic dependency.
    from collections.abc import Mapping
    from dataclasses import dataclass, field
    from typing import Literal

    BBox = tuple[int, int, int, int]  # (x0, y0, x1, y1), absolute pixels

    @dataclass(frozen=True, slots=True)
    class GridCell:
        row: int  # zero-indexed
        col: int  # zero-indexed
        rowspan: int = 1  # must be >= 1
        colspan: int = 1  # must be >= 1
        tokens: tuple[str, ...] = ()  # cell content as ordered tokens
        bbox: BBox | None = None  # absent when content is empty OR format omits it
        role: Literal["header", "body"] = "body"

    @dataclass(frozen=True, slots=True)
    class TableSample:
        filename: str
        nrows: int  # logical row count
        ncols: int  # logical column count
        cells: tuple[GridCell, ...]  # ordered top-to-bottom, left-to-right
        split: Literal["train", "val", "test"] | None = None
        imgid: int | None = None
        image_width: int | None = None  # source image width in px; None if absent
        image_height: int | None = None  # source image height in px; None if absent
        extras: Mapping[str, object] = field(default_factory=dict)

image_width / image_height are sample-level metadata (peers of
filename / imgid), not table content: they describe the source image, not
the grid. They are optional — most token formats omit them — and back the
strict profile's bbox-in-image cross-check (§8). Because no codec carries
them today, they are absent from every codec's loss declaration (§9): a field
no codec reads cannot be dropped, so None round-trips losslessly. A future
codec that populates them from source (e.g. PubTables-1M's PASCAL VOC <size>)
and writes to a format that cannot store them would declare them in
lossy_write.
A TableSample is valid when all of the following hold:
| ID | Invariant |
|---|---|
| I-01 | nrows >= 1 and ncols >= 1. |
| I-02 | For every cell, 0 <= row < nrows and 0 <= col < ncols. |
| I-03 | For every cell, row + rowspan <= nrows and col + colspan <= ncols. |
| I-04 | The union of cell footprints (using rowspan / colspan) exactly covers the nrows × ncols grid. No overlaps, no gaps. |
| I-05 | When bbox is set on a content-bearing cell: bbox[0] < bbox[2] and bbox[1] < bbox[3]. A bbox on an empty cell is a placeholder region and is not geometry-checked. A cell is empty when its tokens, concatenated, contain no non-whitespace character — i.e. "".join(tokens).strip() == "". This covers tokens == (), a lone empty-string token ("",), and whitespace-only tokens (" ",). |
| I-06 | Header cells form a contiguous top-region of the grid (no header rows below body rows). |
| I-07 | tokens may be empty (empty cell), but the tuple itself is never None. |
I-05 guards the geometry of a box that localizes content. An empty
cell localizes nothing, and source datasets routinely assign zero-area
placeholder boxes to empty cells (e.g. SynthTabNet, where ~45% of sampled
tables carry such boxes), so an empty cell's bbox geometry is out of
scope for I-05. Codecs still read and keep the bbox faithfully — it
remains on the IR and is preserved on round-trip; only the geometry check
is skipped for empty cells. (Profiles that require bbox presence —
tableformer, pubtabnet-2.0 — are unaffected: a placeholder box is
still present.)
"Empty" is decided by content, not by token count: a cell whose tokens
concatenate to only whitespace ("".join(tokens).strip() == "") localizes
nothing and is treated as empty. A cell whose tokens contain non-whitespace
characters is content-bearing even if those characters are markup (e.g.
("<sup>", " ", "</sup>")): the core IR does not model HTML semantics, so it
cannot know <sup> carries no glyph, and treating such tokens as content is
the IR-neutral choice. See ADR 0010 (which refines ADR 0007).
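The emptiness rule and the I-05 geometry check are small enough to state directly in code. This sketch uses the predicate given in I-05 verbatim; the function names `is_empty` and `bbox_geometry_ok` are hypothetical.

```python
from typing import Optional

BBox = tuple[int, int, int, int]  # (x0, y0, x1, y1)


def is_empty(tokens: tuple[str, ...]) -> bool:
    """A cell is empty when its concatenated tokens hold no non-whitespace."""
    return "".join(tokens).strip() == ""


def bbox_geometry_ok(tokens: tuple[str, ...], bbox: Optional[BBox]) -> bool:
    """I-05: geometry is checked only for a bbox on a content-bearing cell."""
    if bbox is None or is_empty(tokens):
        return True  # nothing to check, or a placeholder box on an empty cell
    x0, y0, x1, y1 = bbox
    return x0 < x1 and y0 < y1
```

Markup-only tokens such as `("<sup>", " ", "</sup>")` count as content here, so a zero-area box on such a cell fails the check.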
The extras field is opaque to validation but must be JSON-serializable for codecs that round-trip through it.
- Cell styling (fonts, colors, borders). HTML attributes beyond structure are lost on import. Codecs may preserve them via `extras` but the IR does not validate them.
- Multi-table documents. One sample = one table.
- Page-level layout. Use `docling-core` for that.
A codec is registered against a stable string name and provides four operations:
    from collections.abc import Iterable, Iterator
    from typing import IO, Protocol

    class Codec(Protocol):
        name: str  # registry key, e.g. "pubtabnet-2.0.0"
        spec_version: str  # version of the source format, not of this library
        media_type: str  # canonical MIME type, e.g. "application/jsonl"

        def read(self, source: IO[str]) -> Iterator[TableSample]: ...

        def write(self, samples: Iterable[TableSample], sink: IO[str]) -> None: ...

        def lossy_read(self) -> frozenset[str]:
            """Field paths within the source format that are dropped on read."""
            ...

        def lossy_write(self) -> frozenset[str]:
            """IR fields that cannot be expressed in this format on write."""
            ...

1. `read` MUST yield samples lazily. No full-file slurp.
2. `read` parses each record into a `TableSample` and MUST raise (with the record offset, see 4) on any record it cannot parse: invalid JSON, unknown tokens, structure/cell-count mismatch, etc. `read` does not evaluate the structural invariants (§5.2). Invariant checking is a separate, opt-in step performed by `validate(sample, profile)` (§8), so the caller chooses the strictness and pays the cost only when wanted, and may still read records that are parseable but invariant-invalid (common in real corpora). See ADR 0008.
3. `write` MUST produce output that, when re-read by the same codec, round-trips losslessly except for the fields declared in `lossy_write`.
4. Errors during streaming MUST include the source line / record offset.
5. `lossy_read` and `lossy_write` MUST be honest. CI in this repository enforces this via round-trip tests.
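The lazy-read and error-offset requirements can be met with a plain generator over the input lines. The sketch below is one possible shape, assuming a JSONL source with one JSON object per line; `RecordError` and `read_records` are hypothetical names, and the sketch yields raw dicts rather than `TableSample` instances to stay self-contained.

```python
import io
import json
from typing import IO, Iterator


class RecordError(ValueError):
    """Hypothetical error type carrying the 1-based line of the bad record."""

    def __init__(self, line_no: int, reason: str) -> None:
        super().__init__(f"line {line_no}: {reason}")
        self.line_no = line_no


def read_records(source: IO[str]) -> Iterator[dict]:
    """Yield one parsed record at a time; never read the whole file."""
    for line_no, line in enumerate(source, start=1):
        if not line.strip():
            continue  # tolerating blank lines is an assumption of this sketch
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise RecordError(line_no, f"invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise RecordError(line_no, "record is not a JSON object")
        yield record


if __name__ == "__main__":
    src = io.StringIO('{"filename": "a.png"}\n{broken\n')
    records = read_records(src)
    print(next(records))  # the first record parses
    try:
        next(records)  # the second raises with its line number
    except RecordError as err:
        print(err)
```

Because the generator raises only when the bad record is reached, every record before it has already been delivered to the caller.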
    from tablecodec import codecs

    codecs.register(MyCodec())
    codecs.get("pubtabnet-2.0.0")
    codecs.detect(source)  # returns name | None by peeking at the first record

Third-party codecs distribute themselves as separate PyPI packages (`tablecodec-mycodec`) and self-register via the entry-point group `tablecodec.codecs`.
| Codec name | Source format | Read | Write | Notes |
|---|---|---|---|---|
| `pubtabnet-1.0.0` | PubTabNet 1.x (no bbox) | ✓ | ✓ | bbox always None on read |
| `pubtabnet-2.0.0` | PubTabNet 2.0.0 | ✓ | ✓ | Reference codec for §11 invariants |
| `fintabnet` | FinTabNet (original PubTabNet-like) | ✓ | ✓ | |
| `fintabnet-otsl` | `ds4sd/FinTabNet_OTSL` (HF) | ✓ | ✓ | Lossy on extras.otsl_raw removal |
| `otsl-1.0.0` | OTSL token sequences | ✓ | ✓ | Square-table assumption per spec |
| `tableformer` | TableFormer Format (IBM internal) | ✓ | ✓ | bbox required on empty cells |
| `doctags-tables` | DocTags table subset (Granite-Docling) | ✓ | △ | Write is OTSL-equivalent subset only |
| `pubtables-1m` | Microsoft PubTables-1M (object-det.) | ✓ | ✗ | Read-only: bbox-first, no token order |
| `tablebank` | TableBank | ✓ | △ | Tokens absent in source |

△ = partial / lossy. Each codec's documentation MUST specify exactly which fields are affected.
Codecs not listed here are out-of-scope for v1.0 but may be added as third-party packages.
Validation is layered. A user explicitly opts into the strictness they need.
| Profile | Enforces |
|---|---|
| `lenient` | I-01, I-02, I-03, I-05 only. Permits gaps and overlaps (I-04 off). |
| `default` | All of §5.2 (I-01 through I-07). |
| `pubtabnet-2.0` | `default` plus: every non-empty cell has bbox. |
| `tableformer` | `default` plus: every cell, empty or not, has bbox. |
| `strict` | `default` plus: cross-check every bbox against the image rectangle. A bbox-free sample needs no image metadata; once any cell carries a bbox, the sample MUST declare image_width/image_height and every bbox must satisfy 0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height (upper bound inclusive). See ADR 0012. |

Profile selection:
    from tablecodec import validate, profiles

    errors = validate(sample, profile=profiles.PUBTABNET_2_0)

Errors are returned as a structured list. Validators never raise on data; they raise only on programmer error (e.g. unknown profile name).
tablecodec provides one feature that no inference-oriented library offers: explicit, automated loss analysis between any two codecs.
    from tablecodec import analyze_loss

    report = analyze_loss(source="pubtabnet-2.0.0", target="otsl-1.0.0")
    # Report contains:
    # - source_fields_dropped_on_read
    # - ir_fields_unrepresentable_in_target
    # - round_trip_classification: "lossless" | "structure-preserving" | "lossy"

This is the operational backbone for any migration from one format to another. CI in this repository runs analyze_loss across the full Cartesian product of supported codecs and publishes the result with each release.
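Because the analysis is static, it can be pictured as set arithmetic over the two codecs' declarations. The sketch below is an illustration only: the classification rule, the `STRUCTURAL_FIELDS` set, and the field-path spelling (for example `cells.bbox`) are all assumptions, since this specification does not define them.

```python
from dataclasses import dataclass

# Assumption: which IR fields count as "structure" for classification purposes.
STRUCTURAL_FIELDS = frozenset(
    {"nrows", "ncols", "cells.row", "cells.col", "cells.rowspan", "cells.colspan"}
)


@dataclass(frozen=True)
class LossReport:
    source_fields_dropped_on_read: frozenset[str]
    ir_fields_unrepresentable_in_target: frozenset[str]
    round_trip_classification: str


def classify(source_lossy_read: frozenset[str],
             target_lossy_write: frozenset[str]) -> LossReport:
    """Combine two loss declarations into a report without reading any data."""
    if not source_lossy_read and not target_lossy_write:
        kind = "lossless"
    elif target_lossy_write & STRUCTURAL_FIELDS:
        kind = "lossy"  # assumed rule: losing any structural field is fully lossy
    else:
        kind = "structure-preserving"
    return LossReport(source_lossy_read, target_lossy_write, kind)
```

Under these assumptions, a target that cannot store `cells.bbox` but keeps all span information would be classified as structure-preserving.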
| Operation | Guarantee |
|---|---|
| Reading a 568k-sample JSONL | Constant memory (one sample at a time), single-pass. |
| Writing the same | Constant memory. |
| Validation across the dataset | Lazy generator, constant memory. |
| `analyze_loss` | No data read at all; static analysis of codec metadata. |
| `convert` | Constant memory; backpressure-safe iterator chain. |

These guarantees are part of the public API contract and are enforced by CI benchmarks.
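The constant-memory property follows from chaining generators, where each stage pulls one record only when the next stage asks for it. The sketch below demonstrates this with a counting wrapper around the line source; `CountingLines`, `read`, and `only_valid` are hypothetical helpers, and the record shape is invented for the demonstration.

```python
import json
from typing import Iterable, Iterator


class CountingLines:
    """Wraps an iterable of lines and counts how many were pulled so far."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._it = iter(lines)
        self.consumed = 0

    def __iter__(self) -> "CountingLines":
        return self

    def __next__(self) -> str:
        line = next(self._it)
        self.consumed += 1
        return line


def read(lines: Iterable[str]) -> Iterator[dict]:
    for line in lines:
        yield json.loads(line)


def only_valid(records: Iterable[dict]) -> Iterator[dict]:
    for record in records:
        if record.get("nrows", 0) >= 1:
            yield record


# The source is itself a generator, so the 500k lines are never materialized.
source = CountingLines(f'{{"id": {i}, "nrows": 1}}\n' for i in range(500_000))
pipeline = only_valid(read(source))
first = next(pipeline)  # pulls exactly one line through both stages
```

After the single `next` call, `source.consumed` is 1, which shows that no stage buffers ahead of its consumer.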
A separate, vendor-neutral repository (tablecodec/conformance) hosts test fixtures and expected outputs. Any implementation (in any language) claiming tablecodec-compatibility MUST pass this suite.
The suite is structured as:
    conformance/
    ├── samples/
    │   ├── pubtabnet-2.0.0/
    │   │   ├── 001_simple.jsonl
    │   │   ├── 002_complex_spans.jsonl
    │   │   └── ...
    │   ├── otsl-1.0.0/
    │   └── ...
    ├── expectations/
    │   ├── 001_simple.ir.json    # expected IR after read
    │   └── ...
    └── INDEX.json                # machine-readable test manifest

Vendors (Docling, PaddleOCR, MMOCR, internal pipelines) are invited to certify their preprocessing against this suite and link the certification badge from their READMEs.
Available with pip install "tablecodec[cli]".

    tablecodec validate <file> [--profile NAME] [--codec <codec>] [--json]
    tablecodec convert <in> <out> --from <codec> --to <codec> [--dry-run]
    tablecodec stats <file> [--codec <codec>] [--json]
    tablecodec diff <a> <b> [--codec <codec>]
    tablecodec analyze-loss --from <codec> --to <codec>
    tablecodec codecs list

--codec selects the reader; when omitted, the codec is auto-detected
from the file (codecs.detect). (--strict is just --profile strict;
parallel conversion is not offered — convert is a constant-memory
single-pass stream.)
All commands stream input and exit non-zero on validation failures, suitable for use in CI / data pipelines.
| Package | Dependencies |
|---|---|
| `tablecodec` | Python stdlib only. Hard requirement. |
| `tablecodec[teds]` | apted, lxml (TEDS evaluation wrappers) |
| `tablecodec[cli]` | click (CLI) |
| `tablecodec[hf]` | datasets, defusedxml (occasional, local-only e2e harness; not a library runtime dependency) |
| `tablecodec[all]` | All of the above |

A pull request that introduces a new third-party dependency to the core package MUST be rejected. CI enforces this via an import-graph linter (the semgrep rules under .semgrep/rules/).
The earlier `fast` (orjson) and `validate` (pydantic) extras were removed (ADR 0009): the work they would touch (JSONL parsing, IR construction, validation) happens in the zero-dependency core, where third-party imports are forbidden, so the extras could never be wired in. Stricter validation is provided by the layered profiles (§8), which are stdlib-only.
tablecodec follows Semantic Versioning, with the following additional commitments:
- 0.x: API may change. No stability promises.
- 1.0: API frozen for minor releases. Breaking changes require a major bump.
- LTS: Each major version is supported (security and codec compatibility fixes) for at least 3 years from its initial release.
- Format spec drift: When an upstream format changes (e.g. PubTabNet publishes 3.0.0), a new codec name is registered (`pubtabnet-3.0.0`). Old codecs are kept until their corresponding upstream format is officially deprecated.
The library version and each codec's spec_version are tracked
independently — the library version is tablecodec.__version__
(printed by tablecodec --version), while each codec carries its own
spec_version attribute and encodes the format version in its registry
name (e.g. pubtabnet-2.0.0). The IR has no separate runtime version
constant: in the 0.x line it evolves with the library, and this document
(see the Status header) is the versioned description of the IR.
tablecodec and docling-core are complementary, not competing.
| Aspect | `docling-core` | `tablecodec` |
|---|---|---|
| Scope | Entire documents (pages, sections, tables, figures) | Tables only |
| IR | `DoclingDocument` (Pydantic, hierarchical) | `TableSample` (dataclass, 2D grid) |
| Dependencies | Pydantic, jsonschema, latex2mathml, typer, ... | Stdlib |
| Use case | Document conversion pipelines | Dataset I/O, validation, format migration |
| Format opinion | DocTags / OTSL native, others as importers | All formats first-class |

A bridge codec (tablecodec-docling) maps between DoclingDocument.tables and TableSample instances (read and write), registering through the tablecodec.codecs entry-point group (§6.2). read yields one TableSample per table; write emits one DoclingDocument per sample (round-trip-capable, modulo a declared write-loss: docling stores one text string per cell, so multi-token segmentation and IR extras are lossy on write). It is being developed in-repo under packages/tablecodec-docling/ as a temporary monorepo member, to be extracted to its own package before publishing (ADR 0013); the dependency on docling-core stays out of the stdlib-only core.
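The declared write-loss can be seen in a toy round trip: once tokens are collapsed into a single text string, the original segmentation cannot be recovered. The helpers below are hypothetical, and plain concatenation is an assumption about how a bridge might join tokens.

```python
def to_cell_text(tokens: tuple[str, ...]) -> str:
    """Collapse IR tokens into one string (assumed: plain concatenation)."""
    return "".join(tokens)


def from_cell_text(text: str) -> tuple[str, ...]:
    """Read a single-string cell back into the IR as at most one token."""
    return (text,) if text else ()


tokens = ("Net", " ", "income")
round_tripped = from_cell_text(to_cell_text(tokens))
# The text survives, but the three-token segmentation does not.
text_preserved = "".join(round_tripped) == "".join(tokens)
segmentation_preserved = round_tripped == tokens
```

Here `text_preserved` is true while `segmentation_preserved` is false, which is why token segmentation belongs in the bridge codec's `lossy_write` declaration.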
tablecodec is released under the MIT License. Contributions are accepted under the same license.
The library deliberately maintains no dependency on any single vendor's format. Should a single format become the de facto standard, that format will be supported as one codec among equals, not as a privileged IR.
The Conformance Suite repository is a separate, MIT-licensed artifact intended for vendor-neutral certification. Should a foundation (e.g. Linux Foundation AAIF) wish to adopt it, the maintainers will support the donation under preservation of the format-neutrality clause above.
The IR/contract points intentionally left undecided for v1.0 (cell ordering,
in-cell tokenization, float bbox, IR JSON Schema), together with all other
forward-looking work, are tracked in docs/intent.md (Future work) — the
single home for roadmap and future-work items. This spec documents the current contract
only.
- PubTabNet: https://github.com/ibm-aur-nlp/PubTabNet
- PubTabNet on Hugging Face (current mirror): https://huggingface.co/datasets/ajimeno/PubTabNet
- FinTabNet (IBM; the original `/exchanges/data/all/fintabnet/` page has been retired; see the IBM Data Exchange index): https://developer.ibm.com/exchanges/data/
- FinTabNet_OTSL (Docling Project): https://huggingface.co/datasets/ds4sd/FinTabNet_OTSL · https://huggingface.co/datasets/docling-project/FinTabNet_OTSL
- PubTables-1M (Microsoft): https://github.com/microsoft/table-transformer
- TableBank: https://github.com/doc-analysis/TableBank
- SciTSR: https://github.com/Academic-Hammer/SciTSR
- MUSTARD (multilingual OTSL): https://github.com/IITB-LEAP-OCR/SPRINT
- Zhong, ShafieiBavani, Yepes. Image-based table recognition: data, model, and evaluation. ECCV 2020. https://arxiv.org/abs/1911.10683
- Nassar et al. TableFormer: Table Structure Understanding with Transformers. CVPR 2022. https://arxiv.org/abs/2203.01017
- Lysak et al. Optimized Table Tokenization for Table Structure Recognition (OTSL). ICDAR 2023. https://arxiv.org/abs/2305.03393
- Smock, Pesala, Abraham. PubTables-1M: Towards Comprehensive Table Extraction. CVPR 2022. https://arxiv.org/abs/2110.00061
- Smock et al. GriTS: Grid table similarity metric. ICDAR 2023.
- Mehta et al. SPRINT: Script-agnostic Structure Recognition in Tables. arXiv 2025-03. https://arxiv.org/abs/2503.11932
- TEDS metric (official): https://github.com/ibm-aur-nlp/PubTabNet/blob/master/src/metric.py
- OTSL reference parser (Docling): https://github.com/docling-project/docling-ibm-models/blob/main/docling_ibm_models/tableformer/otsl.py
- `docling-core` (DoclingDocument): https://github.com/docling-project/docling-core
- PaddleOCR PP-Structure table module: https://github.com/PaddlePaddle/PaddleOCR/tree/main/ppstructure/table
- UniTable mini-PubTabNet format: https://github.com/poloclub/unitable
- Docling (project): https://github.com/docling-project/docling
- Linux Foundation Agentic AI Foundation (AAIF)
- Granite-Docling-258M (DocTags origin, January 2026): https://huggingface.co/ibm-granite/granite-docling-258M
- Python `codecs` module (registry pattern): https://docs.python.org/3/library/codecs.html
- JSON Schema Test Suite (conformance pattern): https://github.com/json-schema-org/JSON-Schema-Test-Suite
- Web Platform Tests (multi-vendor conformance): https://github.com/web-platform-tests/wpt