artefactory
diff --git a/annotation_OCR/README.md b/annotation_OCR/README.md
Lines changed: 126 additions & 18 deletions
diff --git a/annotation_OCR/manifests/README.md b/annotation_OCR/manifests/README.md
Lines changed: 46 additions & 0 deletions
@@ -1,13 +1,22 @@
 # OCR Annotation Interface
 
-Browser interface for comparing raw OCR page images with the corresponding Markdown page extracted by DeepSeekOCR. The app stores page-level annotations under `annotation_OCR/sessions/` so quality labels can later be joined to LLM benchmark outputs.
+Browser interface for reviewing OCR table extraction quality. The app now
+defaults to table-level items extracted from `*_det.mmd`, shows the isolated
+HTML table in the extracted-content pane, and auto-centers the raw page image
+on the detected table region while still allowing manual zoom-out for more
+context.
+
+Annotations are stored under `annotation_OCR/sessions/` so quality labels can
+later be joined to downstream benchmark outputs.
 
 ## Run
 
 ### Headless mode (recommended for multi-user)
 
 Start the server with no session arguments — annotators create/resume sessions
-from the browser landing page:
+from the browser landing page. If `annotation_OCR/manifests/tables_5000.json`
+exists, the server uses it automatically for fast session creation. Otherwise
+it falls back to building a sampled table queue directly from the OCR corpus.
 
 ```bash
 uv run python annotation_OCR/server.py --host 0.0.0.0 --port 5050
@@ -25,7 +34,8 @@ From the repository root:
 uv run python annotation_OCR/server.py \
   --session-name "table QA smoke" \
   --annotator "your-name" \
-  --queue-mode table-candidates \
+  --queue-mode tables \
+  --sample-size 100 \
   --host 127.0.0.1 \
   --port 5050
 ```
@@ -36,13 +46,35 @@ For a small smoke run:
 uv run python annotation_OCR/server.py \
   --session-name smoke \
   --annotator test \
-  --queue-mode table-candidates \
+  --queue-mode tables \
+  --sample-size 20 \
   --limit-reports 2 \
-  --limit 20 \
   --host 127.0.0.1 \
   --port 5050
 ```
 
+To force the server to use an explicit precomputed manifest:
+
+```bash
+uv run python annotation_OCR/server.py \
+  --manifest-path annotation_OCR/manifests/tables_5000.json \
+  --host 127.0.0.1 \
+  --port 5050
+```
+
+To use precomputed study-session bundles for a paper annotation round:
+
+```bash
+uv run python annotation_OCR/server.py \
+  --study-bundle annotation_OCR/manifests/study_sessions_15.json \
+  --host 127.0.0.1 \
+  --port 5050
+```
+
+Each new session created from the landing page then receives the next fixed
+session queue from that bundle, so the progress bar tracks a real per-annotator
+target rather than the whole table pool.
+
 Resume an existing session:
 
 ```bash
@@ -57,32 +89,102 @@ ssh -L 5050:127.0.0.1:5050 USER@SERVER
 
 Then open `http://127.0.0.1:5050` locally.
 
-The extracted-content pane shows inline OCR images by default. Turn off `Inline images` if you want a lighter placeholder-only Markdown preview.
+For table sessions, the extracted-content pane shows only the isolated table and
+the raw-image pane auto-refocuses on the detected bounding box. Use `Refocus`
+or press `f` to jump back to the table after manual exploration.
+
+## Precompute A Reusable 5,000-Table Manifest
+
+Build the reusable subset once offline:
+
+```bash
+mkdir -p annotation_OCR/manifests
+
+uv run python annotation_OCR/ocr_index.py \
+  --queue-mode tables \
+  --sample-size 5000 \
+  --seed 42 \
+  --output annotation_OCR/manifests/tables_5000.json
+```
+
+That manifest can then be reused by the server so new annotation sessions do
+not need to rescan the OCR corpus.
+
+## Build Study Session Bundles
+
+For hybrid annotation rounds, which mix single-coded tables with a triple-coded
+agreement subset, build one bundle per candidate annotator count. Each generated
+bundle keeps every session within the target range of 120 to 140 items:
+
+```bash
+uv run python annotation_OCR/study_sessions.py \
+  --source-manifest annotation_OCR/manifests/tables_5000.json \
+  --output-dir annotation_OCR/manifests \
+  --annotators 14 15 16 \
+  --seed 42
+```
+
+This writes:
+
+- `annotation_OCR/manifests/study_sessions_14.json`
+- `annotation_OCR/manifests/study_sessions_15.json`
+- `annotation_OCR/manifests/study_sessions_16.json`
+
+The 15- and 16-annotator bundles use 1,500 unique tables, 300 of which are
+triple-coded for agreement. The 14-annotator bundle lowers the agreement subset
+to 220 so that every session quota stays within the 120 to 140 target range.
+
+## Compute Agreement After Annotation
+
+After the study round, compute overlap agreement plus accept/reject ratios with:
+
+```bash
+uv run python annotation_OCR/study_agreement.py \
+  --study-bundle annotation_OCR/manifests/study_sessions_15.json
+```
+
+By default this writes analysis artifacts under:
+
+- `annotation_OCR/sessions/study_analysis/study_sessions_15/summary.md`
+- `annotation_OCR/sessions/study_analysis/study_sessions_15/summary.json`
+- `annotation_OCR/sessions/study_analysis/study_sessions_15/session_metrics.csv`
+- `annotation_OCR/sessions/study_analysis/study_sessions_15/item_metrics.csv`
+
+The script auto-discovers sessions created from that bundle via their stored
+`study_bundle_path` and `study_slot`. It reports exact agreement, pairwise
+agreement, Fleiss' kappa, and accept/reject ratios both at the raw vote level
+and at the final table-decision level.
 
 ## Data Sources
 
 Defaults:
 
 - OCR Markdown root: `DeepSeekOCR_Ardian_pruned_1k/`
 - Raw image root: `/data/workspace/charles/pdf_ocr_deepseek/DeepSeekOCR_Ardian_raw_3kdocs/`
+- Default reusable manifest path: `annotation_OCR/manifests/tables_5000.json`
 
-Each queued item maps one `.mmd` page split to the raw PNG with the same zero-based page index, for example page index `12` maps to `pages/page_0012.png`. The manifest records mapping warnings such as missing raw images or page-count mismatches.
+Each queued table item maps back to the raw PNG page with the same zero-based
+page index; for example, page index `12` maps to `pages/page_0012.png`. Table
+items carry the `_det.mmd` bounding box used by the UI to center the preview.
+The manifest records mapping warnings such as missing raw images or page-count
+mismatches.
 
 ## Queue Modes
 
-- `table-candidates`: default. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
-- `all`: queues every page.
-- `sample`: seeded random sample across all discovered pages. Use `--sample-size` and `--seed`.
+- `tables`: default. Queues table-level items from `*_det.mmd`. Use `--sample-size` and `--seed` for reproducible random sampling.
+- `table-candidates`: legacy page-level mode. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
+- `all`: legacy page-level mode that queues every page.
+- `sample`: legacy seeded random sample across all discovered pages.
 
 Indexer smoke check:
 
 ```bash
 uv run python annotation_OCR/ocr_index.py \
   --ocr-root DeepSeekOCR_Ardian_pruned_1k \
   --raw-root /data/workspace/charles/pdf_ocr_deepseek/DeepSeekOCR_Ardian_raw_3kdocs \
-  --queue-mode table-candidates \
+  --queue-mode tables \
+  --sample-size 20 \
   --limit-reports 2 \
-  --limit 20 \
   --check
 ```
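
Reviewer note on the "Compute Agreement After Annotation" section in this hunk: the diff does not show how `study_agreement.py` computes its metrics, so the following is only a minimal textbook sketch of exact agreement and Fleiss' kappa. It assumes every overlap table has the same number of votes and that labels are `ok`, `not_ok`, and `uncertain` (the `ok` label and both function names are assumptions, not taken from the patch).

```python
import math
from collections import Counter


def exact_agreement(votes_per_item):
    """Share of items on which every rater chose the same label."""
    return sum(len(set(v)) == 1 for v in votes_per_item) / len(votes_per_item)


def fleiss_kappa(votes_per_item, categories):
    """Fleiss' kappa for items that each received the same number of votes."""
    n_items = len(votes_per_item)
    n_raters = len(votes_per_item[0])
    if n_raters < 2 or any(len(v) != n_raters for v in votes_per_item):
        raise ValueError("every item needs the same number (>= 2) of votes")

    counts = [Counter(v) for v in votes_per_item]
    pair_denominator = n_raters * (n_raters - 1)

    # Mean observed agreement: share of agreeing rater pairs, averaged over items.
    p_bar = sum(
        (sum(c[k] ** 2 for k in categories) - n_raters) / pair_denominator
        for c in counts
    ) / n_items

    # Chance agreement from the overall label proportions.
    p_exp = sum(
        (sum(c[k] for c in counts) / (n_items * n_raters)) ** 2
        for k in categories
    )

    if math.isclose(p_exp, 1.0):
        return float("nan")  # undefined when all votes fall in one category
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical labels for illustration only.
LABELS = ("ok", "not_ok", "uncertain")
```

With triple-coded tables, each inner list would hold three votes. The mean observed agreement `p_bar` is the same quantity usually reported as pairwise agreement.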
 
@@ -93,7 +195,8 @@ uv run python annotation_OCR/ocr_index.py \
 - `u`: mark Uncertain, save, advance
 - `j` / right arrow: next page
 - `k` / left arrow: previous page
-- `+`, `-`, `0`: zoom controls
+- `+`, `-`, `0`: zoom / reset
+- `f`: refocus on the detected table
 - `?`: shortcut dialog
 
 Shortcuts are disabled while typing in notes or editing form controls.
@@ -103,10 +206,10 @@ Shortcuts are disabled while typing in notes or editing form controls.
 Each session writes to `annotation_OCR/sessions/{session_id}/`:
 
 - `metadata.json`: session name, annotator, configuration, counts, timestamps.
-- `manifest.json`: queued pages and mapping diagnostics.
+- `manifest.json`: queued items and mapping diagnostics.
 - `annotations.jsonl`: append-only event log, one saved annotation per line.
 - `current_annotations.json`: latest annotation per item, written atomically.
-- `summary.csv`: one row per queued page, including unreviewed pages.
+- `summary.csv`: one row per queued item, including unreviewed items.
 - `summary.md`: status-count overview.
 
 Regenerate summaries:
@@ -125,12 +228,15 @@ Primary fields:
 
 Identity fields include `industry_slug`, `report_name`, `exchange`, `ticker`, `year`, `page_index`, `page_number`, `mmd_path`, `raw_png_path`, and `page_text_sha256`.
 
+For table sessions, summary rows also include `item_kind`, `table_index`,
+`table_row_count`, `table_col_count`, `det_mmd_path`, and `focus_bbox`.
+
 ## Downstream Joins
 
-For page-level filtering, join annotation summaries on:
+For table-level filtering, join annotation summaries on:
 
 ```text
-exchange, ticker, year, page_index
+exchange, ticker, year, page_index, table_index
 ```
 
 For report-level benchmark filtering, aggregate page labels to:
@@ -139,4 +245,6 @@ For report-level benchmark filtering, aggregate page labels to:
 exchange, ticker, year
 ```
 
-A conservative report-level rule is to exclude a report when any reviewed table-candidate page is `not_ok`, or when the share of `uncertain` pages exceeds a threshold chosen for the benchmark run.
+A conservative report-level rule is to exclude a report when any reviewed table
+item is `not_ok`, or when the share of `uncertain` table items exceeds a
+threshold chosen for the benchmark run.
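
Reviewer note on the "Downstream Joins" section: the conservative report-level rule can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not code from the patch: the label column is assumed to be called `status`, unreviewed rows are assumed to have an empty value there, and `excluded_reports` and the `0.2` default threshold are hypothetical.

```python
from collections import defaultdict

# Report-level key taken from the README's join columns.
REPORT_KEY = ("exchange", "ticker", "year")


def excluded_reports(rows, max_uncertain_share=0.2, status_field="status"):
    """Return report keys to exclude under the conservative rule.

    rows: dicts such as those produced by csv.DictReader over summary.csv.
    status_field: assumed name of the label column (hypothetical).
    """
    statuses_by_report = defaultdict(list)
    for row in rows:
        status = row.get(status_field)
        if status:  # skip unreviewed items (assumed to have an empty status)
            key = tuple(row[name] for name in REPORT_KEY)
            statuses_by_report[key].append(status)

    excluded = set()
    for key, statuses in statuses_by_report.items():
        uncertain_share = statuses.count("uncertain") / len(statuses)
        if "not_ok" in statuses or uncertain_share > max_uncertain_share:
            excluded.add(key)
    return excluded
```

Table-level consumers would instead keep the full `exchange, ticker, year, page_index, table_index` key and filter row by row.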
@@ -0,0 +1,46 @@
+# Table Manifests
+
+Place reusable sampled table manifests here.
+
+Recommended default:
+
+```bash
+uv run python annotation_OCR/ocr_index.py \
+  --queue-mode tables \
+  --sample-size 5000 \
+  --seed 42 \
+  --output annotation_OCR/manifests/tables_5000.json
+```
+
+When `tables_5000.json` exists, `annotation_OCR/server.py` will use it by default for new sessions.
+
+## Study Session Bundles
+
+For paper annotation rounds, also build the headcount-specific session bundles:
+
+```bash
+uv run python annotation_OCR/study_sessions.py \
+  --source-manifest annotation_OCR/manifests/tables_5000.json \
+  --output-dir annotation_OCR/manifests \
+  --annotators 14 15 16 \
+  --seed 42
+```
+
+This creates:
+
+- `study_sessions_14.json`
+- `study_sessions_15.json`
+- `study_sessions_16.json`
+
+Use the bundle matching the final annotator count when starting the server:
+
+```bash
+uv run python annotation_OCR/server.py \
+  --study-bundle annotation_OCR/manifests/study_sessions_15.json
+```
+
+Why the 14-annotator bundle differs:
+
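+- `1500 unique + 300 triple-coded` requires `2100` total annotations (`1500 + 2 × 300`, since each triple-coded table needs two extra passes).
+- That fits 15 annotators (`140` each) or 16 annotators (about `131` each) while keeping each session in the `120–140` range; 14 annotators would need `150` each.
+- For 14 annotators, the bundle uses `220` agreement tables instead, for `1940` total annotations (`1500 + 2 × 220`) and per-session targets of `138–139`.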
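
Reviewer note: the quota arithmetic in the list above can be checked with a short sketch. It assumes annotations are spread as evenly as possible across sessions, which matches the stated `138–139` targets but is not confirmed by the diff; `session_targets` is a hypothetical helper, not a function in `study_sessions.py`.

```python
def session_targets(unique_tables, triple_coded, annotators):
    """Return (total annotations, per-session targets) for one study bundle.

    Each triple-coded table is annotated three times, so it contributes
    two annotations beyond the single pass every unique table receives.
    """
    total = unique_tables + 2 * triple_coded
    base, extra = divmod(total, annotators)
    # Assumed even split: `extra` sessions get one more item than the rest.
    targets = [base + 1] * extra + [base] * (annotators - extra)
    return total, targets


def within_range(targets, low=120, high=140):
    """Check that every session target falls inside the README's range."""
    return all(low <= t <= high for t in targets)


for n_annotators, agreement in ((14, 300), (14, 220), (15, 300), (16, 300)):
    total, targets = session_targets(1500, agreement, n_annotators)
    print(n_annotators, agreement, total, min(targets), max(targets),
          within_range(targets))
```

Running it shows that 14 annotators with 300 agreement tables would land at 150 items per session, outside the range, while 220 agreement tables bring the targets back to 138 or 139.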