Commit e5ac1b2

ENH: Make study cohortes with definite num of annotators
1 parent b641653 commit e5ac1b2

4 files changed

Lines changed: 1236 additions & 18 deletions

annotation_OCR/README.md

Lines changed: 126 additions & 18 deletions
@@ -1,13 +1,22 @@
11
# OCR Annotation Interface
22

3-
Browser interface for comparing raw OCR page images with the corresponding Markdown page extracted by DeepSeekOCR. The app stores page-level annotations under `annotation_OCR/sessions/` so quality labels can later be joined to LLM benchmark outputs.
3+
Browser interface for reviewing OCR table extraction quality. The app now
4+
defaults to table-level items extracted from `*_det.mmd`, shows the isolated
5+
HTML table in the extracted-content pane, and auto-centers the raw page image
6+
on the detected table region while still allowing manual zoom-out for more
7+
context.
8+
9+
Annotations are stored under `annotation_OCR/sessions/` so quality labels can
10+
later be joined to downstream benchmark outputs.
411

512
## Run
613

714
### Headless mode (recommended for multi-user)
815

916
Start the server with no session arguments — annotators create/resume sessions
10-
from the browser landing page:
17+
from the browser landing page. If `annotation_OCR/manifests/tables_5000.json`
18+
exists, the server uses it automatically for fast session creation. Otherwise
19+
it falls back to building a sampled table queue directly from the OCR corpus.
1120

1221
```bash
1322
uv run python annotation_OCR/server.py --host 0.0.0.0 --port 5050
@@ -25,7 +34,8 @@ From the repository root:
2534
uv run python annotation_OCR/server.py \
2635
--session-name "table QA smoke" \
2736
--annotator "your-name" \
28-
--queue-mode table-candidates \
37+
--queue-mode tables \
38+
--sample-size 100 \
2939
--host 127.0.0.1 \
3040
--port 5050
3141
```
@@ -36,13 +46,35 @@ For a small smoke run:
3646
uv run python annotation_OCR/server.py \
3747
--session-name smoke \
3848
--annotator test \
39-
--queue-mode table-candidates \
49+
--queue-mode tables \
50+
--sample-size 20 \
4051
--limit-reports 2 \
41-
--limit 20 \
4252
--host 127.0.0.1 \
4353
--port 5050
4454
```
4555

56+
To force the server to use an explicit precomputed manifest:
57+
58+
```bash
59+
uv run python annotation_OCR/server.py \
60+
--manifest-path annotation_OCR/manifests/tables_5000.json \
61+
--host 127.0.0.1 \
62+
--port 5050
63+
```
64+
65+
To use precomputed study-session bundles for a paper annotation round:
66+
67+
```bash
68+
uv run python annotation_OCR/server.py \
69+
--study-bundle annotation_OCR/manifests/study_sessions_15.json \
70+
--host 127.0.0.1 \
71+
--port 5050
72+
```
73+
74+
Each new session created from the landing page then receives the next fixed
75+
session queue from that bundle, so the progress bar tracks a real per-annotator
76+
target rather than the whole table pool.
77+
4678
Resume an existing session:
4779

4880
```bash
@@ -57,32 +89,102 @@ ssh -L 5050:127.0.0.1:5050 USER@SERVER
5789

5890
Then open `http://127.0.0.1:5050` locally.
5991

60-
The extracted-content pane shows inline OCR images by default. Turn off `Inline images` if you want a lighter placeholder-only Markdown preview.
92+
For table sessions, the extracted-content pane shows only the isolated table and
93+
the raw-image pane auto-refocuses on the detected bounding box. Use `Refocus`
94+
or press `F` to jump back to the table after manual exploration.
95+
96+
## Precompute A Reusable 5,000-Table Manifest
97+
98+
Build the reusable subset once offline:
99+
100+
```bash
101+
mkdir -p annotation_OCR/manifests
102+
103+
uv run python annotation_OCR/ocr_index.py \
104+
--queue-mode tables \
105+
--sample-size 5000 \
106+
--seed 42 \
107+
--output annotation_OCR/manifests/tables_5000.json
108+
```
109+
110+
That manifest can then be reused by the server so new annotation sessions do
111+
not need to rescan the OCR corpus.
112+
113+
## Build Study Session Bundles
114+
115+
For hybrid annotation rounds, build one bundle for each possible annotator
116+
count. The generated bundles already keep each session inside the target range
117+
of 120 to 140 items:
118+
119+
```bash
120+
uv run python annotation_OCR/study_sessions.py \
121+
--source-manifest annotation_OCR/manifests/tables_5000.json \
122+
--output-dir annotation_OCR/manifests \
123+
--annotators 14 15 16 \
124+
--seed 42
125+
```
126+
127+
This writes:
128+
129+
- `annotation_OCR/manifests/study_sessions_14.json`
130+
- `annotation_OCR/manifests/study_sessions_15.json`
131+
- `annotation_OCR/manifests/study_sessions_16.json`
132+
133+
The 15- and 16-annotator bundles use 1500 unique tables with 300 triple-coded
134+
agreement tables. The 14-annotator bundle lowers the agreement subset to 220 so
135+
all session quotas still stay within the 120 to 140 target range.
136+
137+
## Compute Agreement After Annotation
138+
139+
After the study round, compute overlap agreement plus accept/reject ratios with:
140+
141+
```bash
142+
uv run python annotation_OCR/study_agreement.py \
143+
--study-bundle annotation_OCR/manifests/study_sessions_15.json
144+
```
145+
146+
By default this writes analysis artifacts under:
147+
148+
- `annotation_OCR/sessions/study_analysis/study_sessions_15/summary.md`
149+
- `annotation_OCR/sessions/study_analysis/study_sessions_15/summary.json`
150+
- `annotation_OCR/sessions/study_analysis/study_sessions_15/session_metrics.csv`
151+
- `annotation_OCR/sessions/study_analysis/study_sessions_15/item_metrics.csv`
152+
153+
The script auto-discovers sessions created from that bundle via their stored
154+
`study_bundle_path` and `study_slot`. It reports exact agreement, pairwise
155+
agreement, Fleiss' kappa, and accept/reject ratios both at the raw vote level
156+
and at the final table-decision level.
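
For readers who want to sanity-check the Fleiss' kappa figure by hand, the following is a minimal sketch of the standard formula. It is not taken from `study_agreement.py`. It assumes every agreement table received the same number of votes (three for triple-coded tables), and the label `ok` is an assumed counterpart to the `not_ok` and `uncertain` labels used elsewhere in this README.

```python
from collections import Counter


def fleiss_kappa(votes_per_item):
    """Fleiss' kappa for items that each received the same number of votes."""
    n_items = len(votes_per_item)
    n_raters = len(votes_per_item[0])
    if any(len(votes) != n_raters for votes in votes_per_item):
        raise ValueError("every item needs the same number of votes")

    category_totals = Counter()
    observed_sum = 0.0
    for votes in votes_per_item:
        counts = Counter(votes)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = observed_sum / n_items
    total_votes = n_items * n_raters
    p_expected = sum((t / total_votes) ** 2 for t in category_totals.values())
    if p_expected == 1.0:
        # All votes fall in one category; kappa is undefined, report 1.0 here.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical votes for four triple-coded tables.
votes = [
    ["ok", "ok", "ok"],
    ["ok", "ok", "not_ok"],
    ["not_ok", "not_ok", "not_ok"],
    ["ok", "uncertain", "not_ok"],
]
kappa = fleiss_kappa(votes)
```

Exact agreement and pairwise agreement fall out of the same per-item counts, so a real implementation would likely compute all three in one pass.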
61157

62158
## Data Sources
63159

64160
Defaults:
65161

66162
- OCR Markdown root: `DeepSeekOCR_Ardian_pruned_1k/`
67163
- Raw image root: `/data/workspace/charles/pdf_ocr_deepseek/DeepSeekOCR_Ardian_raw_3kdocs/`
164+
- Default reusable manifest path: `annotation_OCR/manifests/tables_5000.json`
68165

69-
Each queued item maps one `.mmd` page split to the raw PNG with the same zero-based page index, for example page index `12` maps to `pages/page_0012.png`. The manifest records mapping warnings such as missing raw images or page-count mismatches.
166+
Each queued table item maps back to the raw PNG page with the same zero-based
167+
page index, for example page index `12` maps to `pages/page_0012.png`. Table
168+
items carry the `_det.mmd` bounding box used by the UI to center the preview.
169+
The manifest records mapping warnings such as missing raw images or page-count
170+
mismatches.
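
The page-index mapping can be sketched as below. The four-digit zero padding is inferred from the single `page_0012.png` example, and `report_raw_dir` is a hypothetical per-report directory under the raw image root; the real indexer may resolve paths differently.

```python
from pathlib import Path


def raw_page_png(report_raw_dir: Path, page_index: int) -> Path:
    """Return the raw PNG path for a zero-based page index (12 -> pages/page_0012.png)."""
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    return report_raw_dir / "pages" / f"page_{page_index:04d}.png"


# "some_report" is a placeholder directory name.
png = raw_page_png(Path("some_report"), 12)
```

Checking `png.exists()` at indexing time is one way a missing-raw-image warning could be produced.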
70171

71172
## Queue Modes
72173

73-
- `table-candidates`: default. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
74-
- `all`: queues every page.
75-
- `sample`: seeded random sample across all discovered pages. Use `--sample-size` and `--seed`.
174+
- `tables`: default. Queues table-level items from `*_det.mmd`. Use `--sample-size` for deterministic random sampling.
175+
- `table-candidates`: legacy page-level mode. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
176+
- `all`: legacy page-level mode that queues every page.
177+
- `sample`: legacy seeded random sample across all discovered pages.
76178

77179
Indexer smoke check:
78180

79181
```bash
80182
uv run python annotation_OCR/ocr_index.py \
81183
--ocr-root DeepSeekOCR_Ardian_pruned_1k \
82184
--raw-root /data/workspace/charles/pdf_ocr_deepseek/DeepSeekOCR_Ardian_raw_3kdocs \
83-
--queue-mode table-candidates \
185+
--queue-mode tables \
186+
--sample-size 20 \
84187
--limit-reports 2 \
85-
--limit 20 \
86188
--check
87189
```
88190

@@ -93,7 +195,8 @@ uv run python annotation_OCR/ocr_index.py \
93195
- `u`: mark Uncertain, save, advance
94196
- `j` / right arrow: next page
95197
- `k` / left arrow: previous page
96-
- `+`, `-`, `0`: zoom controls
198+
- `+`, `-`, `0`: zoom / reset
199+
- `f`: refocus on the detected table
97200
- `?`: shortcut dialog
98201

99202
Shortcuts are disabled while typing in notes or editing form controls.
@@ -103,10 +206,10 @@ Shortcuts are disabled while typing in notes or editing form controls.
103206
Each session writes to `annotation_OCR/sessions/{session_id}/`:
104207

105208
- `metadata.json`: session name, annotator, configuration, counts, timestamps.
106-
- `manifest.json`: queued pages and mapping diagnostics.
209+
- `manifest.json`: queued items and mapping diagnostics.
107210
- `annotations.jsonl`: append-only event log, one saved annotation per line.
108211
- `current_annotations.json`: latest annotation per item, written atomically.
109-
- `summary.csv`: one row per queued page, including unreviewed pages.
212+
- `summary.csv`: one row per queued item, including unreviewed items.
110213
- `summary.md`: status-count overview.
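
The relationship between the append-only log and the atomically written snapshot can be illustrated with a short sketch. It assumes each event carries an `item_id` and a `status` field (both names are hypothetical) and uses the common write-to-temp-then-rename pattern; the server's actual implementation may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def latest_per_item(jsonl_path: Path) -> dict:
    """Replay the event log; later events overwrite earlier ones for the same item."""
    latest = {}
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                event = json.loads(line)
                latest[event["item_id"]] = event  # "item_id" is an assumed field name
    return latest


def write_json_atomically(target: Path, payload: dict) -> None:
    """Write to a temp file in the target directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    session_dir = Path(tmp)
    log = session_dir / "annotations.jsonl"
    log.write_text(
        '{"item_id": "t1", "status": "uncertain"}\n'
        '{"item_id": "t2", "status": "ok"}\n'
        '{"item_id": "t1", "status": "not_ok"}\n',
        encoding="utf-8",
    )
    snapshot_path = session_dir / "current_annotations.json"
    write_json_atomically(snapshot_path, latest_per_item(log))
    reloaded = json.loads(snapshot_path.read_text(encoding="utf-8"))
```

Because the log is never rewritten, the snapshot can always be rebuilt from it if the snapshot file is lost.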
111214

112215
Regenerate summaries:
@@ -125,12 +228,15 @@ Primary fields:
125228

126229
Identity fields include `industry_slug`, `report_name`, `exchange`, `ticker`, `year`, `page_index`, `page_number`, `mmd_path`, `raw_png_path`, and `page_text_sha256`.
127230

231+
For table sessions, summary rows also include `item_kind`, `table_index`,
232+
`table_row_count`, `table_col_count`, `det_mmd_path`, and `focus_bbox`.
233+
128234
## Downstream Joins
129235

130-
For page-level filtering, join annotation summaries on:
236+
For table-level filtering, join annotation summaries on:
131237

132238
```text
133-
exchange, ticker, year, page_index
239+
exchange, ticker, year, page_index, table_index
134240
```
135241

136242
For report-level benchmark filtering, aggregate page labels to:
@@ -139,4 +245,6 @@ For report-level benchmark filtering, aggregate page labels to:
139245
exchange, ticker, year
140246
```
141247

142-
A conservative report-level rule is to exclude a report when any reviewed table-candidate page is `not_ok`, or when the share of `uncertain` pages exceeds a threshold chosen for the benchmark run.
248+
A conservative report-level rule is to exclude a report when any reviewed table
249+
item is `not_ok`, or when the share of `uncertain` table items exceeds a
250+
threshold chosen for the benchmark run.

annotation_OCR/manifests/README.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
# Table Manifests
2+
3+
Place reusable sampled table manifests here.
4+
5+
Recommended default:
6+
7+
```bash
8+
uv run python annotation_OCR/ocr_index.py \
9+
--queue-mode tables \
10+
--sample-size 5000 \
11+
--seed 42 \
12+
--output annotation_OCR/manifests/tables_5000.json
13+
```
14+
15+
When `tables_5000.json` exists, `annotation_OCR/server.py` will use it by default for new sessions.
16+
17+
## Study Session Bundles
18+
19+
For paper annotation rounds, also build the headcount-specific session bundles:
20+
21+
```bash
22+
uv run python annotation_OCR/study_sessions.py \
23+
--source-manifest annotation_OCR/manifests/tables_5000.json \
24+
--output-dir annotation_OCR/manifests \
25+
--annotators 14 15 16 \
26+
--seed 42
27+
```
28+
29+
This creates:
30+
31+
- `study_sessions_14.json`
32+
- `study_sessions_15.json`
33+
- `study_sessions_16.json`
34+
35+
Use the bundle matching the final annotator count when starting the server:
36+
37+
```bash
38+
uv run python annotation_OCR/server.py \
39+
--study-bundle annotation_OCR/manifests/study_sessions_15.json
40+
```
41+
42+
Why the 14-annotator bundle differs:
43+
44+
- `1500 unique + 300 triple-coded` requires `2100` total annotations.
45+
- That fits 15 or 16 annotators while keeping each session in the `120–140` range.
46+
- For 14 annotators, the bundle uses `220` agreement tables instead, for `1940` total annotations and per-session targets of `138–139`.
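
The arithmetic behind these bullets can be checked with a short sketch. It assumes the 1500 unique tables include the agreement tables, so each triple-coded table adds two extra votes (1500 + 2 × 300 = 2100, 1500 + 2 × 220 = 1940), and that the load is split as evenly as possible; `study_sessions.py` may allocate items differently.

```python
def session_quotas(unique_tables, agreement_tables, annotators, votes_per_agreement_table=3):
    """Return the total annotation count and an even per-session split."""
    extra_votes = (votes_per_agreement_table - 1) * agreement_tables
    total = unique_tables + extra_votes
    base, remainder = divmod(total, annotators)
    # `remainder` sessions take one extra item so that the quotas sum to `total`.
    quotas = [base + 1] * remainder + [base] * (annotators - remainder)
    return total, quotas


for annotators, agreement in [(14, 300), (14, 220), (15, 300), (16, 300)]:
    total, quotas = session_quotas(1500, agreement, annotators)
    in_range = all(120 <= q <= 140 for q in quotas)
    print(annotators, agreement, total, min(quotas), max(quotas), in_range)
```

With 300 agreement tables, 14 annotators would each need 150 items, which is why that bundle drops to 220.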
