- Browser interface for comparing raw OCR page images with the corresponding Markdown page extracted by DeepSeekOCR. The app stores page-level annotations under `annotation_OCR/sessions/` so quality labels can later be joined to LLM benchmark outputs.
+ Browser interface for reviewing OCR table extraction quality. The app now
+ defaults to table-level items extracted from `*_det.mmd`, shows the isolated
+ HTML table in the extracted-content pane, and auto-centers the raw page image
+ on the detected table region while still allowing manual zoom-out for more
+ context.
+
+ Annotations are stored under `annotation_OCR/sessions/` so quality labels can
+ later be joined to downstream benchmark outputs.

## Run

### Headless mode (recommended for multi-user)

Start the server with no session arguments — annotators create/resume sessions
- from the browser landing page:
+ from the browser landing page. If `annotation_OCR/manifests/tables_5000.json`
+ exists, the server uses it automatically for fast session creation. Otherwise
+ it falls back to building a sampled table queue directly from the OCR corpus.
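
The manifest-or-fallback choice described above could look roughly like the sketch below. It is an illustration only: `pick_queue_source` and its return values are hypothetical names, not taken from `server.py`. Only the manifest path comes from the text.

```python
from pathlib import Path
from typing import Optional, Tuple

# Path quoted in the README; the function below is a hypothetical illustration.
DEFAULT_MANIFEST = Path("annotation_OCR/manifests/tables_5000.json")


def pick_queue_source(repo_root: Path) -> Tuple[str, Optional[Path]]:
    """Return ("manifest", path) when the precomputed manifest exists,
    otherwise ("sampled", None) to signal that a sampled table queue
    should be built from the OCR corpus."""
    manifest = repo_root / DEFAULT_MANIFEST
    if manifest.is_file():
        return ("manifest", manifest)
    return ("sampled", None)


if __name__ == "__main__":
    # Run from the repository root to see which branch would be taken.
    print(pick_queue_source(Path(".")))
```

Checking for the file once at startup keeps session creation fast when the manifest is present, while a fresh checkout without it can still run.
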
```bash
uv run python annotation_OCR/server.py --host 0.0.0.0 --port 5050
@@ -25,7 +34,8 @@ From the repository root:
uv run python annotation_OCR/server.py \
  --session-name "table QA smoke" \
  --annotator "your-name" \
- --queue-mode table-candidates \
+ --queue-mode tables \
+ --sample-size 100 \
  --host 127.0.0.1 \
  --port 5050
```
@@ -36,13 +46,35 @@ For a small smoke run:
uv run python annotation_OCR/server.py \
  --session-name smoke \
  --annotator test \
- --queue-mode table-candidates \
+ --queue-mode tables \
+ --sample-size 20 \
  --limit-reports 2 \
- --limit 20 \
  --host 127.0.0.1 \
  --port 5050
```

+ To force the server to use an explicit precomputed manifest:

- Each queued item maps one `.mmd` page split to the raw PNG with the same zero-based page index, for example page index `12` maps to `pages/page_0012.png`. The manifest records mapping warnings such as missing raw images or page-count mismatches.
+ Each queued table item maps back to the raw PNG page with the same zero-based
+ page index, for example page index `12` maps to `pages/page_0012.png`. Table
+ items carry the `_det.mmd` bounding box used by the UI to center the preview.
+ The manifest records mapping warnings such as missing raw images or page-count
+ mismatches.

## Queue Modes

- - `table-candidates`: default. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
- - `all`: queues every page.
- - `sample`: seeded random sample across all discovered pages. Use `--sample-size` and `--seed`.
+ - `tables`: default. Queues table-level items from `*_det.mmd`. Use `--sample-size` for deterministic random sampling.
+ - `table-candidates`: legacy page-level mode. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
+ - `all`: legacy page-level mode that queues every page.
+ - `sample`: legacy seeded random sample across all discovered pages.
- - `manifest.json`: queued pages and mapping diagnostics.
+ - `manifest.json`: queued items and mapping diagnostics.
  - `annotations.jsonl`: append-only event log, one saved annotation per line.
  - `current_annotations.json`: latest annotation per item, written atomically.
- - `summary.csv`: one row per queued page, including unreviewed pages.
+ - `summary.csv`: one row per queued item, including unreviewed items.
  - `summary.md`: status-count overview.

Regenerate summaries:
@@ -125,12 +228,15 @@ Primary fields:

Identity fields include `industry_slug`, `report_name`, `exchange`, `ticker`, `year`, `page_index`, `page_number`, `mmd_path`, `raw_png_path`, and `page_text_sha256`.

+ For table sessions, summary rows also include `item_kind`, `table_index`,
+ `table_row_count`, `table_col_count`, `det_mmd_path`, and `focus_bbox`.
+
## Downstream Joins

- For page-level filtering, join annotation summaries on:
+ For table-level filtering, join annotation summaries on:

```text
- exchange, ticker, year, page_index
+ exchange, ticker, year, page_index, table_index
```
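
As a rough illustration of the table-level join, the sketch below attaches annotation labels to benchmark rows using only the standard library. The `status` column name, the `score` field, and the sample rows are assumptions made up for the example. Only the five join keys and the `not_ok` / `uncertain` label values come from the text.

```python
import csv
import io

# Join keys as listed above for table-level filtering.
JOIN_KEYS = ("exchange", "ticker", "year", "page_index", "table_index")


def key_of(row):
    """Build the join key as a tuple of strings so CSV and non-CSV rows match."""
    return tuple(str(row[k]) for k in JOIN_KEYS)


def join_labels(benchmark_rows, summary_rows, label_field="status"):
    """Left-join benchmark rows to annotation rows; unmatched rows get None."""
    labels = {key_of(r): r[label_field] for r in summary_rows}
    return [{**row, label_field: labels.get(key_of(row))} for row in benchmark_rows]


# Made-up rows standing in for summary.csv (hypothetical `status` column).
summary_csv = """exchange,ticker,year,page_index,table_index,status
NYSE,ABC,2022,12,0,not_ok
NYSE,ABC,2022,12,1,uncertain
"""
summary_rows = list(csv.DictReader(io.StringIO(summary_csv)))

# Made-up benchmark outputs; the second row has no matching annotation.
benchmark_rows = [
    {"exchange": "NYSE", "ticker": "ABC", "year": 2022,
     "page_index": 12, "table_index": 0, "score": 0.4},
    {"exchange": "NYSE", "ticker": "ABC", "year": 2022,
     "page_index": 13, "table_index": 0, "score": 0.9},
]

joined = join_labels(benchmark_rows, summary_rows)
```

Keeping the join as a left join means benchmark rows without a reviewed table keep a `None` label instead of being dropped, so unreviewed items stay visible downstream.
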
For report-level benchmark filtering, aggregate page labels to:
@@ -139,4 +245,6 @@ For report-level benchmark filtering, aggregate page labels to:
exchange, ticker, year
```

- A conservative report-level rule is to exclude a report when any reviewed table-candidate page is `not_ok`, or when the share of `uncertain` pages exceeds a threshold chosen for the benchmark run.
+ A conservative report-level rule is to exclude a report when any reviewed table
+ item is `not_ok`, or when the share of `uncertain` table items exceeds a
+ threshold chosen for the benchmark run.