3 changes: 3 additions & 0 deletions .gitignore
@@ -31,6 +31,9 @@ doc_text_processing/CEO_word_extraction/cleaning_extractions/cleaned/
KPI_analysis/cache/
KPI_analysis/output/

# OCR annotation artifacts
annotation_OCR/sessions/

# VSCode settings
.vscode/settings.json

250 changes: 250 additions & 0 deletions annotation_OCR/README.md
@@ -0,0 +1,250 @@
# OCR Annotation Interface

Browser interface for reviewing OCR table extraction quality. The app now
defaults to table-level items extracted from `*_det.mmd`, shows the isolated
HTML table in the extracted-content pane, and auto-centers the raw page image
on the detected table region while still allowing manual zoom-out for more
context.

Annotations are stored under `annotation_OCR/sessions/` so quality labels can
later be joined to downstream benchmark outputs.

## Run

### Headless mode (recommended for multi-user)

Start the server with no session arguments; annotators create or resume sessions
from the browser landing page. If `annotation_OCR/manifests/tables_5000.json`
exists, the server uses it automatically for fast session creation. Otherwise
it falls back to building a sampled table queue directly from the OCR corpus.

```bash
uv run python annotation_OCR/server.py --host 0.0.0.0 --port 5050
```

Then open `http://HOST:5050`. The landing page lets each user enter their name,
create a new session, or resume an existing one. Annotators need no CLI or
Python knowledge.

### Pre-created session (single-user / scripted)

From the repository root:

```bash
uv run python annotation_OCR/server.py \
  --session-name "table QA smoke" \
  --annotator "your-name" \
  --queue-mode tables \
  --sample-size 100 \
  --host 127.0.0.1 \
  --port 5050
```

For a small smoke run:

```bash
uv run python annotation_OCR/server.py \
  --session-name smoke \
  --annotator test \
  --queue-mode tables \
  --sample-size 20 \
  --limit-reports 2 \
  --host 127.0.0.1 \
  --port 5050
```

To force the server to use an explicit precomputed manifest:

```bash
uv run python annotation_OCR/server.py \
  --manifest-path annotation_OCR/manifests/tables_5000.json \
  --host 127.0.0.1 \
  --port 5050
```

To use precomputed study-session bundles for a paper annotation round:

```bash
uv run python annotation_OCR/server.py \
  --study-bundle annotation_OCR/manifests/study_sessions_15.json \
  --host 127.0.0.1 \
  --port 5050
```

Each new session created from the landing page then receives the next fixed
session queue from that bundle, so the progress bar tracks a real per-annotator
target rather than the whole table pool.

Resume an existing session:

```bash
uv run python annotation_OCR/server.py --session-id SESSION_ID --host 127.0.0.1 --port 5050
```

SSH port forwarding from a laptop:

```bash
ssh -L 5050:127.0.0.1:5050 USER@SERVER
```

Then open `http://127.0.0.1:5050` locally.

For table sessions, the extracted-content pane shows only the isolated table and
the raw-image pane auto-refocuses on the detected bounding box. Use `Refocus`
or press `F` to jump back to the table after manual exploration.

## Precompute A Reusable 5,000-Table Manifest

Build the reusable subset once offline:

```bash
mkdir -p annotation_OCR/manifests

uv run python annotation_OCR/ocr_index.py \
  --queue-mode tables \
  --sample-size 5000 \
  --seed 42 \
  --output annotation_OCR/manifests/tables_5000.json
```

That manifest can then be reused by the server so new annotation sessions do
not need to rescan the OCR corpus.

## Build Study Session Bundles

For hybrid annotation rounds, build one bundle for each possible annotator
count. The generated bundles already keep each session inside the target range
of 120 to 140 items:

```bash
uv run python annotation_OCR/study_sessions.py \
  --source-manifest annotation_OCR/manifests/tables_5000.json \
  --output-dir annotation_OCR/manifests \
  --annotators 14 15 16 \
  --seed 42
```

This writes:

- `annotation_OCR/manifests/study_sessions_14.json`
- `annotation_OCR/manifests/study_sessions_15.json`
- `annotation_OCR/manifests/study_sessions_16.json`

The 15- and 16-annotator bundles use 1500 unique tables, of which 300 are
triple-coded agreement tables (2100 annotations in total). The 14-annotator
bundle lowers the agreement subset to 220 so that every session quota stays
within the 120 to 140 target range.

## Compute Agreement After Annotation

After the study round, compute overlap agreement plus accept/reject ratios with:

```bash
uv run python annotation_OCR/study_agreement.py \
  --study-bundle annotation_OCR/manifests/study_sessions_15.json
```

By default this writes analysis artifacts under:

- `annotation_OCR/sessions/study_analysis/study_sessions_15/summary.md`
- `annotation_OCR/sessions/study_analysis/study_sessions_15/summary.json`
- `annotation_OCR/sessions/study_analysis/study_sessions_15/session_metrics.csv`
- `annotation_OCR/sessions/study_analysis/study_sessions_15/item_metrics.csv`

The script auto-discovers sessions created from that bundle via their stored
`study_bundle_path` and `study_slot`. It reports exact agreement, pairwise
agreement, Fleiss' kappa, and accept/reject ratios both at the raw vote level
and at the final table-decision level.
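
For readers unfamiliar with the statistic, here is a minimal sketch of Fleiss' kappa over triple-coded items. It is not taken from `study_agreement.py`; the function name `fleiss_kappa` and the list-of-votes input shape are assumptions made for illustration, and the script may handle missing votes or ties differently.

```python
from collections import Counter


def fleiss_kappa(votes_per_item: list[list[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of votes.

    Hypothetical helper: `votes_per_item` holds one list of labels per item,
    for example ["ok", "ok", "not_ok"] for a triple-coded table.
    """
    if not votes_per_item:
        raise ValueError("need at least one item")
    n_items = len(votes_per_item)
    n_raters = len(votes_per_item[0])
    if n_raters < 2 or any(len(v) != n_raters for v in votes_per_item):
        raise ValueError("every item needs the same number (>= 2) of votes")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for votes in votes_per_item:
        counts = Counter(votes)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total_votes = n_items * n_raters
    # Chance agreement from the overall label proportions.
    expected = sum((t / total_votes) ** 2 for t in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


votes = [["ok", "ok", "not_ok"], ["ok", "not_ok", "not_ok"]]
print(round(fleiss_kappa(votes), 3))  # -0.333
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.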

## Data Sources

Defaults:

- OCR Markdown root: `DeepSeekOCR_Ardian_pruned_1k/`
- Raw image root: `/data/workspace/charles/pdf_ocr_deepseek/DeepSeekOCR_Ardian_raw_3kdocs/`
- Default reusable manifest path: `annotation_OCR/manifests/tables_5000.json`

Each queued table item maps back to the raw PNG page with the same zero-based
page index; for example, page index `12` maps to `pages/page_0012.png`. Table
items carry the `_det.mmd` bounding box used by the UI to center the preview.
The manifest records mapping warnings such as missing raw images or page-count
mismatches.
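
The page-index mapping can be sketched as follows. The helper name `raw_page_path` and the `report_dir` argument are hypothetical; only the `pages/page_0012.png` pattern comes from the example above, and the four-digit zero padding is inferred from it.

```python
from pathlib import Path


def raw_page_path(report_dir: Path, page_index: int) -> Path:
    """Return the raw PNG path for a zero-based page index (hypothetical helper)."""
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    # Four-digit zero padding is assumed from the `page_0012.png` example.
    return report_dir / "pages" / f"page_{page_index:04d}.png"


print(raw_page_path(Path("some_report"), 12).as_posix())
# some_report/pages/page_0012.png
```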

## Queue Modes

- `tables`: default. Queues table-level items from `*_det.mmd`. Use `--sample-size` for deterministic random sampling.
- `table-candidates`: legacy page-level mode. Keeps pages with table-like signals, dense numeric rows, financial statement headings, or KPI aliases.
- `all`: legacy page-level mode that queues every page.
- `sample`: legacy seeded random sample across all discovered pages.
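
As an illustration of what "deterministic random sampling" means here, the sketch below draws the same subset for the same seed regardless of discovery order. It is an assumption about the general technique, not a copy of `ocr_index.py`; `sample_items` and the string item identifiers are hypothetical.

```python
import random


def sample_items(items: list[str], sample_size: int, seed: int = 42) -> list[str]:
    """Draw a reproducible sample (hypothetical helper)."""
    # Sort first so the draw depends only on the seed, not on discovery order.
    ordered = sorted(items)
    if sample_size >= len(ordered):
        return ordered
    # A private Random instance avoids touching the global random state.
    return random.Random(seed).sample(ordered, sample_size)


first = sample_items(["t3", "t1", "t4", "t2"], 2, seed=42)
second = sample_items(["t2", "t4", "t1", "t3"], 2, seed=42)
print(first == second)  # True
```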

Indexer smoke check:

```bash
uv run python annotation_OCR/ocr_index.py \
  --ocr-root DeepSeekOCR_Ardian_pruned_1k \
  --raw-root /data/workspace/charles/pdf_ocr_deepseek/DeepSeekOCR_Ardian_raw_3kdocs \
  --queue-mode tables \
  --sample-size 20 \
  --limit-reports 2 \
  --check
```

## Keyboard

- `a`: mark Yes, save, advance
- `r`: mark No, save, advance
- `u`: mark Uncertain, save, advance
- `j` / right arrow: next page
- `k` / left arrow: previous page
- `+`, `-`, `0`: zoom / reset
- `f`: refocus on the detected table
- `?`: shortcut dialog

Shortcuts are disabled while typing in notes or editing form controls.

## Outputs

Each session writes to `annotation_OCR/sessions/{session_id}/`:

- `metadata.json`: session name, annotator, configuration, counts, timestamps.
- `manifest.json`: queued items and mapping diagnostics.
- `annotations.jsonl`: append-only event log, one saved annotation per line.
- `current_annotations.json`: latest annotation per item, written atomically.
- `summary.csv`: one row per queued item, including unreviewed items.
- `summary.md`: status-count overview.
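
The relationship between the append-only log and the latest-per-item file can be sketched as a simple fold. This is a sketch under assumptions: the `item_id` key is hypothetical, since the record layout of `annotations.jsonl` is not documented here; only `overall_status` and its values come from the schema in this README.

```python
import json


def latest_per_item(jsonl_lines, key="item_id"):
    """Fold an append-only event log into the most recent record per item.

    `key` is a hypothetical identifier field; substitute the real one.
    """
    latest = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        latest[event[key]] = event  # later lines overwrite earlier ones
    return latest


log = [
    '{"item_id": "t1", "overall_status": "uncertain"}',
    '{"item_id": "t2", "overall_status": "ok"}',
    '{"item_id": "t1", "overall_status": "not_ok"}',
]
current = latest_per_item(log)
print(current["t1"]["overall_status"])  # not_ok
```

Keeping the full log makes it possible to audit how a label changed over time, while the folded view is what downstream summaries would normally consume.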

Regenerate summaries:

```bash
uv run python annotation_OCR/summarize.py --session-id SESSION_ID
uv run python annotation_OCR/summarize.py --all
```

## Annotation Schema

Primary fields:

- `overall_status`: `ok`, `not_ok`, `uncertain`, or `unreviewed`
- `notes`: optional free text

Identity fields include `industry_slug`, `report_name`, `exchange`, `ticker`, `year`, `page_index`, `page_number`, `mmd_path`, `raw_png_path`, and `page_text_sha256`.

For table sessions, summary rows also include `item_kind`, `table_index`,
`table_row_count`, `table_col_count`, `det_mmd_path`, and `focus_bbox`.

## Downstream Joins

For table-level filtering, join annotation summaries on:

```text
exchange, ticker, year, page_index, table_index
```

For report-level benchmark filtering, aggregate the item-level labels to:

```text
exchange, ticker, year
```

A conservative report-level rule is to exclude a report when any reviewed table
item is `not_ok`, or when the share of `uncertain` table items exceeds a
threshold chosen for the benchmark run.
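
A minimal sketch of that rule, assuming summary rows have been loaded as dictionaries with the `exchange`, `ticker`, `year`, and `overall_status` fields listed above. The function name, the `0.2` default threshold, and the choice to compute the `uncertain` share over reviewed items only are illustrative assumptions; the tickers in the example are made up.

```python
from collections import defaultdict


def excluded_reports(rows, uncertain_threshold=0.2):
    """Return (exchange, ticker, year) keys to exclude under the conservative rule."""
    by_report = defaultdict(list)
    for row in rows:
        if row["overall_status"] == "unreviewed":
            continue  # assumption: unreviewed items do not count either way
        key = (row["exchange"], row["ticker"], row["year"])
        by_report[key].append(row["overall_status"])

    excluded = set()
    for key, statuses in by_report.items():
        uncertain_share = statuses.count("uncertain") / len(statuses)
        if "not_ok" in statuses or uncertain_share > uncertain_threshold:
            excluded.add(key)
    return excluded


rows = [
    {"exchange": "NYSE", "ticker": "AAA", "year": 2020, "overall_status": "ok"},
    {"exchange": "NYSE", "ticker": "AAA", "year": 2020, "overall_status": "not_ok"},
    {"exchange": "NYSE", "ticker": "BBB", "year": 2021, "overall_status": "ok"},
    {"exchange": "NYSE", "ticker": "BBB", "year": 2021, "overall_status": "unreviewed"},
]
print(excluded_reports(rows))  # {('NYSE', 'AAA', 2020)}
```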
1 change: 1 addition & 0 deletions annotation_OCR/__init__.py
@@ -0,0 +1 @@
"""OCR annotation interface package."""
46 changes: 46 additions & 0 deletions annotation_OCR/manifests/README.md
@@ -0,0 +1,46 @@
# Table Manifests

Place reusable sampled table manifests here.

Recommended default:

```bash
uv run python annotation_OCR/ocr_index.py \
  --queue-mode tables \
  --sample-size 5000 \
  --seed 42 \
  --output annotation_OCR/manifests/tables_5000.json
```

When `tables_5000.json` exists, `annotation_OCR/server.py` will use it by default for new sessions.

## Study Session Bundles

For paper annotation rounds, also build the headcount-specific session bundles:

```bash
uv run python annotation_OCR/study_sessions.py \
  --source-manifest annotation_OCR/manifests/tables_5000.json \
  --output-dir annotation_OCR/manifests \
  --annotators 14 15 16 \
  --seed 42
```

This creates:

- `study_sessions_14.json`
- `study_sessions_15.json`
- `study_sessions_16.json`

Use the bundle matching the final annotator count when starting the server:

```bash
uv run python annotation_OCR/server.py \
  --study-bundle annotation_OCR/manifests/study_sessions_15.json
```

Why the 14-annotator bundle differs:

- `1500 unique + 300 triple-coded` requires `2100` total annotations.
- That fits 15 or 16 annotators while keeping each session in the `120–140` range.
- For 14 annotators, the bundle uses `220` agreement tables instead, for `1940` total annotations and per-session targets of `138–139`.
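
The arithmetic behind these numbers can be checked with a short sketch. It assumes the agreement tables are a subset of the 1500 unique tables (which is what makes the total 2100) and that annotations are split as evenly as possible across sessions; the function names are hypothetical and not part of `study_sessions.py`.

```python
def total_annotations(unique_tables: int, agreement_tables: int, coders: int = 3) -> int:
    # Each agreement table is already counted once among the unique tables,
    # so it contributes (coders - 1) extra annotations.
    return unique_tables + agreement_tables * (coders - 1)


def session_quotas(total: int, annotators: int) -> tuple[int, int]:
    """Smallest and largest per-session quota under an even split."""
    base, extra = divmod(total, annotators)
    return base, base + (1 if extra else 0)


for annotators, agreement in [(14, 300), (14, 220), (15, 300), (16, 300)]:
    total = total_annotations(1500, agreement)
    print(annotators, agreement, total, session_quotas(total, annotators))
# 14 300 2100 (150, 150)   <- exceeds the 120-140 range
# 14 220 1940 (138, 139)
# 15 300 2100 (140, 140)
# 16 300 2100 (131, 132)
```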