Multi-dot filenames merge into the same result key; table-recognition debug images use wrong document name #496

@YizukiAme

Bug Description

Two related filename/path handling bugs in Surya's output pipeline cause distinct input files to collide in results.

Problem 1: Multi-dot filenames merge into the same result key

get_name_from_path() in surya/input/load.py uses os.path.basename(path).split(".")[0], which truncates everything after the first dot. Distinct inputs like paper.v1.pdf and paper.v2.pdf both become paper.

In folder mode, the CLI groups predictions in results.json by this truncated name, so pages from different source files are merged under one key and their page numbering continues as if they came from the same document. In single-file mode, CLILoader uses the same truncation for result_path, so different dotted filenames write into the same output directory.

# surya/input/load.py
def get_name_from_path(path):
    return os.path.basename(path).split(".")[0]
# "paper.v1.pdf" → "paper"
# "paper.v2.pdf" → "paper"  ← collision!
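
To make the folder-mode symptom concrete, here is a minimal sketch of how grouping by the truncated name merges pages. `group_pages` and the page dictionaries are hypothetical stand-ins for the CLI's grouping step, not Surya's actual code; only `get_name_from_path` mirrors the function quoted above.

```python
import os
from collections import defaultdict


def get_name_from_path(path):
    # Same truncation as the quoted function: keeps only the text before the first dot.
    return os.path.basename(path).split(".")[0]


def group_pages(page_sources):
    # Hypothetical grouping step: one list of pages per name,
    # with page numbers continuing within each key.
    grouped = defaultdict(list)
    for path in page_sources:
        name = get_name_from_path(path)
        grouped[name].append({"source": path, "page": len(grouped[name]) + 1})
    return dict(grouped)


# Two pages from paper.v1.pdf and one page from paper.v2.pdf
pages = ["docs/paper.v1.pdf", "docs/paper.v1.pdf", "docs/paper.v2.pdf"]
results = group_pages(pages)

print(sorted(results))                        # ['paper']
print([p["page"] for p in results["paper"]])  # [1, 2, 3]
```

The single page from paper.v2.pdf ends up as "page 3" of the merged key, which matches the continued page numbering described above.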

Problem 2: Table-recognition debug images saved under wrong document name

In surya/scripts/table_recognition.py, the debug-image save paths use name instead of orig_name. name is a loop variable left over from for i, name in enumerate(loader.names), so after the loop it always holds the last document name; orig_name is the correct per-prediction source name. As a result, multi-file runs label every debug image with the last document's name.

# surya/scripts/table_recognition.py
for i, name in enumerate(loader.names):
    ...
# After loop, `name` = last document name

orig_name = loader.names[img_idx]  # correct name
...
rc_image.save(os.path.join(loader.result_path, f"{name}_page..."))  # uses wrong `name`!
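
The leaked-variable behavior can be reproduced in isolation. This is a sketch under assumptions: the document names, the `(img_idx, page)` pairs, and the `_page{page}.png` filename pattern are all hypothetical, since the real pattern is elided in the snippet above.

```python
names = ["report_a", "report_b", "report_c"]  # hypothetical document names

for i, name in enumerate(names):
    pass  # per-document work would happen here

# Python keeps the loop variable alive after the loop ends,
# so `name` is now the last element of `names`.

# Hypothetical (image index, page number) pairs standing in for table predictions
predictions = [(0, 0), (1, 0), (2, 0)]

# Buggy: every filename uses the leaked `name`
buggy = [f"{name}_page{page}.png" for img_idx, page in predictions]

# Fixed: look up the source name per prediction, as orig_name does
fixed = [f"{names[img_idx]}_page{page}.png" for img_idx, page in predictions]

print(buggy)  # ['report_c_page0.png', 'report_c_page0.png', 'report_c_page0.png']
print(fixed)  # ['report_a_page0.png', 'report_b_page0.png', 'report_c_page0.png']
```

In this sketch the buggy filenames are also identical, so each save would overwrite the previous one.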

Steps to Reproduce

Problem 1:

# Given a folder containing two files with dotted names
surya_ocr ./docs/  # ./docs/ contains paper.v1.pdf and paper.v2.pdf
# results.json merges both under the key "paper"

Problem 2:

surya_table_rec ./docs/ --images
# All debug images saved under last document's name regardless of source

Expected Behavior

  1. Each input file should have a unique result key derived from its full stem (e.g. paper.v1, paper.v2)
  2. Debug images should be named after their actual source document

Suggested Fix

  1. Use os.path.splitext(os.path.basename(path))[0] or Path(path).stem instead of .split(".")[0], so that only the final extension is removed
  2. Use orig_name consistently in the debug image save paths
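
A minimal sketch of the first suggestion, assuming `get_name_from_path` keeps its current signature and callers only need a string key. This is one possible shape for the fix, not a patch taken from the repository.

```python
import os
from pathlib import Path


def get_name_from_path(path):
    # Path.stem drops only the final suffix, so inner dots survive.
    return Path(path).stem


def get_name_from_path_splitext(path):
    # Equivalent variant using os.path, for code that avoids pathlib.
    return os.path.splitext(os.path.basename(path))[0]


print(get_name_from_path("docs/paper.v1.pdf"))           # paper.v1
print(get_name_from_path("docs/paper.v2.pdf"))           # paper.v2
print(get_name_from_path_splitext("docs/paper.v1.pdf"))  # paper.v1
```

One limitation of either variant: two files that differ only in their final extension (for example paper.v1.pdf and paper.v1.png in the same folder) would still map to the same key, so a complete fix may need to account for that case separately.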
