Bug Description
Two related filename/path handling bugs in Surya's output pipeline cause outputs from distinct input files to collide or be mislabeled.
Problem 1: Multi-dot filenames merge into the same result key
get_name_from_path() in surya/input/load.py uses os.path.basename(path).split(".")[0], which truncates everything after the first dot. Distinct inputs like paper.v1.pdf and paper.v2.pdf both become paper.
In folder mode, the CLI groups predictions in results.json by this truncated name, so pages from different source files are merged under one key and their page numbering continues as if they came from the same document. In single-file mode, CLILoader uses the same truncation for result_path, so different dotted filenames write into the same output directory.
# surya/input/load.py
def get_name_from_path(path):
    return os.path.basename(path).split(".")[0]

# "paper.v1.pdf" → "paper"
# "paper.v2.pdf" → "paper" ← collision!
Problem 2: Table-recognition debug images saved under wrong document name
In surya/scripts/table_recognition.py, debug images use name (a leaked loop variable retaining the last document name after for i, name in enumerate(loader.names)) instead of orig_name (the correct per-prediction source name). Multi-file runs mislabel all debug outputs under the last document.
# surya/scripts/table_recognition.py
for i, name in enumerate(loader.names):
    ...
# After the loop, `name` still holds the last document name
orig_name = loader.names[img_idx] # correct name
...
rc_image.save(os.path.join(loader.result_path, f"{name}_page...")) # uses wrong `name`!
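The leak can be reproduced outside Surya. The sketch below is a standalone illustration, not the script's actual code: `names` and `prediction_sources` are hypothetical stand-ins for `loader.names` and the per-prediction `img_idx` values.

```python
# Hypothetical stand-ins for loader.names and the per-prediction source index.
names = ["invoice", "report", "summary"]
prediction_sources = [0, 1, 1, 2]  # index into `names` for each prediction

# Python keeps the loop variable bound after the loop finishes,
# so `name` is left pointing at the last entry.
for i, name in enumerate(names):
    pass

buggy_labels = []
correct_labels = []
for img_idx in prediction_sources:
    orig_name = names[img_idx]        # per-prediction source name
    buggy_labels.append(name)         # stale variable from the earlier loop
    correct_labels.append(orig_name)

print(buggy_labels)    # every entry is the last document name
print(correct_labels)  # each entry matches its source document
```

With a single input document the two lists are identical, which is probably why the problem only shows up in multi-file runs.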
Steps to Reproduce
Problem 1:
# Run OCR on a folder containing paper.v1.pdf and paper.v2.pdf
surya_ocr ./docs/
# results.json merges both under key "paper"
Problem 2:
surya_table_rec ./docs/ --images
# All debug images saved under last document's name regardless of source
Expected Behavior
- Each input file should have a unique result key derived from its full stem (e.g. paper.v1, paper.v2)
- Debug images should be named after their actual source document
Suggested Fix
- Use os.path.splitext() or Path(path).stem instead of .split(".")[0]
- Use orig_name consistently in the debug image save paths
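A minimal sketch of the first suggestion, comparing the current truncation with a stem-based version. The function names `get_name_from_path_current` and `get_name_from_path_fixed` are hypothetical and used only for this comparison; the actual patch may look different.

```python
import os
from pathlib import Path


def get_name_from_path_current(path):
    # Behavior described in this report: cut at the first dot.
    return os.path.basename(path).split(".")[0]


def get_name_from_path_fixed(path):
    # Possible fix: strip only the final extension.
    return os.path.splitext(os.path.basename(path))[0]


paths = ["docs/paper.v1.pdf", "docs/paper.v2.pdf"]

current = [get_name_from_path_current(p) for p in paths]
fixed = [get_name_from_path_fixed(p) for p in paths]
stems = [Path(p).stem for p in paths]  # pathlib equivalent

print(current)  # ['paper', 'paper']
print(fixed)    # ['paper.v1', 'paper.v2']
print(stems)    # ['paper.v1', 'paper.v2']
```

Note that a stem-based key still collides for files that differ only in their final extension (for example paper.pdf and paper.png), so this addresses the multi-dot case specifically.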