[PR] prevent/reduce layout oscillation #584
stk-code wants to merge 3 commits into funstory-ai:main
1 issue found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="babeldoc/format/pdf/document_il/midend/paragraph_finder.py">
<violation number="1" location="babeldoc/format/pdf/document_il/midend/paragraph_finder.py:438">
P2: Layout-change hysteresis lacks an upper bound on rightward x-jump, allowing distant same-line regions to be merged into one paragraph.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Please upload the original PDF used to reproduce this issue. Also, the merge cycle for this PR may be long, or it might be treated as a proof of concept. This is because testing the logic involved here is quite difficult, and I need to do my best to rule out any potential regressions.
I have been focusing my energy on refactoring the PDF parser and PDF generation recently; once this is complete, I will address the improvements for the internal translation phase.
The test PDF is linked in the PR above. FYI: I've also been working on hyphenation and justification of hyphenated texts. It looks quite good. But I am writing with the PR.
PR Title
[PR]
Motivation and Context
paragraph_finder groups characters into paragraphs partly based on their detected layout ID.
Problem: In some PDFs, adjacent characters from the same visual text line are assigned to two different layout regions in an alternating pattern, for example layout 10, then 16, then 10, then 16.
Because _group_characters_into_paragraphs() treated a layout change as a hard paragraph boundary, this caused the text to be split into many tiny paragraphs, sometimes even one character per paragraph.
This means: Alternating layout assignments for neighboring characters caused false paragraph breaks.
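A minimal sketch of the failure mode described above, not the project's actual code: the character/layout pairs and the helper name `split_on_every_layout_change` are made up for illustration. It only shows how treating every layout-ID change as a hard boundary fragments a line whose IDs alternate.

```python
from itertools import groupby

# Hypothetical input: (character, detected layout ID) pairs from ONE visual line.
# The alternating 10/16 pattern mirrors the example in the description.
chars = [("H", 10), ("e", 16), ("l", 10), ("l", 16), ("o", 10)]


def split_on_every_layout_change(pairs):
    """Start a new paragraph whenever the layout ID differs from the previous one."""
    return [
        "".join(ch for ch, _ in run)
        for _, run in groupby(pairs, key=lambda pair: pair[1])
    ]


paragraphs = split_on_every_layout_change(chars)
print(paragraphs)  # ['H', 'e', 'l', 'l', 'o']
```

With stable layout IDs the same helper would return a single paragraph; the fragmentation comes purely from the oscillating IDs.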
Summary of Changes
The fix introduces a kind of hysteresis into paragraph grouping. The current paragraph state is now more “sticky”:
The new logic checks whether:
PR Type
Breaking Changes
There should be none, unless there exist documents in which single characters alternate between two layout regions and each character belongs to its own paragraph.
Contributor Checklist
[x] I have fully read and understood the CONTRIBUTING.md guide.
I have performed a self-review of my own code.
My changes follow the project's code style and guidelines (
[n/a] I have linked the related issue(s) in the description above (if applicable)
[n/a] I have updated relevant documentation (if applicable)
I have added necessary tests that prove my fix is effective or that my feature works (if applicable)
test.pdf
All new and existing tests passed locally with my changes
My changes generate no new warnings or errors
I understand that due to limited maintainer resources, only small PRs are accepted. Suggestions with proof-of-concept patches are appreciated, and my patch may be rewritten if necessary.
Testing Instructions
Before the fix, translating the attached PDF produced broken paragraphs; after the fix, the output PDF is correct.

(translated from English to en-GB :-) )
Summary by cubic
Reduce false paragraph breaks by making layout-change splits “sticky” in PDF paragraph grouping. Adds a symmetric horizontal bound and stronger same-line checks to stop tiny, one-character paragraphs from alternating layout IDs.
_should_split_on_layout_change ignores harmless layout flips when the characters share the same XObject, have strong vertical overlap (via calculate_y_iou_for_boxes), both layouts are text layouts, and horizontal movement stays within a symmetric tolerance based on character width.
_group_characters_into_paragraphs only splits on structural changes; it still starts new paragraphs for XObject changes and bullets.
Written for commit 5210d08. Summary will update on new commits.
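The conditions in the summary can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the code from the PR: the `Char` dataclass, `TEXT_LAYOUTS`, `y_iou`, `should_split`, `group`, and the thresholds `min_y_iou` and `width_factor` are all hypothetical stand-ins, and the real `_should_split_on_layout_change` and `calculate_y_iou_for_boxes` may differ in signature and detail.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical character record: text, bounding box, layout ID, XObject ID."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    layout_id: int
    xobj_id: int = 0


TEXT_LAYOUTS = {10, 16}  # assumption: layout IDs the detector labels as text


def y_iou(a: Char, b: Char) -> float:
    """Vertical intersection-over-union of two boxes (stand-in helper)."""
    inter = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter <= 0:
        return 0.0
    union = max(a.y1, b.y1) - min(a.y0, b.y0)
    return inter / union if union > 0 else 0.0


def should_split(prev: Char, cur: Char, min_y_iou=0.7, width_factor=3.0) -> bool:
    """Return True only when a layout change looks like a real boundary."""
    if prev.layout_id == cur.layout_id:
        return False  # no layout change, nothing to decide in this sketch
    if prev.xobj_id != cur.xobj_id:
        return True  # XObject changes remain hard boundaries
    if prev.layout_id not in TEXT_LAYOUTS or cur.layout_id not in TEXT_LAYOUTS:
        return True  # only text-to-text flips are treated as harmless
    if y_iou(prev, cur) < min_y_iou:
        return True  # weak vertical overlap: probably a different line
    # Symmetric bound: reject large jumps to the left AND to the right.
    tolerance = width_factor * max(prev.x1 - prev.x0, cur.x1 - cur.x0)
    return abs(cur.x0 - prev.x1) > tolerance


def group(chars):
    """Group characters, keeping the current paragraph unless should_split fires."""
    paragraphs = []
    for ch in chars:
        if paragraphs and not should_split(paragraphs[-1][-1], ch):
            paragraphs[-1].append(ch)
        else:
            paragraphs.append([ch])
    return ["".join(c.text for c in p) for p in paragraphs]


line = [
    Char("a", 0, 0, 5, 10, layout_id=10),
    Char("b", 5, 0, 10, 10, layout_id=16),
    Char("c", 10, 0, 15, 10, layout_id=10),
    Char("z", 200, 0, 205, 10, layout_id=16),  # distant region on the same line
]
print(group(line))  # ['abc', 'z']
```

Using `abs(...)` for the horizontal check is what makes the tolerance symmetric: a far rightward jump on the same line, as in the last character above, still starts a new paragraph instead of being merged.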