Context
The PDF extraction logic does not understand PagerDuty's two-column layout. This leads to artifacts like the report author's name being spliced into parts of the document:
Proposal
I wonder whether we should just author our own reports, and provide some useful constructs for things like timelines, etc., rather than relying on PagerDuty output — I don't find the PagerDuty UI all that useful.
However, in the shorter term, we can probably parse this in several passes:
- Find the right-hand column with extract_text_lines and locate the "Owner of review process" marker to determine x0.
- Find the bottom of the multi-column layout with extract_text_lines and locate "Timeline" to determine y1.
- Parse (0, 0, x0, y1) as a single column (walk over extract_text_lines scoped to this bounding box, and regex-match headings to delineate sections).
- Parse the timeline as-is.
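A rough sketch of the first three passes, to make the idea concrete. It assumes each line is a dict with "text", "x0", "x1", "top" and "bottom" keys (the shape pdfplumber's extract_text_lines returns, if that is the library in use), and that each marker shows up as its own line entry, which may not hold if the extractor merges both columns into one line. The helper names and the heading pattern are hypothetical, not existing code.

```python
import re


def find_line(lines, marker):
    """Return the first line whose text starts with `marker`, or None."""
    for line in lines:
        if line["text"].strip().startswith(marker):
            return line
    return None


def left_column_bbox(lines):
    """Passes 1 and 2: derive (0, 0, x0, y1) from the two markers.

    x0 is the left edge of the "Owner of review process" marker and
    y1 is the top edge of the "Timeline" line.
    """
    owner = find_line(lines, "Owner of review process")
    timeline = find_line(lines, "Timeline")
    if owner is None or timeline is None:
        return None  # layout not recognised; caller decides on a fallback
    return (0, 0, owner["x0"], timeline["top"])


def lines_in_bbox(lines, bbox):
    """Keep only lines that lie fully inside bbox = (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return [
        line
        for line in lines
        if line["x0"] >= x0
        and line["x1"] <= x1
        and line["top"] >= top
        and line["bottom"] <= bottom
    ]


def split_sections(lines, heading_re):
    """Pass 3: group line texts under the most recent matching heading."""
    sections = {}
    current = None
    for line in lines:
        text = line["text"].strip()
        if heading_re.match(text):
            current = text
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


# Hypothetical heading names; the real set depends on the report template.
HEADING_RE = re.compile(r"^(Overview|Resolution)$")
```

With pdfplumber, the scoping step could presumably be done with page.crop(bbox).extract_text_lines() instead of lines_in_bbox, which would also avoid the merged-column problem because the crop happens before lines are assembled.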
Updates and actions
No response