Fix PDF scraper #18

@agoose77

Context

The PDF extraction logic does not understand PagerDuty's two-column layout. This leads to artifacts like the report author's name being spliced into parts of the document:

Image

Proposal

I wonder whether we should just author our own reports, and provide some useful constructs for things like timelines, rather than relying on PagerDuty output. I don't find the PagerDuty UI all that useful.

However, in the shorter term, we can probably parse this in different passes:

  1. Find the right-hand column with extract_text_lines and locate the "Owner of review process" marker to determine x0.
  2. Find the bottom of the multi-column layout with extract_text_lines and locate the "Timeline" marker to determine y1.
  3. Parse (0, 0, x0, y1) as a single column (walk over extract_text_lines scoped to this bounding box, and regex-match headings to delineate sections).
  4. Parse the timeline as-is.
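
The first three passes could look roughly like the sketch below. It assumes extract_text_lines is pdfplumber's Page.extract_text_lines, which returns dicts with "text", "x0" and "top" keys, and that the marker sits on its own text line rather than being merged with left-column text. The heading names in HEADING_RE and the function names are hypothetical placeholders, not taken from a real report or from the existing scraper.

```python
import re

# Marker strings come from the proposal above; the heading names are hypothetical.
OWNER_MARKER = "Owner of review process"
TIMELINE_MARKER = "Timeline"
HEADING_RE = re.compile(r"^(Overview|What Happened|Resolution)$")


def find_split(lines):
    """Return (x0, y1): left edge of the right column, top of the timeline."""
    # Assumes both markers exist; next() raises StopIteration otherwise.
    x0 = next(l["x0"] for l in lines if l["text"].startswith(OWNER_MARKER))
    y1 = next(l["top"] for l in lines if l["text"].strip() == TIMELINE_MARKER)
    return x0, y1


def split_sections(lines, heading_re=HEADING_RE):
    """Group line texts under the most recent heading, top to bottom."""
    sections = {}
    current = None
    for line in sorted(lines, key=lambda l: l["top"]):
        text = line["text"].strip()
        if heading_re.match(text):
            current = text
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


def parse_left_column(page):
    """Steps 1-3 for a pdfplumber Page (assumed type)."""
    x0, y1 = find_split(page.extract_text_lines())
    # pdfplumber bounding boxes are (x0, top, x1, bottom).
    left = page.crop((0, 0, x0, y1))
    return split_sections(left.extract_text_lines())
```

Keeping find_split and split_sections free of any PDF objects means they can be exercised with hand-written line dicts, without needing a sample report.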

Updates and actions

No response
