Skip to content

Feature/processing bug fixes#128

Merged
PaperMtn merged 23 commits into
developfrom
feature/processing-bug-fixes
Apr 30, 2026
Merged

Feature/processing bug fixes#128
PaperMtn merged 23 commits into
developfrom
feature/processing-bug-fixes

Conversation

@PaperMtn

Copy link
Copy Markdown
Owner

No description provided.

PaperMtn and others added 23 commits April 28, 2026 14:22
`_multipro_message_worker` and `_multipro_file_worker` previously had no
try/except. Any exception in a worker (rate-limit retry exhausted,
malformed payload, AttributeError on a missing field) killed the child
process, and `join()` returned cleanly, so the parent silently lost
results for that signature/query with zero log output.

Workers now append failures to a shared `errors` Manager list, and
`find_messages` / `find_files` log an `ERROR` line per captured failure
after `join()`. Direct unit-test invocations without an `errors` list
still get the original exception re-raised.

Fixes #108
`_multipro_message_worker` chained `message.get('channel').get('id')`,
where the first `.get` was defensive but the second was not. Search
results for DMs, tombstoned messages, and certain bot/system messages
can come back with `channel` missing or set to None, raising
`AttributeError` and (since #108) silently capturing a worker failure
that should never have happened.

Replaced the chain with `(message.get('channel') or {}).get('id')` so a
missing channel falls through to `c = None` like the existing else
branch already handles.

Fixes #109
`_multipro_file_worker` called `.lower()` directly on `file_dict.get('name')`
and `file_dict.get('filetype')`. Slack files in deleted, redacted, or
tombstoned states return null for these fields, raising AttributeError
which (since #108) silently captures a worker failure that should never
have happened.

Both fields are now coerced once per file via `(file_dict.get(...) or '').lower()`
and reused across the file_types and no-file_types branches. Files with
null/missing name or filetype simply don't match the query rather than
taking down the worker.

Fixes #110
`OUTPUT_LOGGER` was set to '' at the top of `main()` and only replaced
with a real logger after argparse and the `metadata.metadata(...)` call.
Any exception raised in between fell into `except Exception` and called
`''.log('CRITICAL', e)`, raising AttributeError and masking the real
error.

Initialise the placeholder to `None` instead and guard both the
`TimeoutError` and `Exception` handlers, falling back to
`traceback.print_exc()` when the logger has not been constructed yet.

Fixes #111
Each call previously instantiated three separate `multiprocessing.Manager()`
objects — one per shared list (results, potential matches, errors) — each
forking its own subprocess and relying on GC / atexit for cleanup.

Wrap the body of `find_messages` and `find_files` in
`with multiprocessing.Manager() as manager:` and call `manager.list()`
three times. The manager subprocess is now started once per call and
shut down deterministically when the block exits.

The four existing find_messages/find_files tests are reworked through a
small `_mock_manager_lists` helper that wires the context-manager
protocol; two new tests assert the manager is entered and exited
exactly once per call.

Fixes #116
…uest

`SlackClient._make_request` previously slept a fixed 90s on 429 and then
called `self.session.request(...)` directly to retry, bypassing its own
error handling. A second 429 in a row therefore propagated as a raw
HTTPError, and (since #108) was logged at the worker boundary while the
query result was lost.

The 429 branch now reads `Retry-After` from the response (falling back
to 90s when absent or unparseable), then recurses into `_make_request`
so a second 429 receives the same handling. A `_retries` counter caps
the recursion at 5 and raises an informative HTTPError on exhaustion.

Adds `_parse_retry_after` for header parsing plus targeted tests for
the four 429 paths (single retry, default fallback, second 429 still
retried, retries exhausted).

Fixes #115
…ckClient

Replace the per-search-string `multiprocessing.Process` spawn in
`find_messages` / `find_files` with a `multiprocessing.Pool`:

- Pool size is capped at `_DEFAULT_POOL_SIZE` (8) and downsizes for
  signatures with fewer search strings, so signature size no longer
  dictates how many subprocesses run concurrently.
- Each worker constructs its own `SlackClient` once at pool startup via
  `_init_worker_client`. The parent's connection pool / retry config is
  reproduced in each worker process rather than silently dropped during
  pickling.
- The parent's pre-extracted `session_token` and `cookie_dict` are
  threaded through to the worker via new keyword-only kwargs on
  `SlackClient.__init__`, so workers do not redo the
  cookie -> session-token HTTP roundtrip on startup.

Existing worker functions (`_multipro_message_worker`,
`_multipro_file_worker`) keep their signatures; small `_pool_run_*`
wrappers pull the worker-local SlackClient from a module global and
forward the call.

Test changes: existing find_messages/find_files tests now mock Pool
instead of Process. New tests verify pool initargs match the parent
client, pool size is capped at the cap and shrinks for short query
lists, and `_init_worker_client` does not call `_get_session_token`.
SlackClient gains tests for the new kwargs (session_token honoured,
cookie_dict copied defensively, token still wins over session_token).

Fixes #113
Fixes #114
Both worker functions previously hit the Slack API once per matching
message/file for `users.info`, and once per match for
`conversations.info` (message worker only) — the same user or channel
was re-fetched every time it appeared. A worker with 1000 matches that
referenced 5 unique users + 3 unique channels made 2000 API round
trips.

Add a per-worker `dict` cache keyed by user ID (and a separate one for
channel ID in the message worker). Each unique ID is resolved once and
reused for every subsequent match. The cache is local to a single
worker call — no cross-worker IPC, no Manager dict, no shared state.

The file worker now uses a small `_resolve_file_user` helper that
preserves the pre-existing `is_dataclass` defensive check (the
asymmetry between the file_types and no-file_types branches is left
for a separate change).

New tests assert that repeated user/channel IDs across matches result
in a single `get_user_info` / `get_conversation_info` call per ID,
covering both workers and both file-worker branches.

Fixes #117
…gh to None

Both functions are annotated `-> List[Dict]` but had no explicit return on
the no-match and outer-exception paths, so they implicitly returned None.
Existing callers happened to use `if results:` guards so the silent contract
violation never bit, but anything iterating or measuring the result directly
would have crashed.

Add an explicit `return []` on each path and assert the empty-list contract
in unit tests.

Fixes #118
…atching file_type

The per-file_type loop appended a result_dict for every sig.file_types
entry that substring-matched a file's filetype, so a signature with
overlapping file_types (e.g. ['zip', 'ip']) produced N duplicate entries
per matching file. The post-hoc watchman_id dedup collapsed these later,
but the dataclass construction, Manager IPC traffic, and worker-side
memory were all wasted.

Replace the loop with a single `any(...)` check so each matching file
produces exactly one result.

Fixes #119
`_resolve_file_user` short-circuited to None when `file_dict['user']` was
a dataclass instance, but `file_dict` always comes straight from
search.files where 'user' is a raw user-ID string — the dataclass branch
was unreachable. The asymmetry between the file_types and no-file_types
branches that flagged this had already been resolved by routing both
through `_resolve_file_user`; this just drops the leftover guard and the
now-unused `dataclasses` import.

Fixes #120
Missed in the previous commit. Same fix, no code changes.

Refs #120
`_multipro_message_worker` was recompiling each sig.patterns entry on
every iteration of the message loop, and on the matching path it called
Pattern.search a second time to extract .group(0). For N messages and
P patterns this came out to N * P compiles and up to 2 * N * P searches.

Compile each pattern once before the message loop, capture the Match
object once, and reuse it for both the truthy check and the group-0
extraction. Now P compiles and N * P searches.

Fixes #121
…tions

Each entry now carries a `**[Critical]**` / `**[High]**` / `**[Medium]**`
prefix matching the severity in PROCESSING_ISSUES.md, and entries inside
Changed and Fixed are ordered by severity. Format remains Keep a
Changelog-compatible.
…l "None"

`_multipro_message_worker` was doing `str(message.get('text'))`, which
turns missing or null text into the literal string 'None'. A permissive
signature regex could then false-positive match attachment-only
messages, bot/system events, or any other payload Slack returns without
a `text` field — producing a result row with no real text behind it.

Treat empty/missing text as a skip: read message['text'] without
str-coercion and `continue` when it's falsy.

Fixes #122
The unauthenticated workspace-login scrape ran without exception
handling. In probe mode this was caught by the outer try/except (which
exits with CRITICAL anyway), but in the authenticated flow a transient
network blip or a Slack-side HTML/data-props schema change aborted the
entire scan after auth had already succeeded.

Wrap the request and parse logic in try/except for
`requests.RequestException`, `json.JSONDecodeError`, and
`AttributeError`. Return None (matching the existing "no info
available" contract) and emit a WARNING via an optional logger
argument. Both call sites in __init__.py now pass OUTPUT_LOGGER.

Fixes #123
The function was annotated `Dict[str, List[str]] | None` but the actual
payload mixes lists, booleans, strings and None — every value type
except a list shows up. Type checkers (and any caller naive enough to
trust the annotation, e.g. iterating values as lists) were wrong.

Widen to `Dict[str, Any] | None` and add a runtime assertion in the
existing happy-path test that locks in the heterogeneous contract.

Fixes #124
Slack interprets the `after:` operator in the workspace timezone. The
previous `time.strftime('%Y-%m-%d', time.localtime(now - X))` formatted
in the host's TZ, so any host on a different TZ to the workspace got
an off-by-one-day window at the day boundary — typical for CI, cloud
runners, and distributed teams.

Switch to `time.gmtime` and factor the four inline branches in main()
into a small `_compute_timeframe(tm, now)` helper so the TZ behaviour
is unit-testable. Tests cover all four timeframes and trap any future
reintroduction of `time.localtime`.

Fixes #125
`f'{workspace_information.url}canvas/{channel.id}'` produced a
malformed URL like `https://example.slack.comcanvas/C123` whenever
team.info returned a workspace URL without a trailing slash.

Extract a small `_build_canvas_url(workspace_url, channel_id)` helper
that rstrip('/')s the base before concatenation, so both
`https://example.slack.com` and `https://example.slack.com/` resolve
to a single `/` between host and path.

Fixes #126
`cursor = r.get('response_metadata').get('next_cursor')` was set inside
the inner `for value in r.get(scope)` loop. On a normal page that
rewrote `cursor` to the same value N times — wasted but harmless. On a
page where `r.get(scope)` came back empty the inner loop never ran,
so `cursor` was never updated and the outer `while` re-requested the
same cursor on the next iteration: a potential infinite loop on
paginated responses that surface an empty `scope` list with a
non-empty `next_cursor`.

Move the assignment after the inner loop so the cursor advances once
per page regardless of how many items were returned. Add a regression
test that pages through three responses (one with an empty `members`
list) and asserts the cursor seen on each request matches the one
returned by the previous page.
Two small clarity fixes in `__init__.py`:

- `OUTPUT_LOGGER` was annotated `JSONLogger` but `init_logger` returns
  `StdoutLogger | JSONLogger` depending on the `--output` flag. Widen
  the annotation to match the actual runtime type.

- `if everything or not pii and not secrets:` parses correctly under
  Python precedence (`not` > `and` > `or`) but is easy to misread.
  Add explicit parentheses: `if everything or (not pii and not secrets):`.
  No behaviour change.
Five additions from a coverage audit of test_watchman_processor.py:

- Drop dead `@patch('slack_watchman.watchman_processor.SlackClient')`
  decorators from `test_get_users` and `test_get_channels`. Both tests
  passed their own `MagicMock()` to the function under test and never
  used the patched symbol — the decorator was a no-op.

- `test_find_auth_information` now asserts the full set of nine output
  keys, so a silent rename/drop/addition in the `output = {...}` literal
  can't slip past unnoticed.

- `test_find_files_pool_size_matches_short_query_list`: mirrors the
  existing `find_messages` test, locking in that a sub-cap query list
  spawns one worker per query rather than always padding to the cap.

- `test_find_files_pool_initargs_carry_parent_credentials`: mirrors the
  existing `find_messages` test, locking in that the pool initializer
  receives the parent client's `(token, url, session_token, cookie_dict)`
  so file workers don't redo the cookie -> session-token roundtrip.

- `test_multipro_message_worker_falls_back_to_username_when_user_id_missing`:
  parametrised across `user=None`, `user=''`, and `user` key absent.
  Asserts `users.info` is not called and the raw `message['username']`
  is attached to the outgoing message — the bot/system-message fallback
  branch was previously uncovered.
@PaperMtn PaperMtn merged commit 0d38390 into develop Apr 30, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant