Feature/processing bug fixes by PaperMtn · Pull Request #128 · PaperMtn/slack-watchman

PaperMtn · 2026-04-30T06:53:56Z

No description provided.

Release/4.5.0

`_multipro_message_worker` and `_multipro_file_worker` previously had no try/except. Any exception in a worker (rate-limit retry exhausted, malformed payload, AttributeError on a missing field) killed the child process, and `join()` returned cleanly, so the parent silently lost results for that signature/query with zero log output. Workers now append failures to a shared `errors` Manager list, and `find_messages` / `find_files` log an `ERROR` line per captured failure after `join()`. Direct unit-test invocations without an `errors` list still get the original exception re-raised. Fixes #108

`_multipro_message_worker` chained `message.get('channel').get('id')`, where the first `.get` was defensive but the second was not. Search results for DMs, tombstoned messages, and certain bot/system messages can come back with `channel` missing or set to None, raising `AttributeError` and (since #108) silently capturing a worker failure that should never have happened. Replaced the chain with `(message.get('channel') or {}).get('id')` so a missing channel falls through to `c = None` like the existing else branch already handles. Fixes #109

`_multipro_file_worker` called `.lower()` directly on `file_dict.get('name')` and `file_dict.get('filetype')`. Slack files in deleted, redacted, or tombstoned states return null for these fields, raising AttributeError which (since #108) silently captures a worker failure that should never have happened. Both fields are now coerced once per file via `(file_dict.get(...) or '').lower()` and reused across the file_types and no-file_types branches. Files with null/missing name or filetype simply don't match the query rather than taking down the worker. Fixes #110

`OUTPUT_LOGGER` was set to '' at the top of `main()` and only replaced with a real logger after argparse and the `metadata.metadata(...)` call. Any exception raised in between fell into `except Exception` and called `''.log('CRITICAL', e)`, raising AttributeError and masking the real error. Initialise the placeholder to `None` instead and guard both the `TimeoutError` and `Exception` handlers, falling back to `traceback.print_exc()` when the logger has not been constructed yet. Fixes #111

Each call previously instantiated three separate `multiprocessing.Manager()` objects — one per shared list (results, potential matches, errors) — each forking its own subprocess and relying on GC / atexit for cleanup. Wrap the body of `find_messages` and `find_files` in `with multiprocessing.Manager() as manager:` and call `manager.list()` three times. The manager subprocess is now started once per call and shut down deterministically when the block exits. The four existing find_messages/find_files tests are reworked through a small `_mock_manager_lists` helper that wires the context-manager protocol; two new tests assert the manager is entered and exited exactly once per call. Fixes #116

…uest `SlackClient._make_request` previously slept a fixed 90s on 429 and then called `self.session.request(...)` directly to retry, bypassing its own error handling. A second 429 in a row therefore propagated as a raw HTTPError, and (since #108) was logged at the worker boundary while the query result was lost. The 429 branch now reads `Retry-After` from the response (falling back to 90s when absent or unparseable), then recurses into `_make_request` so a second 429 receives the same handling. A `_retries` counter caps the recursion at 5 and raises an informative HTTPError on exhaustion. Adds `_parse_retry_after` for header parsing plus targeted tests for the four 429 paths (single retry, default fallback, second 429 still retried, retries exhausted). Fixes #115

…ckClient Replace the per-search-string `multiprocessing.Process` spawn in `find_messages` / `find_files` with a `multiprocessing.Pool`: - Pool size is capped at `_DEFAULT_POOL_SIZE` (8) and downsizes for signatures with fewer search strings, so signature size no longer dictates how many subprocesses run concurrently. - Each worker constructs its own `SlackClient` once at pool startup via `_init_worker_client`. The parent's connection pool / retry config is reproduced in each worker process rather than silently dropped during pickling. - The parent's pre-extracted `session_token` and `cookie_dict` are threaded through to the worker via new keyword-only kwargs on `SlackClient.__init__`, so workers do not redo the cookie -> session-token HTTP roundtrip on startup. Existing worker functions (`_multipro_message_worker`, `_multipro_file_worker`) keep their signatures; small `_pool_run_*` wrappers pull the worker-local SlackClient from a module global and forward the call. Test changes: existing find_messages/find_files tests now mock Pool instead of Process. New tests verify pool initargs match the parent client, pool size is capped at the cap and shrinks for short query lists, and `_init_worker_client` does not call `_get_session_token`. SlackClient gains tests for the new kwargs (session_token honoured, cookie_dict copied defensively, token still wins over session_token). Fixes #113 Fixes #114

Both worker functions previously hit the Slack API once per matching message/file for `users.info`, and once per match for `conversations.info` (message worker only) — the same user or channel was re-fetched every time it appeared. A worker with 1000 matches that referenced 5 unique users + 3 unique channels made 2000 API round trips. Add a per-worker `dict` cache keyed by user ID (and a separate one for channel ID in the message worker). Each unique ID is resolved once and reused for every subsequent match. The cache is local to a single worker call — no cross-worker IPC, no Manager dict, no shared state. The file worker now uses a small `_resolve_file_user` helper that preserves the pre-existing `is_dataclass` defensive check (the asymmetry between the file_types and no-file_types branches is left for a separate change). New tests assert that repeated user/channel IDs across matches result in a single `get_user_info` / `get_conversation_info` call per ID, covering both workers and both file-worker branches. Fixes #117

…gh to None Both functions are annotated `-> List[Dict]` but had no explicit return on the no-match and outer-exception paths, so they implicitly returned None. Existing callers happened to use `if results:` guards so the silent contract violation never bit, but anything iterating or measuring the result directly would have crashed. Add an explicit `return []` on each path and assert the empty-list contract in unit tests. Fixes #118

…atching file_type The per-file_type loop appended a result_dict for every sig.file_types entry that substring-matched a file's filetype, so a signature with overlapping file_types (e.g. ['zip', 'ip']) produced N duplicate entries per matching file. The post-hoc watchman_id dedup collapsed these later, but the dataclass construction, Manager IPC traffic, and worker-side memory were all wasted. Replace the loop with a single `any(...)` check so each matching file produces exactly one result. Fixes #119

`_resolve_file_user` short-circuited to None when `file_dict['user']` was a dataclass instance, but `file_dict` always comes straight from search.files where 'user' is a raw user-ID string — the dataclass branch was unreachable. The asymmetry between the file_types and no-file_types branches that flagged this had already been resolved by routing both through `_resolve_file_user`; this just drops the leftover guard and the now-unused `dataclasses` import. Fixes #120

Missed in the previous commit. Same fix, no code changes. Refs #120

`_multipro_message_worker` was recompiling each sig.patterns entry on every iteration of the message loop, and on the matching path it called Pattern.search a second time to extract .group(0). For N messages and P patterns this came out to N * P compiles and up to 2 * N * P searches. Compile each pattern once before the message loop, capture the Match object once, and reuse it for both the truthy check and the group-0 extraction. Now P compiles and N * P searches. Fixes #121

…tions Each entry now carries a `**[Critical]**` / `**[High]**` / `**[Medium]**` prefix matching the severity in PROCESSING_ISSUES.md, and entries inside Changed and Fixed are ordered by severity. Format remains Keep a Changelog-compatible.

…l "None" `_multipro_message_worker` was doing `str(message.get('text'))`, which turns missing or null text into the literal string 'None'. A permissive signature regex could then false-positive match attachment-only messages, bot/system events, or any other payload Slack returns without a `text` field — producing a result row with no real text behind it. Treat empty/missing text as a skip: read message['text'] without str-coercion and `continue` when it's falsy. Fixes #122

The unauthenticated workspace-login scrape ran without exception handling. In probe mode this was caught by the outer try/except (which exits with CRITICAL anyway), but in the authenticated flow a transient network blip or a Slack-side HTML/data-props schema change aborted the entire scan after auth had already succeeded. Wrap the request and parse logic in try/except for `requests.RequestException`, `json.JSONDecodeError`, and `AttributeError`. Return None (matching the existing "no info available" contract) and emit a WARNING via an optional logger argument. Both call sites in __init__.py now pass OUTPUT_LOGGER. Fixes #123

The function was annotated `Dict[str, List[str]] | None` but the actual payload mixes lists, booleans, strings and None — every value type except a list shows up. Type checkers (and any caller naive enough to trust the annotation, e.g. iterating values as lists) were wrong. Widen to `Dict[str, Any] | None` and add a runtime assertion in the existing happy-path test that locks in the heterogeneous contract. Fixes #124

Slack interprets the `after:` operator in the workspace timezone. The previous `time.strftime('%Y-%m-%d', time.localtime(now - X))` formatted in the host's TZ, so any host on a different TZ to the workspace got an off-by-one-day window at the day boundary — typical for CI, cloud runners, and distributed teams. Switch to `time.gmtime` and factor the four inline branches in main() into a small `_compute_timeframe(tm, now)` helper so the TZ behaviour is unit-testable. Tests cover all four timeframes and trap any future reintroduction of `time.localtime`. Fixes #125

`f'{workspace_information.url}canvas/{channel.id}'` produced a malformed URL like `https://example.slack.comcanvas/C123` whenever team.info returned a workspace URL without a trailing slash. Extract a small `_build_canvas_url(workspace_url, channel_id)` helper that rstrip('/')s the base before concatenation, so both `https://example.slack.com` and `https://example.slack.com/` resolve to a single `/` between host and path. Fixes #126

`cursor = r.get('response_metadata').get('next_cursor')` was set inside the inner `for value in r.get(scope)` loop. On a normal page that rewrote `cursor` to the same value N times — wasted but harmless. On a page where `r.get(scope)` came back empty the inner loop never ran, so `cursor` was never updated and the outer `while` re-requested the same cursor on the next iteration: a potential infinite loop on paginated responses that surface an empty `scope` list with a non-empty `next_cursor`. Move the assignment after the inner loop so the cursor advances once per page regardless of how many items were returned. Add a regression test that pages through three responses (one with an empty `members` list) and asserts the cursor seen on each request matches the one returned by the previous page.

Two small clarity fixes in `__init__.py`: - `OUTPUT_LOGGER` was annotated `JSONLogger` but `init_logger` returns `StdoutLogger | JSONLogger` depending on the `--output` flag. Widen the annotation to match the actual runtime type. - `if everything or not pii and not secrets:` parses correctly under Python precedence (`not` > `and` > `or`) but is easy to misread. Add explicit parentheses: `if everything or (not pii and not secrets):`. No behaviour change.

Five additions from a coverage audit of test_watchman_processor.py: - Drop dead `@patch('slack_watchman.watchman_processor.SlackClient')` decorators from `test_get_users` and `test_get_channels`. Both tests passed their own `MagicMock()` to the function under test and never used the patched symbol — the decorator was a no-op. - `test_find_auth_information` now asserts the full set of nine output keys, so a silent rename/drop/addition in the `output = {...}` literal can't slip past unnoticed. - `test_find_files_pool_size_matches_short_query_list`: mirrors the existing `find_messages` test, locking in that a sub-cap query list spawns one worker per query rather than always padding to the cap. - `test_find_files_pool_initargs_carry_parent_credentials`: mirrors the existing `find_messages` test, locking in that the pool initializer receives the parent client's `(token, url, session_token, cookie_dict)` so file workers don't redo the cookie -> session-token roundtrip. - `test_multipro_message_worker_falls_back_to_username_when_user_id_missing`: parametrised across `user=None`, `user=''`, and `user` key absent. Asserts `users.info` is not called and the raw `message['username']` is attached to the outgoing message — the bot/system-message fallback branch was previously uncovered.

PaperMtn and others added 23 commits April 28, 2026 14:22

Merge pull request #106 from PaperMtn/release/4.5.0

0a8a4c3

Release/4.5.0

docs: add CHANGELOG entry for #120 (dead is_dataclass guard removal)

3b9fc21

Missed in the previous commit. Same fix, no code changes. Refs #120

PaperMtn merged commit 0d38390 into develop Apr 30, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feature/processing bug fixes#128

Feature/processing bug fixes#128
PaperMtn merged 23 commits into
developfrom
feature/processing-bug-fixes

PaperMtn commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

PaperMtn commented Apr 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant