Skip to content

Release/4.6.0#130

Merged
PaperMtn merged 26 commits into
masterfrom
release/4.6.0
May 4, 2026
Merged

Release/4.6.0#130
PaperMtn merged 26 commits into
masterfrom
release/4.6.0

Conversation

@PaperMtn

@PaperMtn PaperMtn commented May 4, 2026

Copy link
Copy Markdown
Owner

[4.6.0] - 2026-05-04

Changed

  • [High] Message and file workers now cache users.info and conversations.info lookups by ID for the lifetime of the worker, eliminating redundant API calls per match. Fixes
    #117
  • [High] find_messages and find_files now use a bounded multiprocessing.Pool (capped at 8) instead of one Process per search string, preventing memory and file-descriptor usage from scaling with signature
    size. Fixes #113
  • [High] Pool workers now construct their own SlackClient via _init_worker_client instead of unpickling the parent client, preserving connection pool and retry config across workers. Fixes
    #114
  • [High] SlackClient.__init__ accepts new keyword-only session_token and cookie_dict parameters, used by pool workers to skip re-running _get_session_token on startup.
  • [High] find_messages and find_files now share a single multiprocessing.Manager per call, replacing three per-call managers that leaked until GC. Fixes
    #116

Fixed

  • [Critical] Worker exceptions (rate-limit exhaustion, malformed payloads, AttributeError) are now caught and reported via a shared errors list instead of silently killing the child process. Fixes
    #108
  • [Critical] _multipro_message_worker no longer crashes on messages with no channel field (DMs, tombstoned messages, bot events). Fixes #109
  • [Critical] _multipro_file_worker no longer crashes when a file's name or filetype is None (deleted/redacted/tombstoned states). Fixes #110
  • [Critical] main() error handler no longer crashes with AttributeError when an exception is raised before init_logger runs. Falls back to traceback.print_exc() when the logger is uninitialised. Fixes
    #111
  • [High] SlackClient._make_request now honours the Retry-After header on 429s (falling back to 90s), loops back through _make_request for consecutive 429s, and raises after a configurable retry cap. Fixes
    #115
  • [Medium] Canvas URLs are now built via _build_canvas_url(), which normalises the trailing slash on the workspace URL before concatenation. Fixes #126
  • [Medium] after: query timeframes are now computed in UTC (time.gmtime) instead of local time, fixing off-by-one-day windows on CI/cloud hosts. Refactored into _compute_timeframe() helper. Fixes
    #125
  • [Medium] find_auth_information return type widened to Dict[str, Any] | None to reflect the actual mixed-type payload. Fixes #124
  • [Medium] find_auth_information now catches RequestException, JSONDecodeError, and AttributeError, returning None with a WARNING instead of aborting the scan on transient scrape failures. Fixes
    #123
  • [Medium] find_messages and find_files now explicitly return [] on no-match and exception paths instead of implicit None. Fixes #118
  • [Medium] _multipro_file_worker no longer appends a duplicate result per sig.file_types entry; replaced per-type loop with a single any() check. Fixes
    #119
  • [Medium] Removed dead is_dataclass guard and unused dataclasses import from _resolve_file_user. Fixes #120
  • [Medium] _multipro_message_worker now skips messages with no text instead of coercing None to the string 'None', preventing false positives on attachment-only and system messages. Fixes
    #122
  • [Medium] _multipro_message_worker now compiles each pattern once per worker and reuses the Match object, eliminating per-message recompilation and a duplicate search call on the match path. Fixes
    #121
  • [Low] cursor_api_search now updates cursor once per page after items are collected, fixing a potential infinite loop on empty pages with a non-empty next_cursor. Fixes cursor_api_search.
  • [Low] OUTPUT_LOGGER annotation in __init__.py widened to StdoutLogger | JSONLogger.
  • [Low] Added explicit parentheses to if everything or (not pii and not secrets): for readability; no behaviour change.

PaperMtn and others added 26 commits April 28, 2026 15:18
`_multipro_message_worker` and `_multipro_file_worker` previously had no
try/except. Any exception in a worker (rate-limit retry exhausted,
malformed payload, AttributeError on a missing field) killed the child
process, and `join()` returned cleanly, so the parent silently lost
results for that signature/query with zero log output.

Workers now append failures to a shared `errors` Manager list, and
`find_messages` / `find_files` log an `ERROR` line per captured failure
after `join()`. Direct unit-test invocations without an `errors` list
still get the original exception re-raised.

Fixes #108
`_multipro_message_worker` chained `message.get('channel').get('id')`,
where the first `.get` was defensive but the second was not. Search
results for DMs, tombstoned messages, and certain bot/system messages
can come back with `channel` missing or set to None, raising
`AttributeError` and (since #108) silently capturing a worker failure
that should never have happened.

Replaced the chain with `(message.get('channel') or {}).get('id')` so a
missing channel falls through to `c = None` like the existing else
branch already handles.

Fixes #109
`_multipro_file_worker` called `.lower()` directly on `file_dict.get('name')`
and `file_dict.get('filetype')`. Slack files in deleted, redacted, or
tombstoned states return null for these fields, raising AttributeError
which (since #108) silently captures a worker failure that should never
have happened.

Both fields are now coerced once per file via `(file_dict.get(...) or '').lower()`
and reused across the file_types and no-file_types branches. Files with
null/missing name or filetype simply don't match the query rather than
taking down the worker.

Fixes #110
`OUTPUT_LOGGER` was set to '' at the top of `main()` and only replaced
with a real logger after argparse and the `metadata.metadata(...)` call.
Any exception raised in between fell into `except Exception` and called
`''.log('CRITICAL', e)`, raising AttributeError and masking the real
error.

Initialise the placeholder to `None` instead and guard both the
`TimeoutError` and `Exception` handlers, falling back to
`traceback.print_exc()` when the logger has not been constructed yet.

Fixes #111
Each call previously instantiated three separate `multiprocessing.Manager()`
objects — one per shared list (results, potential matches, errors) — each
forking its own subprocess and relying on GC / atexit for cleanup.

Wrap the body of `find_messages` and `find_files` in
`with multiprocessing.Manager() as manager:` and call `manager.list()`
three times. The manager subprocess is now started once per call and
shut down deterministically when the block exits.

The four existing find_messages/find_files tests are reworked through a
small `_mock_manager_lists` helper that wires the context-manager
protocol; two new tests assert the manager is entered and exited
exactly once per call.

Fixes #116
…uest

`SlackClient._make_request` previously slept a fixed 90s on 429 and then
called `self.session.request(...)` directly to retry, bypassing its own
error handling. A second 429 in a row therefore propagated as a raw
HTTPError, and (since #108) was logged at the worker boundary while the
query result was lost.

The 429 branch now reads `Retry-After` from the response (falling back
to 90s when absent or unparseable), then recurses into `_make_request`
so a second 429 receives the same handling. A `_retries` counter caps
the recursion at 5 and raises an informative HTTPError on exhaustion.

Adds `_parse_retry_after` for header parsing plus targeted tests for
the four 429 paths (single retry, default fallback, second 429 still
retried, retries exhausted).

Fixes #115
…ckClient

Replace the per-search-string `multiprocessing.Process` spawn in
`find_messages` / `find_files` with a `multiprocessing.Pool`:

- Pool size is capped at `_DEFAULT_POOL_SIZE` (8) and downsizes for
  signatures with fewer search strings, so signature size no longer
  dictates how many subprocesses run concurrently.
- Each worker constructs its own `SlackClient` once at pool startup via
  `_init_worker_client`. The parent's connection pool / retry config is
  reproduced in each worker process rather than silently dropped during
  pickling.
- The parent's pre-extracted `session_token` and `cookie_dict` are
  threaded through to the worker via new keyword-only kwargs on
  `SlackClient.__init__`, so workers do not redo the
  cookie -> session-token HTTP roundtrip on startup.

Existing worker functions (`_multipro_message_worker`,
`_multipro_file_worker`) keep their signatures; small `_pool_run_*`
wrappers pull the worker-local SlackClient from a module global and
forward the call.

Test changes: existing find_messages/find_files tests now mock Pool
instead of Process. New tests verify pool initargs match the parent
client, pool size is capped at the cap and shrinks for short query
lists, and `_init_worker_client` does not call `_get_session_token`.
SlackClient gains tests for the new kwargs (session_token honoured,
cookie_dict copied defensively, token still wins over session_token).

Fixes #113
Fixes #114
Both worker functions previously hit the Slack API once per matching
message/file for `users.info`, and once per match for
`conversations.info` (message worker only) — the same user or channel
was re-fetched every time it appeared. A worker with 1000 matches that
referenced 5 unique users + 3 unique channels made 2000 API round
trips.

Add a per-worker `dict` cache keyed by user ID (and a separate one for
channel ID in the message worker). Each unique ID is resolved once and
reused for every subsequent match. The cache is local to a single
worker call — no cross-worker IPC, no Manager dict, no shared state.

The file worker now uses a small `_resolve_file_user` helper that
preserves the pre-existing `is_dataclass` defensive check (the
asymmetry between the file_types and no-file_types branches is left
for a separate change).

New tests assert that repeated user/channel IDs across matches result
in a single `get_user_info` / `get_conversation_info` call per ID,
covering both workers and both file-worker branches.

Fixes #117
…gh to None

Both functions are annotated `-> List[Dict]` but had no explicit return on
the no-match and outer-exception paths, so they implicitly returned None.
Existing callers happened to use `if results:` guards so the silent contract
violation never bit, but anything iterating or measuring the result directly
would have crashed.

Add an explicit `return []` on each path and assert the empty-list contract
in unit tests.

Fixes #118
…atching file_type

The per-file_type loop appended a result_dict for every sig.file_types
entry that substring-matched a file's filetype, so a signature with
overlapping file_types (e.g. ['zip', 'ip']) produced N duplicate entries
per matching file. The post-hoc watchman_id dedup collapsed these later,
but the dataclass construction, Manager IPC traffic, and worker-side
memory were all wasted.

Replace the loop with a single `any(...)` check so each matching file
produces exactly one result.

Fixes #119
`_resolve_file_user` short-circuited to None when `file_dict['user']` was
a dataclass instance, but `file_dict` always comes straight from
search.files where 'user' is a raw user-ID string — the dataclass branch
was unreachable. The asymmetry between the file_types and no-file_types
branches that flagged this had already been resolved by routing both
through `_resolve_file_user`; this just drops the leftover guard and the
now-unused `dataclasses` import.

Fixes #120
Missed in the previous commit. Same fix, no code changes.

Refs #120
`_multipro_message_worker` was recompiling each sig.patterns entry on
every iteration of the message loop, and on the matching path it called
Pattern.search a second time to extract .group(0). For N messages and
P patterns this came out to N * P compiles and up to 2 * N * P searches.

Compile each pattern once before the message loop, capture the Match
object once, and reuse it for both the truthy check and the group-0
extraction. Now P compiles and N * P searches.

Fixes #121
…tions

Each entry now carries a `**[Critical]**` / `**[High]**` / `**[Medium]**`
prefix matching the severity in PROCESSING_ISSUES.md, and entries inside
Changed and Fixed are ordered by severity. Format remains Keep a
Changelog-compatible.
…l "None"

`_multipro_message_worker` was doing `str(message.get('text'))`, which
turns missing or null text into the literal string 'None'. A permissive
signature regex could then false-positive match attachment-only
messages, bot/system events, or any other payload Slack returns without
a `text` field — producing a result row with no real text behind it.

Treat empty/missing text as a skip: read message['text'] without
str-coercion and `continue` when it's falsy.

Fixes #122
The unauthenticated workspace-login scrape ran without exception
handling. In probe mode this was caught by the outer try/except (which
exits with CRITICAL anyway), but in the authenticated flow a transient
network blip or a Slack-side HTML/data-props schema change aborted the
entire scan after auth had already succeeded.

Wrap the request and parse logic in try/except for
`requests.RequestException`, `json.JSONDecodeError`, and
`AttributeError`. Return None (matching the existing "no info
available" contract) and emit a WARNING via an optional logger
argument. Both call sites in __init__.py now pass OUTPUT_LOGGER.

Fixes #123
The function was annotated `Dict[str, List[str]] | None` but the actual
payload mixes lists, booleans, strings and None — every value type
except a list shows up. Type checkers (and any caller naive enough to
trust the annotation, e.g. iterating values as lists) were wrong.

Widen to `Dict[str, Any] | None` and add a runtime assertion in the
existing happy-path test that locks in the heterogeneous contract.

Fixes #124
Slack interprets the `after:` operator in the workspace timezone. The
previous `time.strftime('%Y-%m-%d', time.localtime(now - X))` formatted
in the host's TZ, so any host on a different TZ to the workspace got
an off-by-one-day window at the day boundary — typical for CI, cloud
runners, and distributed teams.

Switch to `time.gmtime` and factor the four inline branches in main()
into a small `_compute_timeframe(tm, now)` helper so the TZ behaviour
is unit-testable. Tests cover all four timeframes and trap any future
reintroduction of `time.localtime`.

Fixes #125
`f'{workspace_information.url}canvas/{channel.id}'` produced a
malformed URL like `https://example.slack.comcanvas/C123` whenever
team.info returned a workspace URL without a trailing slash.

Extract a small `_build_canvas_url(workspace_url, channel_id)` helper
that rstrip('/')s the base before concatenation, so both
`https://example.slack.com` and `https://example.slack.com/` resolve
to a single `/` between host and path.

Fixes #126
`cursor = r.get('response_metadata').get('next_cursor')` was set inside
the inner `for value in r.get(scope)` loop. On a normal page that
rewrote `cursor` to the same value N times — wasted but harmless. On a
page where `r.get(scope)` came back empty the inner loop never ran,
so `cursor` was never updated and the outer `while` re-requested the
same cursor on the next iteration: a potential infinite loop on
paginated responses that surface an empty `scope` list with a
non-empty `next_cursor`.

Move the assignment after the inner loop so the cursor advances once
per page regardless of how many items were returned. Add a regression
test that pages through three responses (one with an empty `members`
list) and asserts the cursor seen on each request matches the one
returned by the previous page.
Two small clarity fixes in `__init__.py`:

- `OUTPUT_LOGGER` was annotated `JSONLogger` but `init_logger` returns
  `StdoutLogger | JSONLogger` depending on the `--output` flag. Widen
  the annotation to match the actual runtime type.

- `if everything or not pii and not secrets:` parses correctly under
  Python precedence (`not` > `and` > `or`) but is easy to misread.
  Add explicit parentheses: `if everything or (not pii and not secrets):`.
  No behaviour change.
Five additions from a coverage audit of test_watchman_processor.py:

- Drop dead `@patch('slack_watchman.watchman_processor.SlackClient')`
  decorators from `test_get_users` and `test_get_channels`. Both tests
  passed their own `MagicMock()` to the function under test and never
  used the patched symbol — the decorator was a no-op.

- `test_find_auth_information` now asserts the full set of nine output
  keys, so a silent rename/drop/addition in the `output = {...}` literal
  can't slip past unnoticed.

- `test_find_files_pool_size_matches_short_query_list`: mirrors the
  existing `find_messages` test, locking in that a sub-cap query list
  spawns one worker per query rather than always padding to the cap.

- `test_find_files_pool_initargs_carry_parent_credentials`: mirrors the
  existing `find_messages` test, locking in that the pool initializer
  receives the parent client's `(token, url, session_token, cookie_dict)`
  so file workers don't redo the cookie -> session-token roundtrip.

- `test_multipro_message_worker_falls_back_to_username_when_user_id_missing`:
  parametrised across `user=None`, `user=''`, and `user` key absent.
  Asserts `users.info` is not called and the raw `message['username']`
  is attached to the outgoing message — the bot/system-message fallback
  branch was previously uncovered.
Prepares the 4.6.0 release: version in pyproject.toml updated from
4.5.0 to 4.6.0 and the [Unreleased] CHANGELOG block dated as [4.6.0]
on 2026-05-04.
@PaperMtn PaperMtn merged commit a335125 into master May 4, 2026
12 checks passed
@PaperMtn PaperMtn deleted the release/4.6.0 branch May 4, 2026 05:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment