Release/4.6.0#130
Merged
Merged
Conversation
`_multipro_message_worker` and `_multipro_file_worker` previously had no try/except. Any exception in a worker (rate-limit retry exhausted, malformed payload, AttributeError on a missing field) killed the child process, and `join()` returned cleanly, so the parent silently lost results for that signature/query with zero log output. Workers now append failures to a shared `errors` Manager list, and `find_messages` / `find_files` log an `ERROR` line per captured failure after `join()`. Direct unit-test invocations without an `errors` list still get the original exception re-raised. Fixes #108
`_multipro_message_worker` chained `message.get('channel').get('id')`,
where the first `.get` was defensive but the second was not. Search
results for DMs, tombstoned messages, and certain bot/system messages
can come back with `channel` missing or set to None, raising
`AttributeError` and (since #108) silently capturing a worker failure
that should never have happened.
Replaced the chain with `(message.get('channel') or {}).get('id')` so a
missing channel falls through to `c = None` like the existing else
branch already handles.
Fixes #109
`_multipro_file_worker` called `.lower()` directly on `file_dict.get('name')`
and `file_dict.get('filetype')`. Slack files in deleted, redacted, or
tombstoned states return null for these fields, raising AttributeError
which (since #108) silently captures a worker failure that should never
have happened.
Both fields are now coerced once per file via `(file_dict.get(...) or '').lower()`
and reused across the file_types and no-file_types branches. Files with
null/missing name or filetype simply don't match the query rather than
taking down the worker.
Fixes #110
`OUTPUT_LOGGER` was set to '' at the top of `main()` and only replaced
with a real logger after argparse and the `metadata.metadata(...)` call.
Any exception raised in between fell into `except Exception` and called
`''.log('CRITICAL', e)`, raising AttributeError and masking the real
error.
Initialise the placeholder to `None` instead and guard both the
`TimeoutError` and `Exception` handlers, falling back to
`traceback.print_exc()` when the logger has not been constructed yet.
Fixes #111
chore: version bump
Each call previously instantiated three separate `multiprocessing.Manager()` objects — one per shared list (results, potential matches, errors) — each forking its own subprocess and relying on GC / atexit for cleanup. Wrap the body of `find_messages` and `find_files` in `with multiprocessing.Manager() as manager:` and call `manager.list()` three times. The manager subprocess is now started once per call and shut down deterministically when the block exits. The four existing find_messages/find_files tests are reworked through a small `_mock_manager_lists` helper that wires the context-manager protocol; two new tests assert the manager is entered and exited exactly once per call. Fixes #116
…uest `SlackClient._make_request` previously slept a fixed 90s on 429 and then called `self.session.request(...)` directly to retry, bypassing its own error handling. A second 429 in a row therefore propagated as a raw HTTPError, and (since #108) was logged at the worker boundary while the query result was lost. The 429 branch now reads `Retry-After` from the response (falling back to 90s when absent or unparseable), then recurses into `_make_request` so a second 429 receives the same handling. A `_retries` counter caps the recursion at 5 and raises an informative HTTPError on exhaustion. Adds `_parse_retry_after` for header parsing plus targeted tests for the four 429 paths (single retry, default fallback, second 429 still retried, retries exhausted). Fixes #115
…ckClient Replace the per-search-string `multiprocessing.Process` spawn in `find_messages` / `find_files` with a `multiprocessing.Pool`: - Pool size is capped at `_DEFAULT_POOL_SIZE` (8) and downsizes for signatures with fewer search strings, so signature size no longer dictates how many subprocesses run concurrently. - Each worker constructs its own `SlackClient` once at pool startup via `_init_worker_client`. The parent's connection pool / retry config is reproduced in each worker process rather than silently dropped during pickling. - The parent's pre-extracted `session_token` and `cookie_dict` are threaded through to the worker via new keyword-only kwargs on `SlackClient.__init__`, so workers do not redo the cookie -> session-token HTTP roundtrip on startup. Existing worker functions (`_multipro_message_worker`, `_multipro_file_worker`) keep their signatures; small `_pool_run_*` wrappers pull the worker-local SlackClient from a module global and forward the call. Test changes: existing find_messages/find_files tests now mock Pool instead of Process. New tests verify pool initargs match the parent client, pool size is capped at the cap and shrinks for short query lists, and `_init_worker_client` does not call `_get_session_token`. SlackClient gains tests for the new kwargs (session_token honoured, cookie_dict copied defensively, token still wins over session_token). Fixes #113 Fixes #114
Both worker functions previously hit the Slack API once per matching message/file for `users.info`, and once per match for `conversations.info` (message worker only) — the same user or channel was re-fetched every time it appeared. A worker with 1000 matches that referenced 5 unique users + 3 unique channels made 2000 API round trips. Add a per-worker `dict` cache keyed by user ID (and a separate one for channel ID in the message worker). Each unique ID is resolved once and reused for every subsequent match. The cache is local to a single worker call — no cross-worker IPC, no Manager dict, no shared state. The file worker now uses a small `_resolve_file_user` helper that preserves the pre-existing `is_dataclass` defensive check (the asymmetry between the file_types and no-file_types branches is left for a separate change). New tests assert that repeated user/channel IDs across matches result in a single `get_user_info` / `get_conversation_info` call per ID, covering both workers and both file-worker branches. Fixes #117
…gh to None Both functions are annotated `-> List[Dict]` but had no explicit return on the no-match and outer-exception paths, so they implicitly returned None. Existing callers happened to use `if results:` guards so the silent contract violation never bit, but anything iterating or measuring the result directly would have crashed. Add an explicit `return []` on each path and assert the empty-list contract in unit tests. Fixes #118
…atching file_type The per-file_type loop appended a result_dict for every sig.file_types entry that substring-matched a file's filetype, so a signature with overlapping file_types (e.g. ['zip', 'ip']) produced N duplicate entries per matching file. The post-hoc watchman_id dedup collapsed these later, but the dataclass construction, Manager IPC traffic, and worker-side memory were all wasted. Replace the loop with a single `any(...)` check so each matching file produces exactly one result. Fixes #119
`_resolve_file_user` short-circuited to None when `file_dict['user']` was a dataclass instance, but `file_dict` always comes straight from search.files where 'user' is a raw user-ID string — the dataclass branch was unreachable. The asymmetry between the file_types and no-file_types branches that flagged this had already been resolved by routing both through `_resolve_file_user`; this just drops the leftover guard and the now-unused `dataclasses` import. Fixes #120
Missed in the previous commit. Same fix, no code changes. Refs #120
`_multipro_message_worker` was recompiling each sig.patterns entry on every iteration of the message loop, and on the matching path it called Pattern.search a second time to extract .group(0). For N messages and P patterns this came out to N * P compiles and up to 2 * N * P searches. Compile each pattern once before the message loop, capture the Match object once, and reuse it for both the truthy check and the group-0 extraction. Now P compiles and N * P searches. Fixes #121
…tions Each entry now carries a `**[Critical]**` / `**[High]**` / `**[Medium]**` prefix matching the severity in PROCESSING_ISSUES.md, and entries inside Changed and Fixed are ordered by severity. Format remains Keep a Changelog-compatible.
…l "None"
`_multipro_message_worker` was doing `str(message.get('text'))`, which
turns missing or null text into the literal string 'None'. A permissive
signature regex could then false-positive match attachment-only
messages, bot/system events, or any other payload Slack returns without
a `text` field — producing a result row with no real text behind it.
Treat empty/missing text as a skip: read message['text'] without
str-coercion and `continue` when it's falsy.
Fixes #122
The unauthenticated workspace-login scrape ran without exception handling. In probe mode this was caught by the outer try/except (which exits with CRITICAL anyway), but in the authenticated flow a transient network blip or a Slack-side HTML/data-props schema change aborted the entire scan after auth had already succeeded. Wrap the request and parse logic in try/except for `requests.RequestException`, `json.JSONDecodeError`, and `AttributeError`. Return None (matching the existing "no info available" contract) and emit a WARNING via an optional logger argument. Both call sites in __init__.py now pass OUTPUT_LOGGER. Fixes #123
The function was annotated `Dict[str, List[str]] | None` but the actual payload mixes lists, booleans, strings and None — every value type except a list shows up. Type checkers (and any caller naive enough to trust the annotation, e.g. iterating values as lists) were wrong. Widen to `Dict[str, Any] | None` and add a runtime assertion in the existing happy-path test that locks in the heterogeneous contract. Fixes #124
Slack interprets the `after:` operator in the workspace timezone. The
previous `time.strftime('%Y-%m-%d', time.localtime(now - X))` formatted
in the host's TZ, so any host on a different TZ to the workspace got
an off-by-one-day window at the day boundary — typical for CI, cloud
runners, and distributed teams.
Switch to `time.gmtime` and factor the four inline branches in main()
into a small `_compute_timeframe(tm, now)` helper so the TZ behaviour
is unit-testable. Tests cover all four timeframes and trap any future
reintroduction of `time.localtime`.
Fixes #125
`f'{workspace_information.url}canvas/{channel.id}'` produced a
malformed URL like `https://example.slack.comcanvas/C123` whenever
team.info returned a workspace URL without a trailing slash.
Extract a small `_build_canvas_url(workspace_url, channel_id)` helper
that rstrip('/')s the base before concatenation, so both
`https://example.slack.com` and `https://example.slack.com/` resolve
to a single `/` between host and path.
Fixes #126
`cursor = r.get('response_metadata').get('next_cursor')` was set inside
the inner `for value in r.get(scope)` loop. On a normal page that
rewrote `cursor` to the same value N times — wasted but harmless. On a
page where `r.get(scope)` came back empty the inner loop never ran,
so `cursor` was never updated and the outer `while` re-requested the
same cursor on the next iteration: a potential infinite loop on
paginated responses that surface an empty `scope` list with a
non-empty `next_cursor`.
Move the assignment after the inner loop so the cursor advances once
per page regardless of how many items were returned. Add a regression
test that pages through three responses (one with an empty `members`
list) and asserts the cursor seen on each request matches the one
returned by the previous page.
Two small clarity fixes in `__init__.py`: - `OUTPUT_LOGGER` was annotated `JSONLogger` but `init_logger` returns `StdoutLogger | JSONLogger` depending on the `--output` flag. Widen the annotation to match the actual runtime type. - `if everything or not pii and not secrets:` parses correctly under Python precedence (`not` > `and` > `or`) but is easy to misread. Add explicit parentheses: `if everything or (not pii and not secrets):`. No behaviour change.
Five additions from a coverage audit of test_watchman_processor.py:
- Drop dead `@patch('slack_watchman.watchman_processor.SlackClient')`
decorators from `test_get_users` and `test_get_channels`. Both tests
passed their own `MagicMock()` to the function under test and never
used the patched symbol — the decorator was a no-op.
- `test_find_auth_information` now asserts the full set of nine output
keys, so a silent rename/drop/addition in the `output = {...}` literal
can't slip past unnoticed.
- `test_find_files_pool_size_matches_short_query_list`: mirrors the
existing `find_messages` test, locking in that a sub-cap query list
spawns one worker per query rather than always padding to the cap.
- `test_find_files_pool_initargs_carry_parent_credentials`: mirrors the
existing `find_messages` test, locking in that the pool initializer
receives the parent client's `(token, url, session_token, cookie_dict)`
so file workers don't redo the cookie -> session-token roundtrip.
- `test_multipro_message_worker_falls_back_to_username_when_user_id_missing`:
parametrised across `user=None`, `user=''`, and `user` key absent.
Asserts `users.info` is not called and the raw `message['username']`
is attached to the outgoing message — the bot/system-message fallback
branch was previously uncovered.
Feature/processing bug fixes
Prepares the 4.6.0 release: version in pyproject.toml updated from 4.5.0 to 4.6.0 and the [Unreleased] CHANGELOG block dated as [4.6.0] on 2026-05-04.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
[4.6.0] - 2026-05-04
Changed
users.infoandconversations.infolookups by ID for the lifetime of the worker, eliminating redundant API calls per match. Fixes#117
find_messagesandfind_filesnow use a boundedmultiprocessing.Pool(capped at 8) instead of oneProcessper search string, preventing memory and file-descriptor usage from scaling with signaturesize. Fixes #113
SlackClientvia_init_worker_clientinstead of unpickling the parent client, preserving connection pool and retry config across workers. Fixes#114
SlackClient.__init__accepts new keyword-onlysession_tokenandcookie_dictparameters, used by pool workers to skip re-running_get_session_tokenon startup.find_messagesandfind_filesnow share a singlemultiprocessing.Managerper call, replacing three per-call managers that leaked until GC. Fixes#116
Fixed
AttributeError) are now caught and reported via a sharederrorslist instead of silently killing the child process. Fixes#108
_multipro_message_workerno longer crashes on messages with nochannelfield (DMs, tombstoned messages, bot events). Fixes #109_multipro_file_workerno longer crashes when a file'snameorfiletypeisNone(deleted/redacted/tombstoned states). Fixes #110main()error handler no longer crashes withAttributeErrorwhen an exception is raised beforeinit_loggerruns. Falls back totraceback.print_exc()when the logger is uninitialised. Fixes#111
SlackClient._make_requestnow honours theRetry-Afterheader on 429s (falling back to 90s), loops back through_make_requestfor consecutive 429s, and raises after a configurable retry cap. Fixes#115
_build_canvas_url(), which normalises the trailing slash on the workspace URL before concatenation. Fixes #126after:query timeframes are now computed in UTC (time.gmtime) instead of local time, fixing off-by-one-day windows on CI/cloud hosts. Refactored into_compute_timeframe()helper. Fixes#125
find_auth_informationreturn type widened toDict[str, Any] | Noneto reflect the actual mixed-type payload. Fixes #124find_auth_informationnow catchesRequestException,JSONDecodeError, andAttributeError, returningNonewith aWARNINGinstead of aborting the scan on transient scrape failures. Fixes#123
find_messagesandfind_filesnow explicitly return[]on no-match and exception paths instead of implicitNone. Fixes #118_multipro_file_workerno longer appends a duplicate result persig.file_typesentry; replaced per-type loop with a singleany()check. Fixes#119
is_dataclassguard and unuseddataclassesimport from_resolve_file_user. Fixes #120_multipro_message_workernow skips messages with notextinstead of coercingNoneto the string'None', preventing false positives on attachment-only and system messages. Fixes#122
_multipro_message_workernow compiles each pattern once per worker and reuses theMatchobject, eliminating per-message recompilation and a duplicatesearchcall on the match path. Fixes#121
cursor_api_searchnow updatescursoronce per page after items are collected, fixing a potential infinite loop on empty pages with a non-emptynext_cursor. Fixes cursor_api_search.OUTPUT_LOGGERannotation in__init__.pywidened toStdoutLogger | JSONLogger.if everything or (not pii and not secrets):for readability; no behaviour change.