fix(dflash): daemon-mode OOM resilience (refs #114) #166
Merged
When the daemon hits a cudaMalloc OOM during prefill — either in the
gallocr pre-reserve or inside the token-segmented prefill loop — the
process used to `return 1;` and die, taking every subsequent request
with it (BrokenPipe on the next stdin write from server.py).
This patch keeps the daemon alive:
* test_dflash.cpp — at the three fatal OOM sites in token-segmented
prefill (gallocr pre-reserve, in-loop build, in-loop compute), free
every prefix snapshot to recover VRAM, emit `[snap] all-cleared` so
the Python side can invalidate its LRU, and `stream_emit(-1)` +
abort the current request via `continue` (gallocr site) or `goto
_req_aborted_oom` (in-loop sites that need to escape the prefill
for-loop before reaching the daemon while-loop). One-shot CLI mode
(daemon_mode=false) still `return 1`s as before.
* prefix_cache.py — `DaemonStdoutBus.register_line_callback(prefix, cb)`
invokes `cb()` whenever the daemon emits a stdout line starting with
`prefix`. `PrefixCache.mark_all_cleared()` drops every LRU entry and
resets `next_slot`/`_pending_evict_key` so the next request does not
RESTORE a freed slot.
* server.py — wire `bus.register_line_callback("[snap] all-cleared",
prefix_cache.mark_all_cleared)` after the cache is built.
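A minimal sketch of the stdout-callback wiring described in the bullets above, under simplifying assumptions. `LineCallbackBus`, `feed_line`, and `TinyPrefixCache` are hypothetical stand-ins, not the classes in prefix_cache.py; only `register_line_callback(prefix, cb)`, `mark_all_cleared()`, the `next_slot`/`_pending_evict_key` fields, and the `[snap] all-cleared` marker come from this PR.

```python
from typing import Callable, Optional


class LineCallbackBus:
    """Hypothetical stand-in for DaemonStdoutBus: dispatches on line prefix."""

    def __init__(self) -> None:
        self._callbacks: list[tuple[str, Callable[[], None]]] = []

    def register_line_callback(self, prefix: str, cb: Callable[[], None]) -> None:
        self._callbacks.append((prefix, cb))

    def feed_line(self, line: str) -> None:
        # Assumed entry point: the real bus reads lines from daemon stdout.
        for prefix, cb in self._callbacks:
            if line.startswith(prefix):
                cb()


class TinyPrefixCache:
    """Hypothetical cache holding only the fields named in the PR text."""

    def __init__(self) -> None:
        self.entries: dict[tuple[int, ...], int] = {}
        self.next_slot = 0
        self._pending_evict_key: Optional[tuple[int, ...]] = None

    def mark_all_cleared(self) -> None:
        # Drop every LRU entry so no later request maps to a freed slot.
        self.entries.clear()
        self.next_slot = 0
        self._pending_evict_key = None


bus = LineCallbackBus()
cache = TinyPrefixCache()
cache.entries[(1, 2, 3)] = 0
cache.next_slot = 1
bus.register_line_callback("[snap] all-cleared", cache.mark_all_cleared)

bus.feed_line("unrelated daemon output")  # no prefix match, cache untouched
entries_before = len(cache.entries)

bus.feed_line("[snap] all-cleared")       # prefix match, cache invalidated
entries_after = len(cache.entries)
```

Matching on a line prefix keeps the daemon-to-server protocol to a single marker line, so the C++ side needs no structured output to signal that its snapshots are gone.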
Validated with NousResearch/hermes-agent (31 tools, ~17K-token prompts)
on RTX 3090 / Qwen3.6-27B Q4_K_M, max_ctx=65536: 10/10 multi-turn tasks
complete cleanly, daemon stable. Recommended low-VRAM serving config
that fits hermes-agent on a single 3090:
DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
python3 scripts/server.py \
--tokenizer Qwen/Qwen3.6-27B --port 8000 \
--max-ctx 65536 --fa-window 1024 \
--prefix-cache-slots 1 --budget 8 --daemon
Refs #114. Does not auto-close; maintainers can close when satisfied.
5946c89 to 2781459
1 issue found across 3 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/prefix_cache.py">
<violation number="1" location="dflash/scripts/prefix_cache.py:554">
P1: `mark_all_cleared()` only resets prefix-cache state and leaves full-cache slot metadata intact, so post-OOM requests can reuse stale full-cache slot ids and fail restores.</violation>
</file>
Comment on lines +554 to +558:
    n = len(self.entries)
    self.entries.clear()
    self.next_slot = 0
    self._pending_evict_key = None
    print(f"{self.log_prefix} all-cleared — dropped {n} LRU entries", flush=True)
<file context>
@@ -525,6 +541,22 @@ def lookup(self, prompt_ids: list[int]) -> tuple[int, int] | None:
+ """
+ if self.disabled:
+ return
+ n = len(self.entries)
+ self.entries.clear()
+ self.next_slot = 0
</file context>
Suggested change:
    - n = len(self.entries)
    - self.entries.clear()
    - self.next_slot = 0
    - self._pending_evict_key = None
    - print(f"{self.log_prefix} all-cleared — dropped {n} LRU entries", flush=True)
    + n_prefix = len(self.entries)
    + self.entries.clear()
    + self.next_slot = 0
    + self._pending_evict_key = None
    + n_full = 0
    + if not getattr(self, "_full_disabled", True):
    +     n_full = len(self.full_entries)
    +     self.full_entries.clear()
    +     self._full_next_slot = 0
    +     self._full_pending_evict_key = None
    +     self._full_pending_evict_path = None
    + print(f"{self.log_prefix} all-cleared — dropped {n_prefix} prefix + {n_full} full entries", flush=True)
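The hazard the review describes can be sketched with a toy model. This assumes, as the review comment does, that the daemon's free-all also invalidates slots referenced by the full cache; `FakeDaemon`, `TwoTierCache`, and the `clear_full` switch are hypothetical names used only for illustration.

```python
class FakeDaemon:
    """Hypothetical model of daemon-side snapshot slots."""

    def __init__(self) -> None:
        self.live_slots: set[int] = set()

    def snapshot(self, slot: int) -> None:
        self.live_slots.add(slot)

    def free_all(self) -> None:
        # Models the OOM recovery path: every snapshot is released.
        self.live_slots.clear()

    def restore(self, slot: int) -> bool:
        # A restore succeeds only if the slot still holds a snapshot.
        return slot in self.live_slots


class TwoTierCache:
    """Hypothetical Python-side metadata: prefix entries plus full entries."""

    def __init__(self) -> None:
        self.entries: dict[str, int] = {}
        self.full_entries: dict[str, int] = {}

    def mark_all_cleared(self, clear_full: bool) -> None:
        self.entries.clear()
        if clear_full:
            self.full_entries.clear()


def failed_restore_after_oom(clear_full: bool) -> bool:
    """Return True if the server would try to restore a freed slot."""
    daemon, cache = FakeDaemon(), TwoTierCache()
    daemon.snapshot(0)
    cache.full_entries["conversation-a"] = 0

    daemon.free_all()                   # daemon recovers VRAM after OOM
    cache.mark_all_cleared(clear_full)  # server reacts to the marker line

    slot = cache.full_entries.get("conversation-a")
    if slot is None:
        return False                    # cache miss: no restore attempted
    return not daemon.restore(slot)     # stale hit: restore fails


stale_without_fix = failed_restore_after_oom(clear_full=False)
stale_with_fix = failed_restore_after_oom(clear_full=True)
```

In this model, clearing only the prefix entries leaves a full-cache key pointing at slot 0 after the daemon has released it, which is the stale-restore case the suggested change is meant to close.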
oliveagle pushed a commit to oliveagle/lucebox-hub that referenced this pull request on May 22, 2026:
…m-resilience fix(dflash): daemon-mode OOM resilience (refs Luce-Org#114)
Summary

Refs #114. Reduces the daemon-death symptom: the daemon now survives a transient cudaMalloc OOM during prefill instead of dying and taking every subsequent request with it. Does not auto-close #114: the underlying VRAM pressure at `max_ctx=65536` on 24 GB is still real; this PR is defense in depth plus a documented workaround config that fits hermes-agent. Maintainers can close the issue when they consider it resolved.

Failure mode

At `max_ctx=65536` on a 24 GB card, target weights (~15 GB) + KV reservations + a single prefix-cache snapshot leave ~1-1.5 GB of headroom. The prefill gallocr pre-reserve needs ~1.2 GB of scratch; once a snapshot is committed, headroom shrinks to the point where one transient `cudaMalloc(1216 MiB)` fails. The daemon's three OOM-fatal sites in token-segmented prefill (`prefill gallocr pre-reserve failed`, `prefill build @N`, `prefill compute @N`) all `return 1;` from `main()`, so the process dies. The next chat completion hits `BrokenPipe` writing to daemon stdin.

What this PR changes (65 lines)

* `dflash/test/test_dflash.cpp`: at the three fatal OOM sites in token-segmented prefill, free every prefix snapshot, emit `[snap] all-cleared` on stdout, `stream_emit(-1)` + abort the current request via `continue` (gallocr site) or `goto _req_aborted_oom` (in-loop sites that need to escape the prefill for-loop). One-shot CLI mode (`daemon_mode=false`) keeps fail-fast `return 1` semantics unchanged.
* `dflash/scripts/prefix_cache.py`: `DaemonStdoutBus.register_line_callback(prefix, cb)` invokes `cb()` whenever the daemon emits a stdout line starting with `prefix`. `PrefixCache.mark_all_cleared()` drops every LRU entry and resets `next_slot`/`_pending_evict_key`.
* `dflash/scripts/server.py`: wire `bus.register_line_callback("[snap] all-cleared", prefix_cache.mark_all_cleared)` after the cache is built.

Validation

NousResearch/hermes-agent (31 tools, ~17K-token prompts) on RTX 3090 + Qwen3.6-27B Q4_K_M, `--max-ctx 65536`:
[results table not recovered; surviving column header: `api_calls`]

Recommended low-VRAM serving config
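
For users hitting #114 on a single 3090 with hermes-agent's mandatory 64K context floor:

    DFLASH27B_KV_TQ3=1 DFLASH27B_PREFILL_UBATCH=128 \
    python3 scripts/server.py \
      --tokenizer Qwen/Qwen3.6-27B --port 8000 \
      --max-ctx 65536 --fa-window 1024 \
      --prefix-cache-slots 1 --budget 8 --daemon

The critical lever is `--budget 8` (vs default 22): a smaller DDTree spec-verify tree halves the prefill gallocr scratch and avoids the OOM in the first place. The patch is defense in depth for users who push past their VRAM budget.

Test plan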

* `cmake --build build --target test_dflash` succeeds
* One-shot CLI mode (`./test_dflash target.gguf draft.safetensors prompt.bin 256 out.bin`) still fail-fasts on OOM

🧙 Built with WOZCODE
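
The headroom figures in the Failure mode and Recommended config sections can be checked with a few lines of arithmetic. This is a rough sketch that treats the quoted "GB" values as GiB and takes "halves the scratch" literally; `fits` and the variable names are illustrative, and real headroom depends on allocator fragmentation that this ignores.

```python
MIB_PER_GIB = 1024


def fits(headroom_gib: float, request_mib: int) -> bool:
    """True if a single allocation of request_mib fits in the given headroom."""
    return request_mib <= headroom_gib * MIB_PER_GIB


scratch_default_mib = 1216                      # cudaMalloc size quoted in the PR
scratch_budget8_mib = scratch_default_mib // 2  # assumes --budget 8 halves it

# Headroom is quoted as roughly 1 to 1.5 GB, so check both ends of the range.
default_at_low = fits(1.0, scratch_default_mib)
default_at_high = fits(1.5, scratch_default_mib)
budget8_at_low = fits(1.0, scratch_budget8_mib)

print(f"default scratch fits at 1.0 GiB headroom: {default_at_low}")
print(f"default scratch fits at 1.5 GiB headroom: {default_at_high}")
print(f"halved scratch fits at 1.0 GiB headroom:  {budget8_at_low}")
```

Under these assumptions the default scratch only fits near the top of the quoted headroom range, while the halved scratch fits across the whole range, which is consistent with `--budget 8` being described as the critical lever.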