Async predictors that create event-loop-bound resources in `setup()` (like `httpx.AsyncClient`) crash on cog 0.17 because `setup()` and `predict()` no longer share the same asyncio event loop.
In 0.16, `asyncio.run()` created one event loop for the entire subprocess lifecycle -- sync `setup()` ran inline, then `predict()` coroutines ran on the same loop via `TaskGroup`. An `httpx.AsyncClient` created in `setup()` would lazily bind to this loop on first use, and everything worked.
In 0.17's coglet runtime, the architecture split:

- `setup()` runs on a tokio worker thread with no running asyncio event loop (worker_bridge.rs:278-305)
- If `setup()` is async, it runs via `asyncio.run()`, which creates a temporary event loop that is destroyed after setup finishes (predictor.rs:553-583)
- `predict()` coroutines are submitted to a separate dedicated event loop thread via `run_coroutine_threadsafe` (worker_bridge.rs:469-523)
Any async resource created during `setup()` is bound to either no loop or the ephemeral `asyncio.run()` loop. When `predict()` tries to use it on the dedicated loop, the resource fails -- closed loop, wrong loop, or connection pool mismatch.
Reproduction
```python
from cog import BasePredictor, Path
import httpx

class Predictor(BasePredictor):
    def setup(self):
        self.client = httpx.AsyncClient(timeout=300.0)

    async def predict(self, prompt: str) -> str:
        r = await self.client.get("https://example.com")
        return r.text
```
This works on 0.16, crashes on 0.17. The crash happens inside the Python subprocess before telemetry fires, so no error message is captured -- the orchestrator just sees the subprocess die.
Symptoms
- Container crashes almost immediately when handling a prediction (~1.5s)
- No error message, shutdown cause, or HTTP status captured in telemetry
- Load-sensitive: scales with concurrency, follows diurnal traffic patterns
- Affects async predictors that use `httpx.AsyncClient`, `aiohttp.ClientSession`, or any other event-loop-bound resource created in `setup()`
Proposed fix
Run async `setup()` on the same dedicated event loop that `predict()` uses. The change is in `worker_bridge.rs::setup()` -- when `setup_is_async`, submit the setup coroutine via `run_coroutine_threadsafe(coro, shared_loop)` + `future.result()` instead of `asyncio.run()`. Sync `setup()` stays as-is.
This preserves the blocking semantics (worker.rs still waits for setup to complete before sending Ready), requires no changes to the worker lifecycle or orchestrator, and matches 0.16's behavior where both ran on the same loop.