Async predictors crash on 0.17 -- setup() and predict() run on different event loops #2926

@michaeldwan

Description

Async predictors that create event-loop-bound resources in setup() (like httpx.AsyncClient) crash on cog 0.17 because setup() and predict() no longer share the same asyncio event loop.

In 0.16, asyncio.run() created one event loop for the entire subprocess lifecycle -- sync setup() ran inline, then predict() coroutines ran on the same loop via TaskGroup. An httpx.AsyncClient created in setup() would lazily bind to this loop on first use, and everything worked.
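The single-loop lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not cog's actual code: `LazyResource`, `lifecycle`, and `run_single_loop` are hypothetical names, `LazyResource` stands in for a client that binds to a loop on first use, and `asyncio.gather` is used in place of `TaskGroup` so the sketch also runs on Python versions before 3.11.

```python
import asyncio


class LazyResource:
    """Hypothetical stand-in for a client that binds to a loop on first use."""

    def __init__(self):
        self.loop = None

    async def use(self):
        running = asyncio.get_running_loop()
        if self.loop is None:
            self.loop = running  # lazy binding on first use
        if self.loop is not running:
            raise RuntimeError("resource used on a different event loop")
        return "ok"


class Predictor:
    def setup(self):
        # Sync setup: the resource is created but not yet bound to a loop.
        self.resource = LazyResource()

    async def predict(self, prompt):
        return f"{prompt}: {await self.resource.use()}"


async def lifecycle(predictor, prompts):
    # Setup runs inline, then every prediction runs on this same loop.
    predictor.setup()
    return await asyncio.gather(*(predictor.predict(p) for p in prompts))


def run_single_loop(prompts):
    # One asyncio.run() call, hence one event loop for the whole lifecycle.
    return asyncio.run(lifecycle(Predictor(), prompts))


if __name__ == "__main__":
    print(run_single_loop(["a", "b"]))
```

Because every `predict()` call shares the loop that the resource first bound to, the loop check never fails in this layout.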

In 0.17's coglet runtime, the architecture split:

  1. setup() runs on a tokio worker thread with no running asyncio event loop (worker_bridge.rs:278-305)
  2. If setup() is async, it runs via asyncio.run() which creates a temporary event loop that's destroyed after setup finishes (predictor.rs:553-583)
  3. predict() coroutines are submitted to a separate dedicated event loop thread via run_coroutine_threadsafe (worker_bridge.rs:469-523)

Any async resource created during setup() is bound to either no loop or the ephemeral asyncio.run() loop. When predict() tries to use it on the dedicated loop, the resource fails -- closed loop, wrong loop, or connection pool mismatch.
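The "wrong loop" failure can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration rather than coglet's implementation: `Predictor`, `start_loop_thread`, and `run_split_loops` are hypothetical names, and a plain `asyncio.Future` stands in for a loop-bound resource such as an HTTP client. Setup runs under `asyncio.run()` (a temporary loop), and predict is submitted to a separate loop thread via `run_coroutine_threadsafe`.

```python
import asyncio
import threading


class Predictor:
    async def setup(self):
        # The future is bound to whichever loop runs setup().
        self.ready = asyncio.get_running_loop().create_future()

    async def predict(self):
        # Awaiting a pending future owned by another loop raises RuntimeError.
        return await self.ready


def start_loop_thread():
    # Dedicated event loop running in its own thread.
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


def run_split_loops():
    predictor = Predictor()
    # Temporary loop: created for setup, closed when asyncio.run() returns.
    asyncio.run(predictor.setup())

    loop, thread = start_loop_thread()
    try:
        future = asyncio.run_coroutine_threadsafe(predictor.predict(), loop)
        try:
            future.result(timeout=5)
            return "no error"
        except RuntimeError as exc:
            return str(exc)
    finally:
        loop.call_soon_threadsafe(loop.stop)
        thread.join()
        loop.close()


if __name__ == "__main__":
    print(run_split_loops())
```

Here asyncio reports the mismatch as a `RuntimeError` whose message mentions a future "attached to a different loop". Real clients may fail differently, for example on a closed loop or a stale connection pool.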

Reproduction

from cog import BasePredictor
import httpx

class Predictor(BasePredictor):
    def setup(self):
        # Sync setup: the client is created with no running event loop.
        self.client = httpx.AsyncClient(timeout=300.0)

    async def predict(self, prompt: str) -> str:
        # First use of the client happens here, on the predict() loop.
        r = await self.client.get("https://example.com")
        return r.text

This works on 0.16, crashes on 0.17. The crash happens inside the Python subprocess before telemetry fires, so no error message is captured -- the orchestrator just sees the subprocess die.

Symptoms

  • Container crashes almost immediately when handling a prediction (~1.5s)
  • No error message, shutdown cause, or HTTP status captured in telemetry
  • Load-sensitive: the crash rate scales with concurrency and follows diurnal traffic patterns
  • Affects async predictors that use httpx.AsyncClient, aiohttp.ClientSession, or any other event-loop-bound resource created in setup()

Proposed fix

Run async setup() on the same dedicated event loop that predict() uses. The change is in worker_bridge.rs::setup() -- when setup_is_async, submit the setup coroutine via run_coroutine_threadsafe(coro, shared_loop) + future.result() instead of asyncio.run(). Sync setup() stays as-is.

This preserves the blocking semantics (worker.rs still waits for setup to complete before sending Ready), requires no changes to the worker lifecycle or orchestrator, and matches 0.16's behavior where both ran on the same loop.
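In plain Python, the shape of the proposed fix might look like the sketch below. It is not the actual worker_bridge.rs change: `Predictor`, `start_loop_thread`, `run_blocking`, and `run_on_shared_loop` are hypothetical names, and the pending future resolved by `call_later` stands in for a resource that only works while its loop stays alive. The setup coroutine is submitted to the dedicated loop and the caller blocks on `future.result()` before any prediction is submitted.

```python
import asyncio
import threading


class Predictor:
    async def setup(self):
        loop = asyncio.get_running_loop()
        self.setup_loop = loop
        self.ready = loop.create_future()
        # Resolved shortly after setup; only fires if this loop keeps running.
        loop.call_later(0.01, self.ready.set_result, "ready")

    async def predict(self, prompt):
        status = await self.ready
        same_loop = asyncio.get_running_loop() is self.setup_loop
        return f"{prompt}: {status}, same_loop={same_loop}"


def start_loop_thread():
    # Dedicated event loop shared by setup() and predict().
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


def run_blocking(coro, loop, timeout=5):
    # Submit to the shared loop and block the calling thread until done.
    return asyncio.run_coroutine_threadsafe(coro, loop).result(timeout=timeout)


def run_on_shared_loop(prompt):
    predictor = Predictor()
    loop, thread = start_loop_thread()
    try:
        run_blocking(predictor.setup(), loop)  # blocks until setup completes
        return run_blocking(predictor.predict(prompt), loop)
    finally:
        loop.call_soon_threadsafe(loop.stop)
        thread.join()
        loop.close()


if __name__ == "__main__":
    print(run_on_shared_loop("hi"))
```

Blocking on `result()` keeps the "setup completes before Ready" ordering, while anything created in `setup()` stays bound to the loop that later runs `predict()`.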
