Delay transfer and end_call until agent finishes speaking #204
drago-balto wants to merge 1 commit into cartesia-ai:main
When the LLM generates speech text and a transfer/end_call tool in the same turn, the transfer would fire immediately over the websocket while TTS was still playing, cutting off the agent mid-sentence. This adds an `after_speech` flag to `AgentTransferCall` and `AgentEndCall` that makes the `ConversationRunner` wait for the TTS idle signal before sending the event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e5d41ca.
    except asyncio.TimeoutError:
        logger.warning(
            f"Timed out waiting for speech to complete before {type(output).__name__}"
        )
Race condition: speech_done wait returns immediately, defeating feature
High Severity
The speech_done event is initialized as SET and only cleared when AgentStateInput(SPEAKING) arrives from the remote client. But in the primary use case (LLM generates text + end_call/transfer in the same turn), the runner task sends text and then immediately checks speech_done.wait() — long before the client has received the text, started TTS, and sent back the SPEAKING signal over the network. Since speech_done is still SET, wait() returns instantly, and the transfer/end_call fires immediately, cutting off the agent mid-sentence. The speech_done event needs to be cleared locally when AgentSendText is sent (where has_sent_text is set), not when the remote SPEAKING acknowledgement arrives.
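The race described above, and the suggested fix, can be illustrated with a minimal, self-contained sketch. The `SpeechTracker` class, its method names, and the `clear_on_send` switch are hypothetical and exist only for this illustration; they are not the PR's code. Only the `asyncio.Event` behavior is real.

```python
import asyncio


class SpeechTracker:
    """Hypothetical stand-in for the runner's speech_done bookkeeping."""

    def __init__(self, clear_on_send: bool) -> None:
        self.speech_done = asyncio.Event()
        self.speech_done.set()  # initialized as SET: nothing is being spoken yet
        self.clear_on_send = clear_on_send

    def on_text_sent(self) -> None:
        # Suggested fix: mark speech as pending as soon as text is sent locally.
        if self.clear_on_send:
            self.speech_done.clear()

    def on_remote_state(self, speaking: bool) -> None:
        # Remote SPEAKING/IDLE signals arrive later, after a network round trip.
        if speaking:
            self.speech_done.clear()
        else:
            self.speech_done.set()


async def transfer_fires_early(clear_on_send: bool) -> bool:
    """Return True if the wait completes before any remote signal arrives."""
    tracker = SpeechTracker(clear_on_send)
    tracker.on_text_sent()
    try:
        # At this point no SPEAKING or IDLE signal has been received.
        await asyncio.wait_for(tracker.speech_done.wait(), timeout=0.05)
        return True
    except asyncio.TimeoutError:
        return False


buggy = asyncio.run(transfer_fires_early(clear_on_send=False))
fixed = asyncio.run(transfer_fires_early(clear_on_send=True))
print(buggy, fixed)  # True False
```

Without the local clear, the wait returns immediately because the event is still set. With it, the wait blocks until an idle signal (or the timeout) arrives.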
Note: this implements the behavior for both end_call and transfer_call. However, the observation from testing is that the Cartesia harness may already be waiting for the agent to finish speaking before ending the call, and thus only the transfer_call changes are of any consequence. That also suggests that this PR may not be needed at all, and that a better way of handling this might be to extend the harness behavior (of waiting for agent to stop speaking) to both end_call and transfer_call.

This will be fixed on the harness-side! We're also working on making these events interruptible/uninterruptible, so that you can reliably control speech before the terminal states play out. #210 Thank you for flagging and putting up the PR anyway!


Summary
- Adds an `after_speech: bool = False` flag to `AgentTransferCall` and `AgentEndCall` events
- When `after_speech=True`, `ConversationRunner` waits for the TTS idle signal before sending the event over the websocket, preventing the agent from being cut off mid-sentence
- Built-in `transfer_call` and `end_call` tools set `after_speech=True` by default
- […] `after_speech=False`

Problem
When the LLM generates speech text and a transfer/end_call tool call in the same turn, the transfer fires immediately while TTS is still playing, cutting off the agent mid-sentence. A fixed sleep delay is not a good solution since speech length varies.
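To make the point about fixed delays concrete, here is a small simulation, offered as an illustrative sketch rather than anything from the PR. The `speak` coroutine is a hypothetical stand-in for TTS playback, and the durations are arbitrary values chosen for the example.

```python
import asyncio


async def speak(duration: float, done: asyncio.Event) -> None:
    """Simulated TTS playback (hypothetical): sets `done` when audio ends."""
    await asyncio.sleep(duration)
    done.set()


async def cut_off_with_fixed_sleep(duration: float, delay: float) -> bool:
    """Return True if the transfer would fire while speech is still playing."""
    done = asyncio.Event()
    task = asyncio.create_task(speak(duration, done))
    await asyncio.sleep(delay)  # fixed delay, independent of speech length
    cut_off = not done.is_set()
    await task
    return cut_off


async def cut_off_with_signal(duration: float) -> bool:
    """Same check, but waiting on the completion signal instead of sleeping."""
    done = asyncio.Event()
    task = asyncio.create_task(speak(duration, done))
    await done.wait()  # returns only once playback has finished
    cut_off = not done.is_set()
    await task
    return cut_off


short_ok = asyncio.run(cut_off_with_fixed_sleep(duration=0.01, delay=0.1))
long_cut = asyncio.run(cut_off_with_fixed_sleep(duration=0.3, delay=0.1))
signal_cut = asyncio.run(cut_off_with_signal(duration=0.3))
print(short_ok, long_cut, signal_cut)  # False True False
```

The fixed delay over-waits on the short utterance and still cuts off the long one, whereas waiting on the signal adapts to either length.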
How it works
- `ConversationRunner` tracks TTS state via `AgentStateInput` (speaking/idle) using an `asyncio.Event`
- Before sending an `after_speech=True` event, the runner waits for the idle signal (with a 30s safety timeout)

Test plan
- With `after_speech=False`: transfer fires immediately, cutting off speech

🤖 Generated with Claude Code
Note
Medium Risk
Changes call-control timing by delaying `end_call`/`transfer_call` until TTS reports idle, which could introduce waits/timeouts or ordering differences in live call flows.

Overview
Adds an `after_speech` flag to `AgentEndCall` and `AgentTransferCall` output events so call-control actions can be deferred until the agent finishes speaking.

Updates `ConversationRunner` to track TTS speaking/idle via an `asyncio.Event` and, when an output event has `after_speech=True` (and text was sent this turn), waits up to 30s for speech to complete before sending the event over the websocket.

Sets the built-in `end_call` and `transfer_call` tools to emit `after_speech=True` by default, preventing transfers/hangups from cutting off same-turn TTS.
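The deferred-send behavior described in this PR could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SDK's implementation: `RunnerSketch` and its methods are hypothetical, the `AgentEndCall` dataclass here is a stand-in that models only the flag, and a plain list stands in for the websocket. The sketch also clears the event locally when text is sent, following the Bugbot suggestion rather than the reviewed commit.

```python
import asyncio
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class AgentEndCall:
    """Simplified stand-in for the real event; only the flag is modeled."""
    after_speech: bool = False


class RunnerSketch:
    """Hypothetical, simplified runner; not the SDK's ConversationRunner."""

    def __init__(self, speech_timeout: float = 30.0) -> None:
        self.speech_done = asyncio.Event()
        self.speech_done.set()
        self.has_sent_text = False
        self.speech_timeout = speech_timeout
        self.sent = []  # stands in for the websocket

    def send_text(self, text: str) -> None:
        self.has_sent_text = True
        self.speech_done.clear()  # cleared locally, per the Bugbot suggestion
        self.sent.append(text)

    def on_agent_idle(self) -> None:
        self.speech_done.set()

    async def send_event(self, output: AgentEndCall) -> None:
        # Only wait when the flag is set and text was sent this turn.
        if output.after_speech and self.has_sent_text:
            try:
                await asyncio.wait_for(self.speech_done.wait(), self.speech_timeout)
            except asyncio.TimeoutError:
                logger.warning(
                    f"Timed out waiting for speech to complete before {type(output).__name__}"
                )
        self.sent.append(output)


async def demo() -> list:
    runner = RunnerSketch()
    order = []
    runner.send_text("Goodbye!")

    async def tts_finishes() -> None:
        await asyncio.sleep(0.05)  # simulated playback time
        order.append("idle")
        runner.on_agent_idle()

    tts = asyncio.create_task(tts_finishes())
    await runner.send_event(AgentEndCall(after_speech=True))
    order.append("end_call sent")
    await tts
    return order


order = asyncio.run(demo())
print(order)  # ['idle', 'end_call sent']
```

The timeout acts as a safety net: if the idle signal never arrives, the event is still sent after `speech_timeout` seconds instead of blocking the call indefinitely.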