Delay transfer and end_call until agent finishes speaking #204

Closed
drago-balto wants to merge 1 commit into cartesia-ai:main from drago-balto:dmm/delay-transfer-end-call

Contributor

drago-balto commented Apr 13, 2026

Summary

  • Adds after_speech: bool = False flag to AgentTransferCall and AgentEndCall events
  • When after_speech=True, ConversationRunner waits for the TTS idle signal before sending the event over the websocket, preventing the agent from being cut off mid-sentence
  • Built-in transfer_call and end_call tools set after_speech=True by default
  • Custom tools can opt out by setting after_speech=False
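
To make the flag concrete, here is a minimal sketch of how the two events might carry it. The classes below are hypothetical stand-ins, not the SDK's actual definitions: only the `after_speech: bool = False` field and the built-in tools' default of `True` come from the summary above, and `target_phone_number` plus both helper functions are invented for illustration.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the SDK event classes. Only the after_speech
# field and its False default are taken from the PR summary.
@dataclass
class AgentEndCall:
    after_speech: bool = False


@dataclass
class AgentTransferCall:
    target_phone_number: str  # assumed field name, for illustration only
    after_speech: bool = False


def builtin_end_call() -> AgentEndCall:
    # Built-in tools opt in, so the farewell is spoken before the hangup.
    return AgentEndCall(after_speech=True)


def custom_urgent_transfer(number: str) -> AgentTransferCall:
    # A custom tool that keeps the default fires the transfer immediately.
    return AgentTransferCall(target_phone_number=number)
```

Keeping the event default at `False` means existing custom tools behave as before, while the built-in tools get the deferred behavior without any caller changes.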

Problem

When the LLM generates speech text and a transfer/end_call tool call in the same turn, the transfer fires immediately while TTS is still playing, cutting off the agent mid-sentence. A fixed sleep delay is not a good solution since speech length varies.

How it works

  • ConversationRunner tracks TTS state via AgentStateInput (speaking/idle) using an asyncio.Event
  • Before sending an after_speech=True event, the runner waits for the idle signal (with a 30s safety timeout)
  • If no text was sent in the current turn, the wait is skipped entirely
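
The three steps above can be sketched roughly as follows. This is not the PR's code: `SpeechGate`, its method names, and the string return values are assumptions made for the example. Only the `asyncio.Event`, the 30s timeout, and the skip when no text was sent come from the description.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class SpeechGate:
    """Hypothetical helper mirroring the wait described above."""

    def __init__(self, timeout: float = 30.0) -> None:
        self.speech_done = asyncio.Event()
        self.speech_done.set()  # start in the idle state
        self.has_sent_text = False
        self.timeout = timeout

    def on_agent_state(self, speaking: bool) -> None:
        # Driven by the speaking/idle state reported for TTS.
        if speaking:
            self.speech_done.clear()
        else:
            self.speech_done.set()

    async def wait_before_send(self, after_speech: bool) -> str:
        # Skip the wait if the event opted out or nothing was said this turn.
        if not after_speech or not self.has_sent_text:
            return "skipped"
        try:
            await asyncio.wait_for(self.speech_done.wait(), timeout=self.timeout)
            return "idle"
        except asyncio.TimeoutError:
            # Safety valve: send the event anyway rather than hang the call.
            logger.warning("Timed out waiting for speech to complete")
            return "timeout"
```

A runner would call `wait_before_send(event.after_speech)` just before writing the event to the websocket and then send it regardless of the result; the timeout only bounds how long the send can be delayed.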

Test plan

  • All existing tests pass (380 passed)
  • Manual test: agent says long farewell + transfers in same turn — speech completes before transfer
  • Manual test: after_speech=False — transfer fires immediately, cutting off speech
  • Manual test: transfer with no preceding text — no hang, transfers immediately

🤖 Generated with Claude Code


Note

Medium Risk
Changes call-control timing by delaying end_call/transfer_call until TTS reports idle, which could introduce waits/timeouts or ordering differences in live call flows.

Overview
Adds an after_speech flag to AgentEndCall and AgentTransferCall output events so call-control actions can be deferred until the agent finishes speaking.

Updates ConversationRunner to track TTS speaking/idle via an asyncio.Event and, when an output event has after_speech=True (and text was sent this turn), waits up to 30s for speech to complete before sending the event over the websocket.

Sets the built-in end_call and transfer_call tools to emit after_speech=True by default, preventing transfers/hangups from cutting off same-turn TTS.

Reviewed by Cursor Bugbot for commit e5d41ca. Bugbot is set up for automated code reviews on this repo.

When the LLM generates speech text and a transfer/end_call tool in the
same turn, the transfer would fire immediately over the websocket while
TTS was still playing, cutting off the agent mid-sentence. This adds an
`after_speech` flag to AgentTransferCall and AgentEndCall that makes the
ConversationRunner wait for the TTS idle signal before sending the event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread line/voice_agent_app.py
    except asyncio.TimeoutError:
        logger.warning(
            f"Timed out waiting for speech to complete before {type(output).__name__}"
        )

Race condition: speech_done wait returns immediately, defeating feature

High Severity

The speech_done event is initialized as SET and only cleared when AgentStateInput(SPEAKING) arrives from the remote client. But in the primary use case (LLM generates text + end_call/transfer in the same turn), the runner task sends text and then immediately checks speech_done.wait() — long before the client has received the text, started TTS, and sent back the SPEAKING signal over the network. Since speech_done is still SET, wait() returns instantly, and the transfer/end_call fires immediately, cutting off the agent mid-sentence. The speech_done event needs to be cleared locally when AgentSendText is sent (where has_sent_text is set), not when the remote SPEAKING acknowledgement arrives.
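
The fix suggested here can be illustrated with a small simulation. It is only a sketch of the idea, not the repository's code: `LocalClearGate`, `on_text_sent`, `on_idle`, and `demo` are hypothetical names, and the remote TTS round trip is faked with a short sleep.

```python
import asyncio


class LocalClearGate:
    """Hypothetical gate that marks speech as pending when text is sent."""

    def __init__(self) -> None:
        self.speech_done = asyncio.Event()
        self.speech_done.set()  # idle until text is sent
        self.has_sent_text = False

    def on_text_sent(self) -> None:
        # Clear locally at send time instead of waiting for the remote
        # SPEAKING signal, which arrives only after a network round trip.
        self.has_sent_text = True
        self.speech_done.clear()

    def on_idle(self) -> None:
        self.speech_done.set()


async def demo() -> list:
    gate = LocalClearGate()
    order = []
    gate.on_text_sent()  # the agent's farewell text leaves the runner

    async def send_transfer() -> None:
        # Runs right after the text is sent, as in the same-turn case.
        await asyncio.wait_for(gate.speech_done.wait(), timeout=1.0)
        order.append("transfer")

    async def remote_tts() -> None:
        await asyncio.sleep(0.01)  # simulated playback plus network delay
        order.append("speech finished")
        gate.on_idle()

    await asyncio.gather(send_transfer(), remote_tts())
    return order
```

Because the event is cleared before the transfer coroutine first checks it, the wait blocks until the simulated idle signal arrives. With the event left set until a remote SPEAKING message, the same wait would return at once.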

Additional Locations (2)

Contributor Author

drago-balto commented Apr 13, 2026

Note: this implements the behavior for both end_call and transfer_call. In testing, however, the Cartesia harness already appears to wait for the agent to finish speaking before ending the call, so only the transfer_call change is likely to have any practical effect.

That also suggests this PR may not be needed at all. A better fix might be to extend the harness behavior (waiting for the agent to stop speaking) so that it covers transfer_call as well as end_call.

@sauhardjain
Collaborator

This will be fixed on the harness-side! We're also working on making these events interruptible/uninterruptible, so that you can reliably control speech before the terminal states play out. #210

Thank you for flagging and putting up the PR anyway!
