Improve the audio handling logic in AgentBase class

DavdGao · DavdGao · commit bbb9eb0953e3 · 2025-09-19T18:07:06.000+08:00
diff --git a/README.md b/README.md
@@ -103,6 +103,7 @@ Quick overview of important features in **AgentScope 1.0**:
 | ...        |                                                                                    |                                                                         |
 
 ## 📢 News
+- **[2025-09]** **Voice agent** is online! `ReActAgent` supports Qwen-Omni and GPT-Audio natively now, check our [new example](https://github.com/agentscope-ai/agentscope/tree/main/examples/agent/voice_agent) and [roadmap](https://github.com/agentscope-ai/agentscope/issues/773).
 - **[2025-09]** A new powerful 📋**Plan** module is online now! Check out the [tutorial](https://doc.agentscope.io/tutorial/task_plan.html) for more details.
 - **[2025-09]** **AgentScope Runtime** is open-sourced now! Enabling effective agent deployment with sandboxed tool execution for production-ready AI applications. Check out the [GitHub repo](https://github.com/agentscope-ai/agentscope-runtime).
 - **[2025-09]** **AgentScope Studio** is open-sourced now! Check out the [GitHub repo](https://github.com/agentscope-ai/agentscope-studio).
@@ -335,6 +336,7 @@ as_studio
   - [Long-Term Memory](https://doc.agentscope.io/tutorial/task_long_term_memory.html)
   - [Agent](https://doc.agentscope.io/tutorial/task_agent.html)
   - [Pipeline](https://doc.agentscope.io/tutorial/task_pipeline.html)
+  - [Plan](https://doc.agentscope.io/tutorial/task_plan.html)
   - [State/Session Management](https://doc.agentscope.io/tutorial/task_state.html)
   - [Agent Hooks](https://doc.agentscope.io/tutorial/task_hook.html)
   - [MCP](https://doc.agentscope.io/tutorial/task_mcp.html)
@@ -361,7 +363,7 @@ as_studio
     - [Structured Output](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/structured_output)
     - [Long-Term Memory](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/long_term_memory)
     - [Session with SQLite](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/session_with_sqlite)
-    - [Audio Response](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/audio_response)
+    - [Voice Agent](https://github.com/agentscope-ai/agentscope/tree/main/examples/agent/voice_agent)
 
 ## ⚖️ License
 
diff --git a/README_zh.md b/README_zh.md
@@ -104,6 +104,7 @@ AgentScope v1.0 新功能概览:
 | ...        |                                        |                                                                               |
 
 ## 📢 新闻
+- **[2025-09]** **Voice agent** 已上线！`ReActAgent` 已原生支持 Qwen-Omni 和 GPT-Audio 系列模型，欢迎查看 [样例](https://github.com/agentscope-ai/agentscope/tree/main/examples/agent/voice_agent) 和 [roadmap](https://github.com/agentscope-ai/agentscope/issues/773)。
 - **[2025-09]** 一个全新功能强大的 📋**Plan** 模块已经上线 AgentScope！查看[文档](https://doc.agentscope.io/zh_CN/tutorial/task_plan.html)了解更多详情。
 - **[2025-09]** **AgentScope Runtime** 现已开源！支持沙盒化工具执行的高效智能体部署，助力打造生产级AI应用。查看 [GitHub 仓库](https://github.com/agentscope-ai/agentscope-runtime)。
 - **[2025-09]** **AgentScope Studio** 现已开源！查看 [GitHub 仓库](https://github.com/agentscope-ai/agentscope-studio)。
@@ -337,6 +338,7 @@ as_studio
   - [长期记忆](https://doc.agentscope.io/zh_CN/tutorial/task_long_term_memory.html)
   - [智能体](https://doc.agentscope.io/zh_CN/tutorial/task_agent.html)
   - [管道（Pipeline）](https://doc.agentscope.io/zh_CN/tutorial/task_pipeline.html)
+  - [计划](https://doc.agentscope.io/zh_CN/tutorial/task_plan.html)
   - [状态/会话管理](https://doc.agentscope.io/zh_CN/tutorial/task_state.html)
   - [智能体钩子函数](https://doc.agentscope.io/zh_CN/tutorial/task_hook.html)
   - [MCP](https://doc.agentscope.io/zh_CN/tutorial/task_mcp.html)
@@ -363,7 +365,7 @@ as_studio
     - [结构化输出](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/structured_output)
     - [长期记忆](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/long_term_memory)
     - [基于DB的会话管理](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/session_with_sqlite)
-    - [语音回复](https://github.com/agentscope-ai/agentscope/tree/main/examples/functionality/audio_response)
+    - [语音智能体](https://github.com/agentscope-ai/agentscope/tree/main/examples/agent/voice_agent)
 
 ## ⚖️ 许可
 
diff --git a/examples/agent/voice_agent/README.md b/examples/agent/voice_agent/README.md
@@ -0,0 +1,20 @@
+# Voice Agent
+
+> This is experimental functionality in AgentScope.
+
+This example demonstrates how to create a voice agent using AgentScope with Qwen-Omni model, featuring both text and audio output capabilities.
+
+> **Note**:
+>  - Qwen-Omni may not generate tool calls when the audio output is enabled.
+>  - This example supports DashScope `Qwen-Omni` and OpenAI `GPT-4o Audio` models. You can change model by modifying the `model` parameter in `main.py`.
+>  - We haven't tested vLLM yet. Contributions are welcome!
+
+## Quick Start
+
+Ensure you have installed agentscope and set ``DASHSCOPE_API_KEY`` in your environment variables.
+
+Run the following commands to set up and run the example:
+
+```bash
+python main.py
+```
diff --git a/examples/agent/voice_agent/main.py b/examples/agent/voice_agent/main.py
diff --git a/examples/functionality/audio_response/README.md b/examples/functionality/audio_response/README.md
diff --git a/src/agentscope/agent/_agent_base.py b/src/agentscope/agent/_agent_base.py
@@ -12,7 +12,14 @@
 from ._agent_meta import _AgentMeta
 from .._logging import logger
 from ..module import StateModule
-from ..message import Msg, ContentBlock
+from ..message import (
+    Msg,
+    AudioBlock,
+    ToolUseBlock,
+    ToolResultBlock,
+    ImageBlock,
+    VideoBlock,
+)
 from ..types import AgentHookTypes
 
 
@@ -146,7 +153,9 @@ def __init__(self) -> None:
         self._instance_pre_observe_hooks = OrderedDict()
         self._instance_post_observe_hooks = OrderedDict()
 
-        # The prefix used in streaming printing
+        # The prefix used in streaming printing, which will save the
+        # accumulated text and audio streaming data for each message id.
+        # e.g. {"text": "xxx", "audio": (stream_obj, "{base64_data}")}
         self._stream_prefix = {}
 
         # The subscribers that will receive the reply message by their
@@ -158,9 +167,6 @@ def __init__(self) -> None:
         # output of the agent, e.g., in a production environment.
         self._disable_console_output: bool = False
 
-        # Audio related
-        self._audio_infra = {}
-
     async def observe(self, msg: Msg | list[Msg] | None) -> None:
         """Receive the given message(s) without generating a reply.
 
@@ -194,99 +200,172 @@ async def print(self, msg: Msg, last: bool = True) -> None:
         if self._disable_console_output:
             return
 
-        audio_prefix_name = msg.id + "-audio"
+        # The accumulated textual content to print, including the text blocks
+        # and the thinking blocks
         thinking_and_text_to_print = []
 
         for block in msg.get_content_blocks():
             if block["type"] == "audio":
-                self._process_audio_block(block, audio_prefix_name)
-            elif block["type"] in ["text", "thinking"]:
-                self._process_text_block(
-                    block,
-                    msg,
-                    thinking_and_text_to_print,
+                self._process_audio_block(msg.id, block)
+
+            elif block["type"] == "text":
+                self._print_text_block(
+                    msg.id,
+                    name_prefix=msg.name,
+                    text_content=block["text"],
+                    thinking_and_text_to_print=thinking_and_text_to_print,
+                )
+
+            elif block["type"] == "thinking":
+                self._print_text_block(
+                    msg.id,
+                    name_prefix=f"{msg.name}(thinking)",
+                    text_content=block["thinking"],
+                    thinking_and_text_to_print=thinking_and_text_to_print,
                 )
-            elif last:
-                self._process_last_block(block, msg)
 
-        if last:
-            self._cleanup_resources(msg, audio_prefix_name)
+            elif last:
+                self._print_last_block(block, msg)
+
+        # Clean up resources if this is the last message in streaming
+        if last and msg.id in self._stream_prefix:
+            if "audio" in self._stream_prefix[msg.id]:
+                player, _ = self._stream_prefix[msg.id]["audio"]
+                # Close the miniaudio player
+                player.close()
+            stream_prefix = self._stream_prefix.pop(msg.id)
+            if "text" in stream_prefix and not stream_prefix["text"].endswith(
+                "\n",
+            ):
+                print()
 
     def _process_audio_block(
         self,
-        block: ContentBlock,
-        audio_prefix_name: str,
+        msg_id: str,
+        audio_block: AudioBlock,
     ) -> None:
         """Process audio block content.
 
         Args:
-            block: The audio content block
-            audio_prefix_name: Unique identifier for the audio stream
+            msg_id (`str`):
+                The unique identifier of the message
+            audio_block (`AudioBlock`):
+                The audio content block
         """
-        audio_prefix = self._stream_prefix.get(audio_prefix_name, "")
-        if not audio_prefix:
-            self._initialize_audio_stream()
+        if "source" not in audio_block:
+            raise ValueError(
+                "The audio block must contain the 'source' field.",
+            )
 
-        wav_bytes = base64.b64decode(
-            block["source"]["data"][len(audio_prefix) :],
-        )
-        audio_np = np.frombuffer(wav_bytes, dtype=np.int16)
-        self._audio_infra["stream"].write(audio_np.tobytes())
-        self._stream_prefix[audio_prefix_name] = block["source"]["data"]
-
-    def _initialize_audio_stream(self) -> None:
-        """Initialize PyAudio stream with default settings."""
-        import pyaudio
-
-        self._audio_infra["pyaudio"] = pyaudio.PyAudio()
-        # TODO: make it configurable
-        self._audio_infra["stream"] = self._audio_infra["pyaudio"].open(
-            format=pyaudio.paInt16,
-            channels=1,
-            rate=24000,
-            output=True,
-        )
+        if audio_block["source"]["type"] == "url":
+            # TODO: maybe download and play the audio from the URL?
+            print(json.dumps(audio_block, indent=4, ensure_ascii=False))
+
+        elif audio_block["source"]["type"] == "base64":
+            data = audio_block["source"]["data"]
+
+            if msg_id not in self._stream_prefix:
+                self._stream_prefix[msg_id] = {}
+
+            audio_prefix = self._stream_prefix[msg_id].get("audio", None)
+
+            import sounddevice as sd
+
+            # The player and the prefix data is cached for streaming audio
+            if audio_prefix:
+                player, audio_prefix_data = audio_prefix
+            else:
+                player = sd.OutputStream(
+                    samplerate=24000,
+                    channels=1,
+                    dtype=np.float32,
+                    blocksize=1024,
+                    latency="low",
+                )
+                player.start()
+                audio_prefix_data = ""
+
+            # play the audio data
+            new_audio_data = data[len(audio_prefix_data) :]
+            if new_audio_data:
+                audio_bytes = base64.b64decode(new_audio_data)
+                audio_np = np.frombuffer(audio_bytes, dtype=np.int16)
+                audio_float = audio_np.astype(np.float32) / 32768.0
+
+                # Write to the audio output stream
+                player.write(audio_float)
+
+            # save the player and the prefix data
+            self._stream_prefix[msg_id]["audio"] = (
+                player,
+                data,
+            )
+
+        else:
+            raise ValueError(
+                "Unsupported audio source type: "
+                f"{audio_block['source']['type']}",
+            )
 
-    def _process_text_block(
+    def _print_text_block(
         self,
-        block: ContentBlock,
-        msg: Msg,
-        thinking_and_text_to_print: list,
+        msg_id: str,
+        name_prefix: str,
+        text_content: str,
+        thinking_and_text_to_print: list[str],
     ) -> None:
-        """Process text and thinking blocks.
+        """Print the text block and thinking block content.
 
         Args:
-            block: The text content block
-            msg: Message object containing the block
-            thinking_and_text_to_print: List to store formatted text
+            msg_id (`str`):
+                The unique identifier of the message
+            name_prefix (`str`):
+                The prefix for the message, e.g. "{name}: " for text block and
+                "{name}(thinking): " for thinking block.
+            text_content (`str`):
+                The textual content to be printed.
+            thinking_and_text_to_print (`list[str]`):
+                A list of textual content to be printed together. Here we
+                gather the text and thinking blocks to print them together.
         """
-        block_type = block["type"]
-        format_prefix = "" if block_type == "text" else "(thinking)"
-
         thinking_and_text_to_print.append(
-            f"{msg.name}{format_prefix}: {block[block_type]}",
+            f"{name_prefix}: {text_content}",
         )
-
+        # The accumulated text and thinking blocks to print
         to_print = "\n".join(thinking_and_text_to_print)
-        prefix = self._stream_prefix.get(msg.id, "")
 
-        if len(to_print) > len(prefix):
-            print(to_print[len(prefix) :], end="")
-            self._stream_prefix[msg.id] = to_print
+        # The text prefix that has been printed
+        if msg_id not in self._stream_prefix:
+            self._stream_prefix[msg_id] = {}
+
+        text_prefix = self._stream_prefix[msg_id].get("text", "")
+
+        # Only print when there is new text content
+        if len(to_print) > len(text_prefix):
+            print(to_print[len(text_prefix) :], end="")
 
-    def _process_last_block(self, block: ContentBlock, msg: Msg) -> None:
-        """Process blocks of types other than audio, text, or thinking.
+            # Save the printed text prefix
+            self._stream_prefix[msg_id]["text"] = to_print
+
+    def _print_last_block(
+        self,
+        block: ToolUseBlock | ToolResultBlock | ImageBlock | VideoBlock,
+        msg: Msg,
+    ) -> None:
+        """Process and print the last content block, and the block type
+        is not audio, text, or thinking.
 
         Args:
-            block: The content block
-            msg: Message object containing the block
+            block (`ToolUseBlock | ToolResultBlock | ImageBlock | VideoBlock`):
+                The content block to be printed
+            msg (`Msg`):
+                The message object
         """
-        if block["type"] == "audio":
-            return
+        text_prefix = self._stream_prefix.get(msg.id, {}).get("text", "")
 
-        prefix = self._stream_prefix.get(msg.id, "")
-        if prefix:
-            print_newline = "" if prefix.endswith("\n") else "\n"
+        if text_prefix:
+            # Add a newline to separate from previous text content
+            print_newline = "" if text_prefix.endswith("\n") else "\n"
             print(
                 f"{print_newline}"
                 f"{json.dumps(block, indent=4, ensure_ascii=False)}",
@@ -297,32 +376,6 @@ def _process_last_block(self, block: ContentBlock, msg: Msg) -> None:
                 f" {json.dumps(block, indent=4, ensure_ascii=False)}",
             )
 
-    def _cleanup_resources(self, msg: Msg, audio_prefix_name: str) -> None:
-        """Clean up audio resources and handle final printing.
-
-        Args:
-            msg: Message object
-            audio_prefix_name: Identifier for the audio stream
-        """
-        if audio_prefix_name in self._stream_prefix:
-            self._cleanup_audio_resources()
-
-        if msg.id in self._stream_prefix:
-            last_prefix = self._stream_prefix.pop(msg.id)
-            if not last_prefix.endswith("\n"):
-                print()
-
-    def _cleanup_audio_resources(self) -> None:
-        """Clean up audio stream and PyAudio resources."""
-        if "stream" in self._audio_infra:
-            self._audio_infra["stream"].stop_stream()
-            self._audio_infra["stream"].close()
-
-        if "pyaudio" in self._audio_infra:
-            self._audio_infra["pyaudio"].terminate()
-
-        self._audio_infra.clear()
-
     async def __call__(self, *args: Any, **kwargs: Any) -> Msg:
         """Call the reply function with the given arguments."""
         self._reply_id = shortuuid.uuid()