Clarification Needed: Achieving True Real-time "See-While-Hear" Experience

Hi team! First, I want to commend you on this impressive work.  

I noticed a potential discrepancy between the documentation and the implementation that I'd like to clarify. The [README states](https://github.com/ictnlp/Stream-Omni/blob/4d38d4786e81250c222ee4f4f883dcfaa4096b73/README.md#L32) that **"Stream-Omni can offer users a seamless 'see-while-hear' experience."** However, the current [`/token-to-speech` API implementation](https://github.com/ictnlp/Stream-Omni/blob/4d38d4786e81250c222ee4f4f883dcfaa4096b73/CosyVoice/cosyvoice_worker.py#L40-L65) appears to be offline: users receive audio only after token generation has finished, rather than incrementally while tokens are still being generated.

To better align with the documented "see-while-hear" capability, could you either:  
1. Provide a demo of streaming audio generation that runs in step with token generation (like LLaMA-Omni2), or
2. Share guidance on implementing true real-time audio streaming?  

This would help clarify the expected user experience. Thanks for your time and consideration!  
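
To make the request concrete, here is a minimal sketch of the chunked behavior I have in mind. It is only an illustration, not Stream-Omni's or CosyVoice's actual API: `stream_audio`, `fake_synthesize`, and `chunk_size` are hypothetical names, and the synthesizer is a placeholder standing in for a real token-to-speech call. The point is that audio is yielded as soon as each small group of speech tokens is ready, instead of after the whole sequence.

```python
from typing import Callable, Iterable, Iterator, List


def stream_audio(
    tokens: Iterable[int],
    synthesize: Callable[[List[int]], bytes],
    chunk_size: int = 4,
) -> Iterator[bytes]:
    """Yield one audio chunk per `chunk_size` tokens, as soon as they arrive.

    `tokens` may be a live generator (e.g. a model still decoding), so the
    first audio chunk is available long before the last token exists.
    """
    buffer: List[int] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield synthesize(buffer)
            buffer = []
    # Flush whatever is left once the token stream ends.
    if buffer:
        yield synthesize(buffer)


def fake_synthesize(chunk: List[int]) -> bytes:
    """Placeholder synthesizer (assumption): one byte of 'audio' per token."""
    return bytes(t % 256 for t in chunk)


if __name__ == "__main__":
    for audio in stream_audio(range(10), fake_synthesize, chunk_size=4):
        # A real client would send each chunk to the player or over HTTP here.
        print(len(audio), "bytes ready")
```

In a real server, each yielded chunk could be written to a chunked HTTP response or a WebSocket, so that playback starts while the text is still being generated.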
