Hi team! First, I want to commend you on this impressive work.
I noticed a potential discrepancy in the implementation that I'd like to clarify. The README states that "Stream-Omni can offer users a seamless 'see-while-hear' experience." However, the current `/token-to-speech` API implementation appears to be offline: users only receive audio after the entire token generation completes, rather than incrementally while tokens are still being generated.
To better align with the documented "see-while-hear" capability, could you either:
- Provide a demo of audio generated as a stream, in sync with token generation (like LLaMA-Omni2), or
- Share guidance on implementing true real-time audio streaming?
This would help clarify the expected user experience. Thanks for your time and consideration!
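
To make the distinction concrete, here is a minimal sketch of the two behaviors I have in mind. All names here (`offline_tts`, `streaming_tts`, `fake_synthesize`, `chunk_size`) are hypothetical and chosen for illustration only; none of this is Stream-Omni's actual API, and the stand-in synthesizer just encodes text as bytes.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical type: turns a list of tokens into an audio payload.
Synthesizer = Callable[[List[str]], bytes]


def offline_tts(tokens: Iterable[str], synthesize: Synthesizer) -> bytes:
    """Offline behavior: consume every token first, then synthesize once."""
    return synthesize(list(tokens))


def streaming_tts(
    tokens: Iterable[str], synthesize: Synthesizer, chunk_size: int = 4
) -> Iterator[bytes]:
    """Streaming behavior: yield audio as soon as chunk_size tokens arrive."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            yield synthesize(buffer)
            buffer = []
    if buffer:
        # Flush whatever is left once token generation ends.
        yield synthesize(buffer)


def fake_synthesize(chunk: List[str]) -> bytes:
    """Stand-in for a real vocoder (assumption): joins tokens into bytes."""
    return " ".join(chunk).encode("utf-8")


if __name__ == "__main__":
    tokens = ["a", "b", "c", "d", "e"]
    print(offline_tts(tokens, fake_synthesize))
    for audio_chunk in streaming_tts(tokens, fake_synthesize, chunk_size=2):
        print(audio_chunk)
```

With the streaming variant, a client could start playback after the first chunk instead of waiting for the full response, which is the experience I understood "see-while-hear" to describe.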