Streaming Sound (Server Authoritative)
Streaming speech is server-authoritative in the local CLI runtime. Frontend playback is presentational only and does not control turn release.
Overview
- Audio bytes are streamed through POST /audio.
- The server tracks per-stream metadata (stream_id, speaker, bytes, format, done flag).
- On the final chunk, the server computes the expected playback duration from PCM metadata.
- The server schedules completion and queues SpeechFinished/AudioEnded itself.
- The frontend no longer sends audio_done for authority.
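The bookkeeping described in this list can be sketched as a small in-memory tracker. This is only an illustration under stated assumptions: the class names StreamState and StreamTracker, the on_chunk and due_events methods, and the explicit now argument are all hypothetical and not taken from the runtime. The sketch also fixes the playback speed at 1.0 and uses the default PCM parameters.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    """Per-stream metadata the server tracks (field names are illustrative)."""
    stream_id: str
    speaker: str
    total_bytes: int = 0
    sample_rate_hz: int = 24000
    channels: int = 1
    bytes_per_sample: int = 2
    done: bool = False


class StreamTracker:
    """Hypothetical in-memory tracker; not the runtime's actual implementation."""

    def __init__(self):
        self.streams = {}    # stream_id -> StreamState
        self.scheduled = []  # (release_at, event_name, stream_id)

    def on_chunk(self, stream_id, speaker, chunk, done, now):
        state = self.streams.setdefault(stream_id, StreamState(stream_id, speaker))
        if state.done:
            return  # ignore chunks that arrive after the final one
        state.total_bytes += len(chunk)
        if done:
            state.done = True
            bytes_per_second = state.sample_rate_hz * state.channels * state.bytes_per_sample
            duration = state.total_bytes / bytes_per_second
            # The server, not the frontend, decides when the speech has finished.
            self.scheduled.append((now + duration, "SpeechFinished", stream_id))

    def due_events(self, now):
        """Pop and return (event_name, stream_id) pairs whose release time has passed."""
        ready = [(event, sid) for at, event, sid in self.scheduled if at <= now]
        self.scheduled = [item for item in self.scheduled if item[0] > now]
        return ready


tracker = StreamTracker()
tracker.on_chunk("s1", "agent_a", b"\x00" * 24000, done=False, now=0.0)
tracker.on_chunk("s1", "agent_a", b"\x00" * 24000, done=True, now=0.5)
print(tracker.due_events(now=1.0))  # nothing yet: 48000 bytes last 1.0 s, so release is at 1.5
print(tracker.due_events(now=1.5))  # the SpeechFinished event for "s1"
```

Passing the clock in as an argument keeps the sketch deterministic; a real server would presumably use a timer or event loop instead.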
Duration Math
For PCM streams:
duration_seconds = total_audio_bytes / (sample_rate_hz * channels * bytes_per_sample) / playback_speed
For the default local pipeline:
- sample_rate_hz = 24000
- channels = 1
- bytes_per_sample = 2
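As a worked example, the formula and defaults above can be written as a short function. The function name pcm_duration_seconds and its input validation are assumptions made for illustration; only the arithmetic comes from this section.

```python
def pcm_duration_seconds(
    total_audio_bytes,
    sample_rate_hz=24000,
    channels=1,
    bytes_per_sample=2,
    playback_speed=1.0,
):
    """Expected playback duration of a raw PCM stream, in seconds."""
    if total_audio_bytes < 0:
        raise ValueError("total_audio_bytes must be non-negative")
    bytes_per_second = sample_rate_hz * channels * bytes_per_sample
    if bytes_per_second <= 0 or playback_speed <= 0:
        raise ValueError("rate, channels, sample width, and speed must be positive")
    return total_audio_bytes / bytes_per_second / playback_speed


# Default pipeline: 24000 Hz * 1 channel * 2 bytes = 48000 bytes per second.
print(pcm_duration_seconds(96_000))                      # 2.0
print(pcm_duration_seconds(96_000, playback_speed=2.0))  # 1.0
```

With the defaults, every 48000 bytes of audio corresponds to one second of playback, and a higher playback_speed shortens the expected duration proportionally.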
Data Flow
API Contract (Local Runtime)
POST /audio
Request body fields:
- stream_id (string)
- seq (number)
- data (base64 PCM payload)
- done (boolean)
- speech_text (optional string, usually on final chunk)
- sample_rate_hz (optional number, default 24000)
- channels (optional number, default 1)
- bytes_per_sample (optional number, default 2)
- playback_speed (optional number, default 1.0)
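To illustrate how a client might assemble these fields, here is a minimal sketch that splits raw PCM bytes into a sequence of request bodies. Several details are assumptions rather than documented behavior: that the body is JSON, that seq starts at 0, the chunk size, and the helper name build_audio_requests. The optional PCM fields are omitted so the stated defaults would apply.

```python
import base64
import json


def build_audio_requests(stream_id, pcm, chunk_size, speech_text=""):
    """Split raw PCM into JSON-ready bodies for POST /audio (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Always emit at least one body so an empty stream still carries done=True.
    pieces = [pcm[i:i + chunk_size] for i in range(0, len(pcm), chunk_size)] or [b""]
    bodies = []
    for seq, piece in enumerate(pieces):  # assumption: seq counts up from 0
        is_last = seq == len(pieces) - 1
        body = {
            "stream_id": stream_id,
            "seq": seq,
            "data": base64.b64encode(piece).decode("ascii"),
            "done": is_last,
        }
        if is_last and speech_text:
            body["speech_text"] = speech_text  # usually sent on the final chunk
        bodies.append(body)
    return bodies


bodies = build_audio_requests("stream-1", b"\x00\x01" * 5, chunk_size=4, speech_text="hello")
print(len(bodies))            # 3 bodies for 10 bytes in chunks of 4
print(json.dumps(bodies[-1])) # the final body carries done=true and speech_text
```

Only the last body sets done to true, which is the point at which the server would compute the playback duration.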
GET /spectate/ws
Receives:
- audio_chunk events (for frontend playback)
- speech events
- playback_done events
- spectator state snapshots
No-Audio Mode
No-audio agents are supported by the same authority model:
- claim the turn
- finalize speech with total_audio_bytes = 0
- the server resolves the duration as zero and releases the turn immediately on the server path
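The zero-duration case can be checked with a short sketch. The helper name release_time is hypothetical, and the default of 48000 bytes per second is derived from the default pipeline values (24000 Hz, 1 channel, 2 bytes per sample).

```python
def release_time(now, total_audio_bytes, bytes_per_second=48000, playback_speed=1.0):
    """Hypothetical helper: the time at which the server would release the turn."""
    duration = total_audio_bytes / bytes_per_second / playback_speed
    return now + duration


now = 100.0
print(release_time(now, total_audio_bytes=0) == now)  # True: no audio means no wait
print(release_time(now, total_audio_bytes=48_000))    # 101.0: one second of audio
```

Because the same formula handles both cases, a no-audio agent needs no special code path on the server.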
Why This Improves Roblox Parity
- Turn ownership/release is server-authoritative.
- Client playback status is not authoritative.
- Completion timing is deterministic from server-known stream data.
- Transport remains explicit, but authority boundaries match Roblox-style server control.