Conversation Dynamics¶

How messages flow between you, the browser, the hub, and the AI agent — and who controls what at each step.

The Players¶

Component	Role
Browser	Your interface. Records your voice, plays agent audio, shows status.
Hub	The traffic controller (port 3460). Routes audio between browser, STT/TTS services, and agents.
Whisper	Speech-to-text. Runs on your GPU. Turns your voice into text.
Kokoro	Text-to-speech. Runs on your GPU. Turns agent text into voice.
Agent	A Claude Code session running in tmux. Does the actual work.
MCP Server	Bridge between the agent and the hub. Exposes `converse()` as a tool the agent can call.

The Converse Cycle¶

Everything revolves around the converse() tool. The agent calls it to speak and (optionally) listen. Here's one full cycle:

The agent calls converse("Hello!", wait_for_response=true). The MCP server forwards this to the hub, which shows a thinking indicator in the browser, synthesizes the text via Kokoro TTS, and sends the MP3 to the browser to play. Once playback finishes, the hub signals the browser to start recording. The user's audio is sent back to the hub, transcribed by Whisper STT, and the resulting text is returned to the agent.

Who Controls What¶

This is the key insight: the agent drives the conversation, not the browser or the hub.

The hub doesn't decide when to listen or when to speak. It waits for the agent to call converse(). The browser doesn't decide when to record. It waits for the hub to send a listening signal. The flow is always:

Agent decides to speak → calls converse()
Hub orchestrates → TTS, audio delivery, STT
Result returns to agent → agent decides what to do next

The Gap Between Calls¶

After converse() returns the user's transcribed text, the agent goes off to do work — reading files, running commands, writing code. During this time:

The agent is not calling converse()
The hub has no visibility into what the agent is doing
The browser shows whatever the last status was

This gap is where status indicators can become inaccurate. The hub sent "done" after the last converse cycle completed, so the browser may show "Ready" — even though the agent is actively working.

Message Types¶

The hub communicates with the browser via WebSocket messages:

Message	Direction	Meaning
`thinking`	Hub → Browser	Agent received input, processing
`assistant_text`	Hub → Browser	Agent's text response (shown in chat)
`status`	Hub → Browser	Status bar update ("Speaking...", "Transcribing...")
`audio`	Hub → Browser	MP3 audio to play
`listening`	Hub → Browser	Ready for user to record
`done`	Hub → Browser	Converse cycle complete
`session_ended`	Hub → Browser	Agent said goodbye, session closing
`playback_done`	Browser → Hub	Audio finished playing
`audio`	Browser → Hub	User's recorded audio
`text`	Browser → Hub	User's typed message

The `wait_for_response` Parameter¶

The agent controls whether it wants to hear back:

wait_for_response=true (default): Full cycle. Agent speaks, then listens for user response. Returns the user's transcribed text.
wait_for_response=false: Fire-and-forget. Agent speaks but doesn't listen. The hub plays the audio and moves on. Used for status updates, acknowledgments, or goodbye messages.

The `goodbye` Parameter¶

When the agent is done with the conversation:

goodbye=true: After the audio plays, the hub sends session_ended to the browser. The browser auto-closes the session after 3 seconds.
goodbye=false (default): Normal operation. Session stays open.

State Timeline¶

Here's what happens during a typical interaction, showing what each component is doing:

Time    Agent               Hub                 Browser
─────   ─────               ───                 ───────
  0     calls converse()    receives request     idle
  1     waiting...          sends "thinking"     shows "Thinking..."
  2     waiting...          calls Kokoro TTS     shows "Thinking..."
  3     waiting...          sends audio          plays audio
  4     waiting...          waits for playback   playing...
  5     waiting...          receives done        sends playback_done
  6     waiting...          sends "listening"    shows "Listening..."
  7     waiting...          waiting for audio    recording...
  8     waiting...          receives audio       shows "Processing..."
  9     waiting...          calls Whisper STT    shows "Processing..."
 10     gets text back      sends "done"         shows "Ready"
 11     reading files...    idle                 shows "Ready" ← gap
 12     running commands... idle                 shows "Ready" ← gap
 13     writing code...     idle                 shows "Ready" ← gap
 14     calls converse()    receives request     cycle repeats

Notice the gap from time 11–13: the agent is working, but the browser shows "Ready" because the hub has no way to know the agent is still active.

Design Implications¶

The agent-driven model means:

The agent must always call converse() to keep the conversation going. If the agent stops calling converse(), the user has no way to communicate.
Status accuracy depends on the converse cycle. Between cycles, the hub can only guess at the agent's state.
Audio queuing exists but is hidden. If the user records audio while the agent is busy, it gets queued in the hub and delivered on the next converse(wait_for_response=true) call. But the browser doesn't show this queue.
The agent could report its own status. A future report_status() tool could let the agent tell the hub what it's doing between converse cycles, making the status indicators accurate.