iOS Development¶
You are building a native iOS app that connects to the ClawMux hub as a client, replacing the browser UI.
How It Works¶
The iOS app connects to the same hub WebSocket as the browser. The hub doesn't care what client is connected. Your job is to implement the same state machine and audio handling, but in Swift/SwiftUI.
Reference Docs¶
Read these before writing any code:
| Document | What you'll learn |
|---|---|
| WebSocket Protocol | Every message type, the converse flow sequence, session object schema, REST API |
| UI Behavior | All states, button behaviors, toggles, audio handling |
| Hub Architecture | How the hub works, what it expects from clients |
Key Files¶
The app has two main source files:
ios/VoiceHub/VoiceHubViewModel.swift # All state: WebSocket, audio, recording, Live Activity
ios/VoiceHub/ContentView.swift # All UI: voice grid, session view, settings, debug
Supporting files:
ios/project.yml # XcodeGen project definition
ios/VoiceHub/Info.plist # Background modes, permissions, URL schemes
ios/VoiceHubShared/VoiceHubActivityAttributes.swift # ActivityKit (shared with widget)
ios/VoiceHubWidget/VoiceHubLiveActivity.swift # Dynamic Island + Lock Screen
Build & Deploy Workflow¶
The development cycle is: edit code, build, install, launch.
# Build (from ios/ directory)
xcodebuild -project VoiceHub.xcodeproj -scheme VoiceHub \
-destination 'id=DEVICE_ID' 2>&1 | grep -E '(BUILD|error:)'
# Install
xcrun devicectl device install app --device DEVICE_ID \
~/Library/Developer/Xcode/DerivedData/VoiceHub-*/Build/Products/Debug-iphoneos/VoiceHub.app
# Launch (phone must be unlocked)
xcrun devicectl device process launch --device DEVICE_ID com.zeul.voicehub
Find device ID with xcrun xctrace list devices.
After editing project.yml, regenerate with xcodegen generate before building.
SourceKit diagnostics: The Swift SourceKit language server reports false errors ("Cannot find type 'VoiceHubViewModel' in scope", etc.) due to stale indexing. Ignore these. Only trust xcodebuild output for real errors.
Architecture¶
Input Modes¶
The app supports three input modes, switchable via a mode pill in the session view:
- Auto - Mic opens automatically after agent speaks, VAD stops recording on silence
- PTT - Hold-to-talk with 4-direction drag gestures (see PTT Gestures below)
- Typing - Text-only, no audio
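As a minimal sketch, the three modes could be modeled as a Swift enum with a few derived flags. The type and property names here are illustrative assumptions, not taken from VoiceHubViewModel.swift; the flags only restate what this page says about each mode.

```swift
/// The three input modes described above. Names are hypothetical.
enum InputMode: String, CaseIterable {
    case auto, ptt, typing

    /// Typing is text-only; the other two modes record audio.
    var usesMicrophone: Bool { self != .typing }

    /// Only Auto mode relies on VAD to stop recording on silence.
    var usesVAD: Bool { self == .auto }

    /// The Live Activity toggle exists for Auto and PTT only.
    var supportsLiveActivity: Bool { self != .typing }
}
```

Keeping these flags on the enum lets the mode pill and the per-mode settings pages query one source of truth instead of repeating mode checks in the view code.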
PTT Gestures¶
Four-direction gesture system on the mic button while recording:
| Gesture | Action | Audio sent to hub? |
|---|---|---|
| Swipe UP | Send audio immediately | Yes (as audio) |
| Swipe LEFT | Cancel recording, discard | No |
| Swipe RIGHT | Open keyboard with transcription | No (sent as text after editing) |
| Just release | Show inline transcript preview | No (user decides) |
Transcript preview is an intermediate state between the mic button and keyboard. After releasing:
- Shows transcription spinner, then the recognized text
- Tap the transcript text to open keyboard and edit
- Tap send button to send as text immediately
- Tap X to dismiss and return to normal mic
- Press mic again to discard preview and start new recording
Keyboard return: Dismissing the keyboard (X button) returns to transcript preview if text exists, rather than fully dismissing.
Direction hints during recording: Left shows "Cancel" label, right shows "Aa" keyboard hint, mic label updates to reflect current drag direction.
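The gesture table can be expressed as a pure function over the final drag translation, which keeps the decision testable without any UI. This is a sketch under stated assumptions: the type and function names are hypothetical, the 60-point threshold is an invented dead zone, and the tie-breaking rule (horizontal wins when the drag is at least as wide as it is tall) is a design choice, not the app's documented behavior.

```swift
/// Outcome of releasing the mic button, mirroring the gesture table. Names are hypothetical.
enum PTTGesture: Equatable {
    case sendAudio          // swipe up
    case cancel             // swipe left
    case openKeyboard       // swipe right
    case previewTranscript  // plain release
}

/// Classifies a release by its drag translation in points.
/// Uses SwiftUI's convention: +x is right, +y is down.
/// `threshold` is an assumed dead zone, not a value from the app.
func classifyRelease(dx: Double, dy: Double, threshold: Double = 60) -> PTTGesture {
    if abs(dx) >= abs(dy) {
        // Mostly horizontal drag: left cancels, right opens the keyboard.
        if dx <= -threshold { return .cancel }
        if dx >= threshold { return .openKeyboard }
    } else if dy <= -threshold {
        // Mostly vertical and upward (negative y): send immediately.
        return .sendAudio
    }
    // Short drags and downward drags count as a plain release.
    return .previewTranscript
}
```

In a SwiftUI `DragGesture`, the same function could be called during the drag to drive the direction hints and again on release to pick the action.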
Auto mode parallel transcription: When audio is sent to the hub (via swipe-up or auto mode), a parallel /api/transcribe call fires so the user can see what they said while waiting for the agent response. The preview is cleared when the hub echoes back user_text.
Audio Session¶
.playAndRecord, mode: .spokenAudio,
options: [.defaultToSpeaker, .allowBluetooth, .allowBluetoothA2DP, .allowBluetoothHFP, .mixWithOthers]
Background Keepalive¶
Dual-layer keepalive to prevent iOS suspension:
- AVAudioEngine with continuous input tap (primary - active audio processing)
- Silent AVAudioPlayer loop (secondary - ensures audio session stays active)
Both start on background entry, stop on foreground return. Interruption handler re-activates session and restarts engine on .ended.
Recording¶
Uses AVAudioRecorder (not AVAudioEngine) for actual recording: 16kHz, mono, 16-bit PCM to a temp file. VAD runs via a separate AVAudioEngine input tap that monitors RMS levels.
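To illustrate the RMS-based VAD, here is a minimal sketch of the level computation and a silence timer, written without AVFoundation so it can be tested in isolation. The threshold and silence duration are assumed values, and `SilenceDetector` is a hypothetical name; in the app the samples would come from the AVAudioEngine input tap.

```swift
/// Root-mean-square level of a buffer of samples in the range [-1, 1].
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Tracks how long the input has stayed below a level threshold. Hypothetical type.
struct SilenceDetector {
    var threshold: Float = 0.01        // assumed RMS threshold
    var requiredSilence: Double = 1.5  // assumed seconds of silence before stopping
    private(set) var silentFor: Double = 0
    private(set) var heardSpeech = false

    /// Feeds one buffer covering `duration` seconds.
    /// Returns true when recording should stop.
    mutating func process(_ samples: [Float], duration: Double) -> Bool {
        if rms(samples) >= threshold {
            heardSpeech = true
            silentFor = 0
        } else {
            silentFor += duration
        }
        // Never stop before the user has said anything.
        return heardSpeech && silentFor >= requiredSilence
    }
}
```

Requiring speech before the silence timer can fire avoids stopping the recording during the pause between the mic opening and the user starting to talk.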
Playback¶
TTS audio arrives as base64 MP3 via WebSocket audio messages. Decoded and played via AVAudioPlayer. Audio for non-active sessions is buffered and played on switch.
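The buffering rule for non-active sessions can be sketched as a small value type that decodes the base64 payload and either hands the clip back for immediate playback or queues it. `PlaybackBuffer` and its method names are hypothetical, and identifying sessions by a string ID is an assumption; the real message fields are defined in the WebSocket Protocol document.

```swift
import Foundation

/// Holds decoded TTS clips per session until that session becomes active. Hypothetical type.
struct PlaybackBuffer {
    private(set) var activeSession: String?
    private var pending: [String: [Data]] = [:]

    init(activeSession: String? = nil) {
        self.activeSession = activeSession
    }

    /// Decodes a base64 payload. Returns the clip if it belongs to the active
    /// session, or nil if it was buffered or could not be decoded.
    mutating func receive(base64 payload: String, session: String) -> Data? {
        guard let clip = Data(base64Encoded: payload) else { return nil }
        if session == activeSession { return clip }
        pending[session, default: []].append(clip)
        return nil
    }

    /// Makes `session` active and returns its buffered clips in arrival order.
    mutating func switchTo(_ session: String) -> [Data] {
        activeSession = session
        return pending.removeValue(forKey: session) ?? []
    }
}
```

Each returned `Data` value would then be passed to `AVAudioPlayer(data:)` for playback.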
Live Activity¶
ActivityKit with VoiceHubActivityAttributes. Started/updated/ended from ViewModel at state transitions. Per-mode toggle (auto and PTT only).
Connection Details¶
- WebSocket: wss://{hostname}:{port}/ws
- REST API: https://{hostname}:{port}/api/...
- Transcription: POST /api/transcribe - accepts raw audio bytes, returns {"text": "..."}
- Hub sends ping every 30s. On reconnect, hub sends session_list with full state.
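A minimal sketch of how a client might build these URLs and route incoming frames by message type. It assumes every WebSocket frame is a JSON object with a top-level "type" key; that field name is an assumption here, so check the WebSocket Protocol document for the actual schema. `HubEndpoints`, `MessageEnvelope`, and `messageType(of:)` are hypothetical names.

```swift
import Foundation

/// Builds hub URLs from the patterns listed above. Hypothetical type.
struct HubEndpoints {
    let hostname: String
    let port: Int

    var webSocket: URL? { URL(string: "wss://\(hostname):\(port)/ws") }
    var transcribe: URL? { URL(string: "https://\(hostname):\(port)/api/transcribe") }
}

/// Decodes only the message type. ASSUMPTION: frames carry a "type" key.
struct MessageEnvelope: Decodable {
    let type: String
}

/// Returns the message type of a text frame, or nil if it is not valid JSON
/// or has no "type" key.
func messageType(of frame: String) -> String? {
    guard let data = frame.data(using: .utf8) else { return nil }
    return (try? JSONDecoder().decode(MessageEnvelope.self, from: data))?.type
}
```

Decoding the type first and the full payload second keeps the receive loop tolerant of message types the app does not handle yet.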
Settings¶
Per-mode settings pages (Auto, PTT, Typing) with relevant toggles for each mode:
- Input controls (auto-record, VAD, interrupt, record-while-thinking)
- Sounds (thinking, listening cue, processing cue, session ready)
- Haptics (recording, playback, send, session events)
- Notifications (background agent response alerts)
- Live Activity toggle
Audio Format¶
- Playback: Hub sends base64-encoded MP3
- Recording: App records WAV (16kHz, mono, 16-bit PCM); the hub forwards it to Whisper, which accepts that format
- Transcription preview: POST /api/transcribe with raw audio bytes for PTT text preview
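Since the transcription endpoint returns {"text": "..."}, the preview can be modeled as a small piece of state that is filled from the response body and cleared when the hub echoes user_text. This is a sketch with hypothetical type and method names; the network call itself is left out.

```swift
import Foundation

/// Shape of the /api/transcribe response body.
struct TranscriptionResponse: Decodable {
    let text: String
}

/// Text shown to the user while waiting for the agent. Hypothetical type.
struct TranscriptPreview {
    private(set) var text: String?

    /// Call with the raw response body of POST /api/transcribe.
    /// A body that does not decode leaves the preview empty.
    mutating func transcriptionArrived(_ body: Data) {
        text = (try? JSONDecoder().decode(TranscriptionResponse.self, from: body))?.text
    }

    /// Call when the hub echoes back user_text.
    mutating func hubEchoedUserText() {
        text = nil
    }
}
```

Treating a malformed body as "no preview" means a transcription failure never blocks the audio that was already sent to the hub.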