# Voice Implementation

*HUMA-0.1 only*

## Overview

HUMA voice mode enables real-time spoken conversations between users and agents. Users speak into their microphone; the agent listens, thinks, and responds with natural speech, with support for interruptions and multi-party rooms.
- Real-time speech recognition with automatic end-of-turn detection
- High-quality text-to-speech with customizable voices
- Natural interruptions: users can cut the agent off mid-speech
## How It Works

1. **Create a voice-enabled agent.** Set `voice.enabled: true` and `routerType: 'turn-taking'` in the agent metadata.
2. **Connect via WebSocket.** Connect to HUMA with Socket.IO using your agent ID and API key.
3. **Join a Daily.co room.** Send `join-daily-room` with a room URL. The agent joins automatically and starts listening.
4. **Talk.** The agent hears your speech, processes it through the AI pipeline, and responds via text-to-speech in the room.
## Agent Setup

Voice requires two things in your agent metadata: `voice.enabled: true` and `routerType: 'turn-taking'`.
```typescript
const agent = await fetch(`${API_URL}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Assistant',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly and helpful voice assistant.',
      instructions: 'Keep responses concise, 1-2 sentences. Speak naturally.',
      routerType: 'turn-taking',
      tools: [],
      voice: {
        enabled: true,
        voiceId: 'EXAVITQu4vr4xnSDxMaL' // Optional: ElevenLabs voice ID
      }
    }
  })
}).then(r => r.json());
```

- `voice.enabled`: Required. Must be `true`. Without this, the agent cannot join voice rooms.
- `routerType`: Required. Must be `'turn-taking'` for voice mode. This enables the `speak` tool and voice-aware turn management.
- `voice.voiceId`: Optional. ElevenLabs voice ID for text-to-speech. Browse voices at elevenlabs.io/voice-library.
The `speak` tool is added automatically in voice mode. You don't need to define it; just add any custom tools your agent needs.
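As an illustration, a metadata block that relies on the automatic `speak` tool and adds one custom tool of your own could look like the sketch below. The `get_weather` tool and its `parameters` shape are hypothetical placeholders, not part of the HUMA API; check the tools documentation for the actual schema.

```typescript
// Sketch: voice metadata with one custom tool. 'speak' is NOT listed
// because HUMA injects it automatically in voice mode.
// 'get_weather' and its parameter schema are hypothetical examples.
const metadata = {
  className: 'Assistant',
  personality: 'Friendly and helpful voice assistant.',
  instructions: 'Keep responses concise, 1-2 sentences.',
  routerType: 'turn-taking', // required for voice
  tools: [
    {
      name: 'get_weather',
      description: 'Look up the current weather for a city.',
      parameters: { city: 'string' }
    }
  ],
  voice: { enabled: true }
};

console.log(metadata.tools.map(t => t.name)); // lists only your custom tools
```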
## Connection & Room Join

### 1. Connect WebSocket

```typescript
import { io } from 'socket.io-client';

const socket = io(API_URL, {
  query: { agentId: agent.id, apiKey: API_KEY },
  transports: ['websocket']
});

socket.on('connect', () => {
  console.log('Connected to HUMA');
});
```

### 2. Join Daily.co room (client-side audio)

```typescript
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createCallObject({
  audioSource: true,
  videoSource: false, // Voice only
});

await callFrame.join({ url: roomUrl });
```

### 3. Tell HUMA to join the same room

```typescript
socket.emit('message', {
  type: 'join-daily-room',
  roomUrl: 'https://your-domain.daily.co/room-name'
});
```

### 4. Leave room

```typescript
socket.emit('message', { type: 'leave-daily-room' });
await callFrame.leave();
```

## Events Reference
All events are received on the event channel via Socket.IO:
| Event Type | Description | Key Fields |
|---|---|---|
| `voice-status` | Agent voice state changed | `status: 'joined' \| 'left' \| 'error'`, `roomUrl?`, `error?` |
| `transcript` | Speech was transcribed | `text`, `isFinal`, `speaker` |
| `speak-status` | Agent speech state changed | `status: 'started' \| 'finished' \| 'interrupted' \| 'failed'`, `commandId` |
| `vad` | Voice activity detection | `isSpeaking`, `speaker` |
```typescript
socket.on('event', (event) => {
  switch (event.type) {
    case 'voice-status':
      if (event.status === 'joined') {
        console.log('Agent joined room:', event.roomUrl);
      } else if (event.status === 'error') {
        console.error('Voice error:', event.error);
      }
      break;
    case 'transcript':
      if (event.isFinal) {
        console.log(`${event.speaker}: ${event.text}`);
      }
      break;
    case 'speak-status':
      if (event.status === 'started') {
        // Show speaking indicator
      } else if (event.status === 'finished') {
        // Back to listening
      } else if (event.status === 'interrupted') {
        // User interrupted the agent
      }
      break;
  }
});
```

## Full Example
```typescript
import { io } from 'socket.io-client';
import DailyIframe from '@daily-co/daily-js';

const API_URL = 'https://api.humalike.tech';
const API_KEY = 'ak_your_api_key';
const roomUrl = 'https://your-domain.daily.co/room-name';

// 1. Create voice-enabled agent
const agent = await fetch(`${API_URL}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Assistant',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly voice assistant.',
      instructions: 'Respond concisely in 1-2 sentences.',
      routerType: 'turn-taking',
      tools: [],
      voice: { enabled: true }
    }
  })
}).then(r => r.json());

// 2. Connect WebSocket
const socket = io(API_URL, {
  query: { agentId: agent.id, apiKey: API_KEY },
  transports: ['websocket']
});

// 3. Handle events
socket.on('event', (event) => {
  switch (event.type) {
    case 'voice-status':
      console.log('Voice:', event.status, event.roomUrl || '');
      break;
    case 'transcript':
      if (event.isFinal) console.log(`${event.speaker}: ${event.text}`);
      break;
    case 'speak-status':
      console.log('Speak:', event.status);
      break;
  }
});

// 4. Join Daily.co room (client audio)
const callFrame = DailyIframe.createCallObject({
  audioSource: true,
  videoSource: false,
});
await callFrame.join({ url: roomUrl });

// 5. Tell HUMA to join the same room
socket.emit('message', {
  type: 'join-daily-room',
  roomUrl: roomUrl
});

// 6. Cleanup
function disconnect() {
  socket.emit('message', { type: 'leave-daily-room' });
  callFrame.leave();
  socket.disconnect();
}
```

## Best Practices
### Keep responses short

Set instructions like "respond in 1-2 sentences". Long responses feel unnatural in voice and increase latency.
### Use headphones during development

Without headphones, the microphone can pick up the agent's speech and create feedback loops. Daily.co has echo cancellation, but headphones are safest.
### Show visual feedback

Use `speak-status` events to show when the agent is speaking, thinking, or listening. The ~1s of silence between user speech and agent response needs a visual indicator.
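One way to drive that indicator is a small pure function that folds incoming events into a UI state. This is an illustrative sketch, not part of the SDK: the `UiState` names and the `nextUiState` helper are our own, built on the `speak-status` and `transcript` events described in the Events Reference.

```typescript
// Sketch: derive a UI state from HUMA events (names are our own).
type UiState = 'idle' | 'listening' | 'thinking' | 'speaking';

function nextUiState(
  current: UiState,
  event: { type: string; status?: string; isFinal?: boolean }
): UiState {
  if (event.type === 'speak-status') {
    if (event.status === 'started') return 'speaking';
    // finished, interrupted, or failed: go back to listening
    if (['finished', 'interrupted', 'failed'].includes(event.status ?? '')) {
      return 'listening';
    }
  }
  // A final transcript means the agent is now processing the user's turn.
  if (event.type === 'transcript' && event.isFinal) return 'thinking';
  return current; // ignore events that don't change the indicator
}
```

Feed every `socket.on('event', ...)` payload through this reducer and render a badge or waveform for the current state.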
### Request microphone permission early

Browsers require explicit microphone permission. Request it before joining the room to avoid a confusing UX.
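A common pattern is to probe for permission with the standard `getUserMedia` browser API before joining. This sketch (the `ensureMicPermission` helper is our own, not part of any SDK) requests audio and immediately releases the tracks, since Daily.co opens its own stream on join:

```typescript
// Sketch: prompt for mic permission up front (browser-only).
// The MediaDevicesLike type lets tests inject a fake implementation.
type MediaDevicesLike = {
  getUserMedia(constraints: { audio: boolean }): Promise<{
    getTracks(): { stop(): void }[];
  }>;
};

async function ensureMicPermission(
  devices: MediaDevicesLike = (globalThis as any).navigator?.mediaDevices
): Promise<boolean> {
  try {
    const stream = await devices.getUserMedia({ audio: true });
    // Release the tracks immediately; Daily.co opens its own stream on join.
    stream.getTracks().forEach(t => t.stop());
    return true;
  } catch {
    return false; // permission denied, or no microphone available
  }
}

// Usage (in the browser): if (!(await ensureMicPermission())) showMicHelp();
```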
### Always clean up

Send `leave-daily-room` and disconnect the Daily.co call frame before closing. This ensures the agent leaves the room cleanly.