# Voice Implementation

*HUMA-0.1 only*

## Overview

HUMA voice mode enables real-time spoken conversations between users and agents. Users speak into their microphone; the agent listens, thinks, and responds with natural speech, with support for interruptions and multi-party rooms.
- Real-time speech recognition with automatic end-of-turn detection
- High-quality text-to-speech with customizable voices
- Natural interruptions: users can cut the agent off mid-speech
## How It Works

1. **Create a voice-enabled agent.** Set `voice.enabled: true` and `routerType: 'turn-taking'` in the agent metadata.
2. **Connect via WebSocket.** Connect to HUMA with Socket.IO using your agent ID and API key.
3. **Join a Daily.co room.** Send `join-daily-room` with a room URL. The agent joins automatically and starts listening.
4. **Talk.** The agent hears your speech, processes it through the AI pipeline, and responds via text-to-speech in the room.
## Agent Setup

Voice requires two things in your agent metadata: `voice.enabled: true` and `routerType: 'turn-taking'`.
```typescript
const agent = await fetch(`${API_URL}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Assistant',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly and helpful voice assistant.',
      instructions: 'Keep responses concise, 1-2 sentences. Speak naturally.',
      routerType: 'turn-taking',
      tools: [],
      voice: {
        enabled: true,
        voiceId: 'EXAVITQu4vr4xnSDxMaL' // Optional: ElevenLabs voice ID
      }
    }
  })
}).then(r => r.json());
```

- `voice.enabled`: Required. Must be `true`. Without this, the agent cannot join voice rooms.
- `routerType`: Required. Must be `'turn-taking'` for voice mode. This enables the `speak` tool and voice-aware turn management.
- `voice.voiceId`: Optional. ElevenLabs voice ID for text-to-speech. Browse voices at elevenlabs.io/voice-library.
The `speak` tool is added automatically in voice mode. You don't need to define it; just add any custom tools your agent needs.
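As an illustration, a metadata block that relies on the automatic `speak` tool and adds one custom tool of your own could look like the sketch below. The `get_weather` tool and its `parameters` shape are hypothetical placeholders, not part of the HUMA API; check the tools documentation for the actual schema.

```typescript
// Sketch: voice metadata with one custom tool. 'speak' is NOT listed
// because HUMA injects it automatically in voice mode.
// 'get_weather' and its parameter schema are hypothetical examples.
const metadata = {
  className: 'Assistant',
  personality: 'Friendly and helpful voice assistant.',
  instructions: 'Keep responses concise, 1-2 sentences.',
  routerType: 'turn-taking', // required for voice
  tools: [
    {
      name: 'get_weather',
      description: 'Look up the current weather for a city.',
      parameters: { city: 'string' }
    }
  ],
  voice: { enabled: true }
};

console.log(metadata.tools.map(t => t.name)); // lists only your custom tools
```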
## Connection & Room Join

### 1. Connect WebSocket

```typescript
import { io } from 'socket.io-client';

const socket = io(API_URL, {
  query: { agentId: agent.id, apiKey: API_KEY },
  transports: ['websocket']
});

socket.on('connect', () => {
  console.log('Connected to HUMA');
});
```

### 2. Join Daily.co room (client-side audio)

```typescript
import DailyIframe from '@daily-co/daily-js';

const callFrame = DailyIframe.createCallObject({
  audioSource: true,
  videoSource: false, // Voice only
});

await callFrame.join({ url: roomUrl });
```

### 3. Tell HUMA to join the same room

```typescript
socket.emit('message', {
  type: 'join-daily-room',
  roomUrl: 'https://your-domain.daily.co/room-name'
});
```

### 4. Leave room

```typescript
socket.emit('message', { type: 'leave-daily-room' });
await callFrame.leave();
```

## Events Reference
All events are received on the event channel via Socket.IO:
| Event Type | Description | Key Fields |
|---|---|---|
| `voice-status` | Agent voice state changed | `status: 'joined' \| 'left' \| 'error'`, `roomUrl?`, `error?` |
| `transcript` | Speech was transcribed | `text`, `isFinal`, `speaker` |
| `speak-status` | Agent speech state changed | `status: 'started' \| 'finished' \| 'interrupted' \| 'failed'`, `commandId` |
| `vad` | Voice activity detection | `isSpeaking`, `speaker` |
```typescript
socket.on('event', (event) => {
  switch (event.type) {
    case 'voice-status':
      if (event.status === 'joined') {
        console.log('Agent joined room:', event.roomUrl);
      } else if (event.status === 'error') {
        console.error('Voice error:', event.error);
      }
      break;
    case 'transcript':
      if (event.isFinal) {
        console.log(`${event.speaker}: ${event.text}`);
      }
      break;
    case 'speak-status':
      if (event.status === 'started') {
        // Show speaking indicator
      } else if (event.status === 'finished') {
        // Back to listening
      } else if (event.status === 'interrupted') {
        // User interrupted the agent
      }
      break;
  }
});
```

## Full Example
```typescript
import { io } from 'socket.io-client';
import DailyIframe from '@daily-co/daily-js';

const API_URL = 'https://api.humalike.tech';
const API_KEY = 'ak_your_api_key';
const roomUrl = 'https://your-domain.daily.co/room-name';

// 1. Create voice-enabled agent
const agent = await fetch(`${API_URL}/api/agents`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'X-API-Key': API_KEY },
  body: JSON.stringify({
    name: 'Voice Assistant',
    agentType: 'HUMA-0.1',
    metadata: {
      className: 'Assistant',
      personality: 'Friendly voice assistant.',
      instructions: 'Respond concisely in 1-2 sentences.',
      routerType: 'turn-taking',
      tools: [],
      voice: { enabled: true }
    }
  })
}).then(r => r.json());

// 2. Connect WebSocket
const socket = io(API_URL, {
  query: { agentId: agent.id, apiKey: API_KEY },
  transports: ['websocket']
});

// 3. Handle events
socket.on('event', (event) => {
  switch (event.type) {
    case 'voice-status':
      console.log('Voice:', event.status, event.roomUrl || '');
      break;
    case 'transcript':
      if (event.isFinal) console.log(`${event.speaker}: ${event.text}`);
      break;
    case 'speak-status':
      console.log('Speak:', event.status);
      break;
  }
});

// 4. Join Daily.co room (client audio)
const callFrame = DailyIframe.createCallObject({
  audioSource: true,
  videoSource: false,
});
await callFrame.join({ url: roomUrl });

// 5. Tell HUMA to join the same room
socket.emit('message', {
  type: 'join-daily-room',
  roomUrl: roomUrl
});

// 6. Cleanup
function disconnect() {
  socket.emit('message', { type: 'leave-daily-room' });
  callFrame.leave();
  socket.disconnect();
}
```

## Best Practices
### Keep responses short

Set instructions like "respond in 1-2 sentences". Long responses feel unnatural in voice and increase latency.
### Use headphones during development

Without headphones, the microphone can pick up the agent's speech and create feedback loops. Daily.co has echo cancellation, but headphones are safest.
### Show visual feedback

Use `speak-status` events to show when the agent is speaking, thinking, or listening. The ~1s of silence between user speech and agent response needs a visual indicator.
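One way to drive that indicator is a small pure function that folds incoming events into a UI state. This is an illustrative sketch, not part of the SDK: the `UiState` names and the `nextUiState` helper are our own, built on the `speak-status` and `transcript` events described in the Events Reference.

```typescript
// Sketch: derive a UI state from HUMA events (names are our own).
type UiState = 'idle' | 'listening' | 'thinking' | 'speaking';

function nextUiState(
  current: UiState,
  event: { type: string; status?: string; isFinal?: boolean }
): UiState {
  if (event.type === 'speak-status') {
    if (event.status === 'started') return 'speaking';
    // finished, interrupted, or failed: go back to listening
    if (['finished', 'interrupted', 'failed'].includes(event.status ?? '')) {
      return 'listening';
    }
  }
  // A final transcript means the agent is now processing the user's turn.
  if (event.type === 'transcript' && event.isFinal) return 'thinking';
  return current; // ignore events that don't change the indicator
}
```

Feed every `socket.on('event', ...)` payload through this reducer and render a badge or waveform for the current state.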
### Request microphone permission early

Browsers require explicit microphone permission. Request it before joining the room to avoid a confusing UX.
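A common pattern is to probe for permission with the standard `getUserMedia` browser API before joining. This sketch (the `ensureMicPermission` helper is our own, not part of any SDK) requests audio and immediately releases the tracks, since Daily.co opens its own stream on join:

```typescript
// Sketch: prompt for mic permission up front (browser-only).
// The MediaDevicesLike type lets tests inject a fake implementation.
type MediaDevicesLike = {
  getUserMedia(constraints: { audio: boolean }): Promise<{
    getTracks(): { stop(): void }[];
  }>;
};

async function ensureMicPermission(
  devices: MediaDevicesLike = (globalThis as any).navigator?.mediaDevices
): Promise<boolean> {
  try {
    const stream = await devices.getUserMedia({ audio: true });
    // Release the tracks immediately; Daily.co opens its own stream on join.
    stream.getTracks().forEach(t => t.stop());
    return true;
  } catch {
    return false; // permission denied, or no microphone available
  }
}

// Usage (in the browser): if (!(await ensureMicPermission())) showMicHelp();
```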
### Always clean up

Send `leave-daily-room` and disconnect the Daily.co call frame before closing. This ensures the agent leaves the room cleanly.