# voicebox **Repository Path**: chong00li/voicebox ## Basic Information - **Project Name**: voicebox - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-05-07 - **Last Updated**: 2026-05-07 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README

Voicebox

The open-source AI voice studio.
Clone any voice. Generate speech. Dictate into any app. Talk to agents in voices you own.
The full voice I/O stack, running locally on your machine.

voicebox.sh • Docs • Download • Features • API • Troubleshooting

Click the image above to watch the demo video on voicebox.sh

Voicebox Screenshot 2

Voicebox Screenshot 3

## What is Voicebox? Voicebox is a **local-first AI voice studio** — a free and open-source alternative to **ElevenLabs** and **WisprFlow** in one app. Clone voices from a few seconds of audio, generate speech in 23 languages across 7 TTS engines, dictate into any text field with a global hotkey, and give any MCP-aware AI agent a voice of your choosing. The two cloud incumbents sit on opposite halves of the voice I/O loop — ElevenLabs on output, WisprFlow on input. Voicebox does both, bridges them with a bundled local LLM for refinement and per-profile personas, and runs the whole thing on your machine. - **Complete privacy** — models, voice data, and captures never leave your machine - **7 TTS engines** — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro - **Voice cloning and preset voices** — zero-shot cloning from a reference sample, or 50+ curated preset voices via Kokoro and Qwen CustomVoice - **23 languages** — from English to Arabic, Japanese, Hindi, Swahili, and more - **Post-processing effects** — pitch shift, reverb, delay, chorus, compression, and filters - **Expressive speech** — paralinguistic tags like `[laugh]`, `[sigh]`, `[gasp]` via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice - **Unlimited length** — auto-chunking with crossfade for scripts, articles, and chapters - **Stories editor** — multi-track timeline for conversations, podcasts, and narratives - **Voice input** — global dictation hotkey with push-to-talk and toggle modes, accessibility-verified auto-paste on macOS, in-app mic on every text field, Whisper-based STT - **Agent voice output** — one tool call (`voicebox.speak`) and any MCP-aware agent (Claude Code, Cursor, Cline) speaks to you in a voice you've cloned - **Voice personalities** — attach a free-form persona to any voice profile, then Compose, Rewrite, or Respond via a bundled local LLM — agents can invoke the same modes over MCP - **API-first** — REST API plus a built-in MCP server for integrating voice I/O into your own apps and agents - **Native performance** — built with Tauri (Rust), not Electron - **Runs everywhere** — macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker --- ## Download | Platform | Download | | --------------------- | ------------------------------------------------------ | | macOS (Apple Silicon) | [Download DMG](https://voicebox.sh/download/mac-arm) | | macOS (Intel) | [Download DMG](https://voicebox.sh/download/mac-intel) | | Windows | [Download MSI](https://voicebox.sh/download/windows) | | Docker | `docker compose up` | > **[View all binaries →](https://github.com/jamiepine/voicebox/releases/latest)** > **Linux** — Pre-built binaries are not yet available. See [voicebox.sh/linux-install](https://voicebox.sh/linux-install) for build-from-source instructions. > **Having trouble?** See the [Troubleshooting Guide](docs/content/docs/overview/troubleshooting.mdx) for common install, generation, model-download, and GPU issues. --- ## Features ### Multi-Engine Voice Cloning Seven TTS engines with different strengths, switchable per-generation: | Engine | Languages | Strengths | | --------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------- | | **Qwen3-TTS** (0.6B / 1.7B) | 10 | High-quality multilingual cloning, delivery instructions ("speak slowly", "whisper") | | **Qwen CustomVoice** | 10 | 9 curated preset voices with natural-language delivery control — no reference audio required | | **LuxTTS** | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU | | **Chatterbox Multilingual** | 23 | Broadest language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more | | **Chatterbox Turbo** | English | Fast 350M model with paralinguistic emotion/sound tags | | **TADA** (1B / 3B) | 10 | HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment | | **Kokoro** | 8 | 50 curated preset voices, tiny 82M model, fast CPU inference | ### Emotions & Paralinguistic Tags Only **Chatterbox Turbo** interprets paralinguistic tags like `[laugh]` and `[sigh]`. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them literally as text. With **Chatterbox Turbo** selected, type `/` in the text input to open the tag inserter and add expressive tags inline with speech: `[laugh]` `[chuckle]` `[gasp]` `[cough]` `[sigh]` `[groan]` `[sniff]` `[shush]` `[clear throat]` ### Post-Processing Effects 8 audio effects powered by Spotify's `pedalboard` library. Apply after generation, preview in real time, build reusable presets. | Effect | Description | | ---------------- | --------------------------------------------- | | Pitch Shift | Up or down by up to 12 semitones | | Reverb | Configurable room size, damping, wet/dry mix | | Delay | Echo with adjustable time, feedback, and mix | | Chorus / Flanger | Modulated delay for metallic or lush textures | | Compressor | Dynamic range compression | | Gain | Volume adjustment (-40 to +40 dB) | | High-Pass Filter | Remove low frequencies | | Low-Pass Filter | Remove high frequencies | Ships with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults. ### Unlimited Generation Length Text is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines. - Configurable auto-chunking limit (100–5,000 chars) - Crossfade slider (0–200ms) for smooth transitions - Max text length: 50,000 characters - Smart splitting respects abbreviations, CJK punctuation, and `[tags]` ### Generation Versions Every generation supports multiple versions with provenance tracking: - **Original** — clean TTS output, always preserved - **Effects versions** — apply different effects chains from any source version - **Takes** — regenerate with a new seed for variation - **Source tracking** — each version records its lineage - **Favorites** — star generations for quick access ### Async Generation Queue Generation is non-blocking. Submit and immediately start typing the next one. - Serial execution queue prevents GPU contention - Real-time SSE status streaming - Failed generations can be retried - Stale generations from crashes auto-recover on startup ### Voice Profile Management - Create profiles from audio files or record directly in-app - Import/export profiles to share or back up - Multi-sample support for higher quality cloning - Per-profile default effects chains - Organize with descriptions and language tags ### Stories Editor Multi-voice timeline editor for conversations, podcasts, and narratives. - Multi-track composition with drag-and-drop - Inline audio trimming and splitting - Auto-playback with synchronized playhead - Version pinning per track clip ### Global Dictation & Voice Input The other half of the voice I/O loop. Hold a hotkey anywhere on your system, speak, release — on macOS the transcript pastes straight into the focused text field. Or hit the mic on any Voicebox text input and dictate directly into the app. - **Configurable chord bindings** — hold-to-speak and tap-to-toggle chords, each rebindable in the in-app chord picker. Holding push-to-talk and tapping `Space` mid-hold upgrades into a toggle session without a gap in audio - **Target-aware paste (macOS)** — accessibility-verified injection into the focused text field, with atomic clipboard save/restore so your clipboard isn't clobbered - **First-run permissions UX** — in-app gates walk you through the macOS Accessibility and Input Monitoring grants with deep-links to System Settings - **In-app mic button** on every Voicebox text field — generation form, profile descriptions, story titles, anywhere you'd type - **LLM refinement** — optional cleanup of ums, stutters, and false starts before paste - **On-screen pill** — floating overlay surfacing `recording`, `transcribing`, `refining`, and `speaking` states. Same pill agents use when they speak to you, so there's one mental model for both directions of the loop ### Speech-to-Text Voicebox runs OpenAI Whisper for transcription — the same model that backs dictation, the Captures tab, and the `/transcribe` API. Running on MLX (Apple Silicon) or PyTorch (CUDA / ROCm / DirectML / CPU) depending on your platform. | Size | Notes | | ----------------------------- | -------------------------------------------------- | | Base / Small / Medium / Large | Standard Whisper quality ladder | | Turbo | ~8x faster than Whisper Large, minimal quality loss | More engines (Parakeet v3, Qwen3-ASR) are planned — see [Roadmap](#roadmap). ### Captures Every dictation, in-app recording, and uploaded audio file lands in the Captures tab — original audio paired with transcript, always preserved. - **Replay, re-transcribe, refine** — rerun STT with any Whisper size, or re-run the raw transcript through the local LLM with different flags (filler cleanup, self-correction removal, technical-term preservation) - **Edit inline** — tweak the transcript and save on blur - **Play as voice profile** — turn any capture into speech with a cloned voice, one click - **Promote to voice sample** — use a capture's audio + transcript as a reference sample on any voice profile - **Local capture storage** — original audio and transcript stay in your Voicebox data directory, with a folder shortcut in Settings ### Agent Voice Output Every agent gets a voice. One tool call and any MCP-aware agent can speak to you in a voice you've cloned — task completions, questions, notifications. The same pill that surfaces during dictation surfaces during agent speech, so you always see what's coming out of your machine. ```ts // In any MCP-aware agent: await voicebox.speak({ text: "Deploy complete.", profile: "Morgan", }); ``` Also exposed as `POST /speak` for anything that doesn't speak MCP — ACP, A2A, shell scripts, custom harnesses. - **Bidirectional pill** — `recording`, `transcribing`, `refining`, and `speaking` are all states of the same OS-level overlay, so dictation and agent speech share one surface - **Per-agent voice binding** — in **Settings → MCP**, pin Claude Code to Morgan and Cursor to Scarlett so you can tell which agent is talking without looking. Each client's `last_seen_at` timestamp confirms the install actually took - **Always visible** — no silent background TTS; every agent-initiated speak surfaces the pill with the voice profile name for the full duration - **HTTP + stdio transports** — install as a URL in Claude Code / Cursor / Windsurf / VS Code MCP, or point stdio-only clients at the bundled `voicebox-mcp` binary ### Voice Personalities Attach a free-form personality to any voice profile — who this voice is, how they speak, what they care about. Two actions appear on the generate box when a personality is set, powered by a bundled Qwen3 LLM running entirely locally. - **Compose** — a shuffle button that drops a fresh in-character line into the textarea; edit and speak, or click again for a different take - **Speak in character** — a toggle that routes your input text through the personality LLM to be rewritten in their voice before TTS Agents can reach the same rewrite path over MCP by passing `personality: true` to `voicebox.speak`, turning the tool into a text-in → personality-LLM → TTS pipeline. The same LLM backs dictation's refinement step — one LLM in the app, one model cache, one GPU-memory footprint. **Local LLM options:** Qwen3 0.6B / 1.7B / 4B, sharing the TTS runtime (MLX on Apple Silicon, PyTorch elsewhere). Use cases: agent dev loops (dictate a question, hear the answer in a cloned voice), interactive characters for games and narrative tools, speech assistance for people who can't speak in their original voice. ### Model Management - Per-model unload to free GPU memory without deleting downloads - Custom models directory via `VOICEBOX_MODELS_DIR` - Model folder migration with progress tracking - Download cancel/clear UI ### GPU Support | Platform | Backend | Notes | | ------------------------ | -------------- | ---------------------------------------------- | | macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine | | Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app | | Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION | | Windows (any GPU) | DirectML | Universal Windows GPU support | | Intel Arc | IPEX/XPU | Intel discrete GPU acceleration | | Any | CPU | Works everywhere, just slower | --- ## API Voicebox exposes a REST API for integrating voice I/O into your own apps and agents. ```bash # Generate speech curl -X POST http://127.0.0.1:17493/generate \ -H "Content-Type: application/json" \ -d '{"text": "Hello world", "profile_id": "abc123", "language": "en"}' # Agent voice output — any app or script can speak in a cloned voice curl -X POST http://127.0.0.1:17493/speak \ -H "Content-Type: application/json" \ -H "X-Voicebox-Client-Id: my-script" \ -d '{"text": "Deploy complete.", "profile": "Morgan"}' # Transcribe an audio file curl -X POST http://127.0.0.1:17493/transcribe \ -F "audio=@recording.wav" \ -F "model=whisper-turbo" # List voice profiles curl http://127.0.0.1:17493/profiles ``` `POST /speak` accepts `profile` as a name (case-insensitive) or id, and resolves via the same precedence as the MCP tool: explicit arg → per-client binding → `capture_settings.default_playback_voice_id`. ### MCP server Voicebox ships a built-in **Model Context Protocol** server so any MCP-aware agent (Claude Code, Cursor, Windsurf, Cline, VS Code MCP extensions) can speak, transcribe, and browse captures and profiles. **Claude Code one-liner:** ``` claude mcp add voicebox \ --transport http \ --url http://127.0.0.1:17493/mcp \ --header "X-Voicebox-Client-Id: claude-code" ``` **Any HTTP MCP client** (Cursor, Windsurf, VS Code, etc.): ```json { "mcpServers": { "voicebox": { "url": "http://127.0.0.1:17493/mcp", "headers": { "X-Voicebox-Client-Id": "cursor" } } } } ``` **Stdio fallback** for clients that don't speak HTTP MCP — point at the bundled `voicebox-mcp` binary inside the app: ```json { "mcpServers": { "voicebox": { "command": "/Applications/Voicebox.app/Contents/MacOS/voicebox-mcp", "env": { "VOICEBOX_CLIENT_ID": "claude-desktop" } } } } ``` Four tools ship: `voicebox.speak`, `voicebox.transcribe`, `voicebox.list_captures`, `voicebox.list_profiles`. Per-client voice bindings are managed in **Voicebox → Settings → MCP**. See the [full MCP guide](docs/content/docs/overview/mcp-server.mdx) for tool signatures, resolution precedence, the speaking-pill contract, and security notes. ```ts // In any MCP-aware agent: await voicebox.speak({ text: "Tests passing. Ready to merge.", profile: "Morgan", // optional — falls back to the per-client binding personality: true, // optional — rewrites text through the profile's personality LLM first }); ``` **Use cases:** agent dev loops (voice in, voice out), game dialogue, podcast production, accessibility tools, voice assistants, content automation. Full API documentation available at `http://127.0.0.1:17493/docs`. --- ## Tech Stack | Layer | Technology | | ------------- | ------------------------------------------------------------------------------- | | Desktop App | Tauri (Rust) | | Frontend | React, TypeScript, Tailwind CSS | | State | Zustand, React Query | | Backend | FastAPI (Python) | | TTS Engines | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro | | STT | Whisper / Whisper Turbo (PyTorch or MLX) | | Local LLM | Qwen3 (0.6B / 1.7B / 4B), shared runtime with TTS / STT | | MCP Server | FastMCP mounted at `/mcp` (Streamable HTTP) + bundled stdio shim binary | | Native Shim | Rust (inside Tauri) for global hotkey, paste injection, focus introspection | | Effects | Pedalboard (Spotify) | | Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) | | Database | SQLite | | Audio | WaveSurfer.js, librosa | --- ## Roadmap | Feature | Description | | ---------------------------------- | ------------------------------------------------------------------------ | | **Windows / Linux auto-paste** | Dictation paste parity — `SendInput` on Windows, `uinput` / AT-SPI on Linux | | **STT engine expansion** | Parakeet v3 and Qwen3-ASR joining Whisper — 50+ languages, better non-English quality | | **Pipeline routing** | Configurable source → transform → sink chains with webhook + MCP sinks and a preset editor | | **Streaming transcription** | WebSocket `/transcribe/stream` for partial transcripts as you speak | | **End-to-end speech LLMs** | Moshi, GLM-4-Voice, Qwen2.5 Omni — real voice-to-voice, no text between | | **Voice Design** | Create new voices from text descriptions | | **Long-form capture** | Dual-stream recorder (mic + system audio) with summary LLM transform | | **Platform sinks** | Apple Notes, Obsidian, and other opt-in integrations | | **Plugin architecture** | Extend with custom models, transforms, and sinks | | **Mobile companion** | Control Voicebox from your phone | For the **full engineering status, open-issue triage, and prioritized work queue**, see [`docs/PROJECT_STATUS.md`](docs/PROJECT_STATUS.md) — a living document that tracks what's shipped, what's in-flight, candidate TTS engines under evaluation, and why we've accepted or backlogged specific integrations. --- ## Development See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed setup and contribution guidelines. ### Quick Start ```bash git clone https://github.com/jamiepine/voicebox.git cd voicebox just setup # creates Python venv, installs all deps just dev # starts backend + desktop app ``` Install [just](https://github.com/casey/just): `brew install just` or `cargo install just`. Run `just --list` to see all commands. **Prerequisites:** [Bun](https://bun.sh), [Rust](https://rustup.rs), [Python 3.11+](https://python.org), [Tauri Prerequisites](https://v2.tauri.app/start/prerequisites/), and [Xcode](https://developer.apple.com/xcode/) on macOS. The repo ships a pre-wired `.mcp.json` at the root — running Claude Code inside this checkout picks up the Voicebox MCP tools automatically once the dev app is running. ### Building Locally ```bash just build # Build CPU server binary + Tauri app just build-local # (Windows) Build CPU + CUDA server binaries + Tauri app ``` ### Adding New Voice Models The multi-engine architecture makes adding new TTS engines straightforward. A [step-by-step guide](docs/content/docs/developer/tts-engines.mdx) covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling. The guide is optimized for AI coding agents. An [agent skill](.agents/skills/add-tts-engine/SKILL.md) can pick up a model name and handle the entire integration autonomously — you just test the build locally. ### Project Structure ``` voicebox/ ├── app/ # Shared React frontend ├── tauri/ # Desktop app (Tauri + Rust) ├── web/ # Web deployment ├── backend/ # Python FastAPI server ├── landing/ # Marketing website └── scripts/ # Build & release scripts ``` --- ## Contributing Contributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. 1. Fork the repo 2. Create a feature branch 3. Make your changes 4. Submit a PR ## Security Found a security vulnerability? Please report it responsibly. See [SECURITY.md](SECURITY.md) for details. --- ## License MIT License — see [LICENSE](LICENSE) for details. ---

voicebox.sh