🚀 Quick install
Copy and run the following command to install this Skill:
npx skills add https://skills.sh/aradotso/trending-skills/voicebox-voice-synthesis
💡 Note: requires Node.js and npm
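Voicebox Voice Synthesis Studio
Skill from ara.so, part of the Daily 2026 Skills series.
Voicebox is a local-first, open-source voice cloning and TTS studio: a self-hosted alternative to ElevenLabs. It runs entirely on your own machine (MLX/Metal on macOS, CUDA on Windows/Linux, with a CPU fallback), exposes a REST API on localhost:17493, and ships with 5 TTS engines, support for 23 languages, post-processing effects, and a multi-track story editor.
Installation
Prebuilt binaries (recommended)
| Platform | Link |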
|---|---|
| macOS Apple Silicon | https://voicebox.sh/download/mac-arm |
| macOS Intel | https://voicebox.sh/download/mac-intel |
| Windows | https://voicebox.sh/download/windows |
| Docker | docker compose up |
Linux requires building from source: https://voicebox.sh/linux-install
Building from source
Prerequisites: Bun, Rust, Python 3.11+, and the Tauri prerequisites
git clone https://github.com/jamiepine/voicebox.git
cd voicebox
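# Install the just task runner
brew install just    # macOS
cargo install just   # any platform
# Set up the Python venv and all dependencies
just setup
# Start the backend and the desktop app in development mode
just dev
# List all available commands
just --list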
Architecture
| Layer | Technology |
|---|---|
| Desktop app | Tauri (Rust) |
| Frontend | React + TypeScript + Tailwind CSS |
| State management | Zustand + React Query |
| Backend | FastAPI (Python) on port 17493 |
| TTS engines | Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
The Python FastAPI backend handles all machine-learning inference. The Tauri Rust shell wraps the frontend and manages the backend process lifecycle. The API remains directly reachable at http://localhost:17493 even while the desktop app is in use.
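Because the API is reachable independently of the desktop app, a script can check up front whether anything is listening on the port. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and it only tests TCP reachability, not whether the listener is actually Voicebox.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 17493, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `is_port_open()` before submitting work gives a clearer failure message than a connection error raised from deep inside an HTTP client.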
REST API reference
Base URL: http://localhost:17493
Interactive docs: http://localhost:17493/docs
Generating speech
# Basic generation
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a voice clone.",
    "profile_id": "abc123",
    "language": "en"
  }'
# With an explicit engine
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Speak slowly and with gravitas.",
    "profile_id": "abc123",
    "language": "en",
    "engine": "qwen3-tts"
  }'
# With paralinguistic tags (Chatterbox Turbo only)
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "That is absolutely hilarious! [laugh] I cannot believe it.",
    "profile_id": "abc123",
    "engine": "chatterbox-turbo",
    "language": "en"
  }'
Voice profiles
# List all profiles
curl http://localhost:17493/profiles
# Create a new profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Narrator",
    "language": "en",
    "description": "Deep narrative voice"
  }'
# Upload an audio sample to a profile
curl -X POST http://localhost:17493/profiles/{profile_id}/samples \
  -F "file=@/path/to/voice-sample.wav"
# Export a profile
curl http://localhost:17493/profiles/{profile_id}/export \
  --output narrator-profile.zip
# Import a profile
curl -X POST http://localhost:17493/profiles/import \
  -F "file=@narrator-profile.zip"
Generation queue and status
# Get generation status (SSE stream)
curl -N http://localhost:17493/generate/{generation_id}/status
# List recent generations
curl http://localhost:17493/generations
# Retry a failed generation
curl -X POST http://localhost:17493/generations/{generation_id}/retry
# Download the generated audio
curl http://localhost:17493/generations/{generation_id}/audio \
  --output output.wav
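The status endpoint is described as an SSE stream, so a client has to reassemble events from `data:` lines. Below is a small sketch of that parsing step in Python. It assumes each event carries a JSON object (the TypeScript example later in this document reads a `status` field from it); the function name `iter_sse_json` is hypothetical, and other SSE fields are deliberately ignored.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each complete server-sent event in a line stream.

    In the SSE format an event's payload is its `data:` lines joined by
    newlines, and a blank line terminates the event.
    """
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line: dispatch the buffered event, if any
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # event:, id: and retry: fields are ignored in this sketch
```

With httpx, one plausible way to feed it is `with httpx.stream("GET", url, timeout=None) as resp:` followed by `for event in iter_sse_json(resp.iter_lines()):`, stopping once `event.get("status")` is `"complete"` or `"failed"`.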
Models
# List available models and their download status
curl http://localhost:17493/models
# Unload a model from GPU memory (does not delete it)
curl -X POST http://localhost:17493/models/{model_id}/unload
TypeScript/JavaScript integration
Basic TTS client
const VOICEBOX_URL = process.env.VOICEBOX_API_URL ?? "http://localhost:17493";
interface GenerateRequest {
  text: string;
  profile_id: string;
  language?: string;
  engine?: "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";
}
interface GenerateResponse {
  generation_id: string;
  status: "queued" | "processing" | "complete" | "failed";
  audio_url?: string;
}
async function generateSpeech(req: GenerateRequest): Promise<GenerateResponse> {
  const response = await fetch(`${VOICEBOX_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!response.ok) {
    throw new Error(`Voicebox API error: ${response.status} ${await response.text()}`);
  }
  return response.json();
}
// Usage example
const result = await generateSpeech({
  text: "Welcome to our application.",
  profile_id: "abc123",
  language: "en",
  engine: "qwen3-tts",
});
console.log("Generation ID:", result.generation_id);
Polling for completion
async function waitForGeneration(
  generationId: string,
  timeoutMs = 60_000
): Promise<string> {
  const start = Date.now();
  while (Date.now() - start < timeoutMs) {
    const res = await fetch(`${VOICEBOX_URL}/generations/${generationId}`);
    if (!res.ok) {
      throw new Error(`Status check failed: ${res.status}`);
    }
    const data = await res.json();
    if (data.status === "complete") {
      return `${VOICEBOX_URL}/generations/${generationId}/audio`;
    }
    if (data.status === "failed") {
      throw new Error(`Generation failed: ${data.error}`);
    }
    // Wait one second before polling again
    await new Promise((r) => setTimeout(r, 1000));
  }
  throw new Error("Generation timed out");
}
Streaming status with SSE
function streamGenerationStatus(
  generationId: string,
  onStatus: (status: string) => void
): () => void {
  const eventSource = new EventSource(
    `${VOICEBOX_URL}/generate/${generationId}/status`
  );
  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    onStatus(data.status);
    // Stop listening once the generation reaches a terminal state
    if (data.status === "complete" || data.status === "failed") {
      eventSource.close();
    }
  };
  eventSource.onerror = () => eventSource.close();
  // Return a cleanup function
  return () => eventSource.close();
}
// Usage example
const cleanup = streamGenerationStatus("gen_abc123", (status) => {
  console.log("Status update:", status);
});
Downloading audio as a Blob
async function downloadAudio(generationId: string): Promise<Blob> {
  const response = await fetch(
    `${VOICEBOX_URL}/generations/${generationId}/audio`
  );
  if (!response.ok) {
    throw new Error(`Audio download failed: ${response.status}`);
  }
  return response.blob();
}
// Play in the browser
async function playGeneratedAudio(generationId: string): Promise<void> {
  const blob = await downloadAudio(generationId);
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  // Release the object URL once playback finishes
  audio.onended = () => URL.revokeObjectURL(url);
  await audio.play();
}
Python integration
import asyncio
import httpx
VOICEBOX_URL = "http://localhost:17493"
async def generate_speech(
    text: str,
    profile_id: str,
    language: str = "en",
    engine: str = "qwen3-tts",
) -> bytes:
    async with httpx.AsyncClient(timeout=120.0) as client:
        # Submit the generation job
        resp = await client.post(
            f"{VOICEBOX_URL}/generate",
            json={
                "text": text,
                "profile_id": profile_id,
                "language": language,
                "engine": engine,
            },
        )
        resp.raise_for_status()
        generation_id = resp.json()["generation_id"]
        # Poll once per second until the job finishes (up to about 120 s)
        for _ in range(120):
            status_resp = await client.get(
                f"{VOICEBOX_URL}/generations/{generation_id}"
            )
            status_resp.raise_for_status()
            status_data = status_resp.json()
            if status_data["status"] == "complete":
                audio_resp = await client.get(
                    f"{VOICEBOX_URL}/generations/{generation_id}/audio"
                )
                audio_resp.raise_for_status()
                return audio_resp.content
            if status_data["status"] == "failed":
                raise RuntimeError(f"Generation failed: {status_data.get('error')}")
            await asyncio.sleep(1.0)
        raise TimeoutError("Generation timed out after 120 seconds")
# Usage example
audio_bytes = asyncio.run(
    generate_speech(
        text="The quick brown fox jumps over the lazy dog.",
        profile_id="your-profile-id",
        language="en",
        engine="chatterbox",
    )
)
with open("output.wav", "wb") as f:
    f.write(audio_bytes)
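TTS engine selection guide
| Engine | Best for | Languages | VRAM | Notes |
|---|---|---|---|---|
| qwen3-tts (0.6B/1.7B) | Quality + instructions | 10 | Medium | Supports delivery instructions embedded in the text |
| luxtts | Fast CPU generation | English only | ~1 GB | 150x real time on CPU, 48 kHz |
| chatterbox | Multilingual coverage | 23 | Medium | Arabic, Hindi, Swahili, CJK, and more |
| chatterbox-turbo | Expressive/emotional speech | English only | Low (350M) | Uses [laugh], [sigh], [gasp] tags |
| tada (1B/3B) | Long-form coherence | 10 | High | 700+ seconds of audio, HumeAI model |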
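One way to read the table is as a small decision procedure. The sketch below encodes that reading in Python; the function `pick_engine`, its parameters, and the priority order between rules are all assumptions made for illustration, not part of Voicebox. It also does not check whether a non-English language is among the 10 that tada or qwen3-tts support, since the table does not list them.

```python
def pick_engine(
    language: str = "en",
    *,
    expressive: bool = False,
    long_form: bool = False,
    cpu_only: bool = False,
) -> str:
    """Suggest an engine id from the selection table (heuristic, assumed priorities)."""
    english = language.lower().startswith("en")
    if long_form:
        # Long-form coherence; assumes the language is one tada supports
        return "tada"
    if english and expressive:
        return "chatterbox-turbo"  # paralinguistic tags, English only
    if english and cpu_only:
        return "luxtts"  # fast CPU generation, English only
    if not english:
        return "chatterbox"  # widest language coverage in the table
    return "qwen3-tts"  # quality plus delivery instructions
```

For example, `pick_engine("en", expressive=True)` returns `"chatterbox-turbo"`, while `pick_engine("sw")` falls through to `"chatterbox"`.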
Delivery instructions (Qwen3-TTS)
Embed natural-language instructions directly in the text:
await generateSpeech({
  text: "(whisper) I have a secret to tell you.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});
await generateSpeech({
  text: "(speak slowly and clearly) Step one: open the application.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});
Paralinguistic tags (Chatterbox Turbo)
// Tags listed for Chatterbox Turbo
const tags = [
  "[laugh]", "[chuckle]", "[gasp]", "[cough]",
  "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]"
];
await generateSpeech({
  text: "Oh really? [gasp] I had no idea! [laugh] That's incredible.",
  profile_id: "abc123",
  engine: "chatterbox-turbo",
});
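Since the tags are documented for Chatterbox Turbo only, a client may want to lint text before sending it. The following Python sketch does that with a regular expression. `check_tags` and `KNOWN_TAGS` are hypothetical names, and the tag set is simply the list shown above, which may not be exhaustive.

```python
import re

# Tags copied from the list above; treat as illustrative, not exhaustive
KNOWN_TAGS = {
    "laugh", "chuckle", "gasp", "cough",
    "sigh", "groan", "sniff", "shush", "clear throat",
}
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def check_tags(text: str, engine: str) -> list[str]:
    """Return warnings about bracketed tags in `text` for the chosen engine."""
    warnings = []
    for tag in TAG_PATTERN.findall(text):
        if engine != "chatterbox-turbo":
            warnings.append(
                f"[{tag}] is only interpreted by chatterbox-turbo, not {engine}"
            )
        elif tag.lower() not in KNOWN_TAGS:
            warnings.append(f"[{tag}] is not in the known tag list")
    return warnings
```

An empty result means the text contains no tags, or only listed tags with the matching engine.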
Environment and configuration
# Custom model directory (set before launch)
export VOICEBOX_MODELS_DIR=/path/to/models
# For AMD ROCm GPUs (configured automatically, but can be overridden)
export HSA_OVERRIDE_GFX_VERSION=11.0.0
Docker configuration (docker-compose.yml override):
services:
  voicebox:
    environment:
      - VOICEBOX_MODELS_DIR=/models
    volumes:
      - /host/models:/models
    ports:
      - "17493:17493"
    # For NVIDIA GPU passthrough:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Common patterns
Voice profile creation flow
// 1. Create the profile
const profile = await fetch(`${VOICEBOX_URL}/profiles`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ name: "My Voice", language: "en" }),
}).then((r) => r.json());
// 2. Upload an audio sample (WAV/MP3, ideally 5–30 seconds of clear speech)
// audioBlob is assumed to be a Blob or File that already holds the recording
const formData = new FormData();
formData.append("file", audioBlob, "sample.wav");
await fetch(`${VOICEBOX_URL}/profiles/${profile.id}/samples`, {
  method: "POST",
  body: formData,
});
// 3. Generate speech with the new profile
const gen = await generateSpeech({
  text: "Testing my cloned voice.",
  profile_id: profile.id,
});
Batch generation with the queue
async function batchGenerate(
  items: Array<{ text: string; profileId: string }>,
  // Typed against GenerateRequest so only known engine ids are accepted
  engine: GenerateRequest["engine"] = "qwen3-tts"
): Promise<string[]> {
  // Submit all jobs; Voicebox processes them serially to avoid GPU contention
  const submissions = await Promise.all(
    items.map((item) =>
      generateSpeech({ text: item.text, profile_id: item.profileId, engine })
    )
  );
  // Wait for every job to finish
  const audioUrls = await Promise.all(
    submissions.map((s) => waitForGeneration(s.generation_id))
  );
  return audioUrls;
}
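The TypeScript version above rejects as soon as any single job fails. As a variation, the Python sketch below collects a result or an exception per item, so one failed generation does not discard the rest. The `submit` and `wait` callables are hypothetical injection points (for example thin wrappers around the `/generate` and `/generations/{id}` calls); nothing here is a Voicebox API.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def batch_generate(
    texts: Sequence[str],
    submit: Callable[[str], Awaitable[str]],
    wait: Callable[[str], Awaitable[str]],
) -> list:
    """Run submit-then-wait for every text; return a result or an exception per item."""

    async def one(text: str) -> str:
        generation_id = await submit(text)  # returns a generation id
        return await wait(generation_id)    # returns e.g. an audio URL

    # return_exceptions=True keeps failures in place instead of raising the first one
    return await asyncio.gather(*(one(t) for t in texts), return_exceptions=True)
```

Results come back in input order, so callers can zip them with the original texts and retry only the entries that are exceptions.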
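Long text (automatic chunking)
Voicebox chunks text automatically at sentence boundaries, so you can send the full text in one request:
// Up to 50,000 characters are supported
const longScript = `
Chapter one. The morning mist settled over the valley...
`;
await generateSpeech({
  text: longScript,
  profile_id: "narrator-profile-id",
  engine: "tada", // best suited to long-form coherence
  language: "en",
});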
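To make "chunking at sentence boundaries" concrete, here is a rough Python sketch of the general technique: split on sentence-ending punctuation, then greedily pack whole sentences into chunks. This is not Voicebox's actual implementation; the function name, the `max_len` default, and the punctuation rule are illustrative assumptions. Only the 50,000-character limit comes from the example above.

```python
import re

MAX_CHARS = 50_000  # limit quoted in the example above


def chunk_by_sentence(text: str, max_len: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `max_len` characters.

    A single sentence longer than `max_len` is kept whole rather than cut.
    """
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    # Split after '.', '!' or '?' when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_len:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A real implementation would also need to handle abbreviations, quotes, and punctuation in languages that do not separate sentences with whitespace.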
Troubleshooting
API not responding
# Check whether the backend is running
curl http://localhost:17493/health
# Restart only the backend (development mode)
just backend
# View logs
just logs
GPU not detected
# Check which backend was detected
curl http://localhost:17493/system/info
# Force CPU mode (set before launch)
export VOICEBOX_FORCE_CPU=1
Model download fails or is slow
# Set a custom model directory with enough free space
export VOICEBOX_MODELS_DIR=/path/with/space
just dev
# Cancel a stuck download through the API
curl -X DELETE http://localhost:17493/models/{model_id}/download
Out of VRAM: unload models
# List loaded models
curl http://localhost:17493/models | jq '.[] | select(.loaded == true)'
# Unload a specific model
curl -X POST http://localhost:17493/models/{model_id}/unload
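Audio quality issues
- Use 5–30 seconds of clean, noise-free audio for each voice sample
- Multiple samples improve cloning quality; upload 3–5 different sentences
- For multilingual cloning, use the chatterbox engine
- For best results, sample audio should be mono WAV at 16 kHz or higher
- For English, use luxtts for the highest-quality output (48 kHz)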
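The numeric guidelines in this list (5 to 30 seconds, mono, 16 kHz or higher) can be checked before uploading. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files; the function name `check_sample` is hypothetical, and the thresholds are taken from the list above.

```python
import wave


def check_sample(path: str) -> list[str]:
    """Return a list of problems with a WAV sample; empty if it meets the guidelines."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    problems = []
    if rate < 16_000:
        problems.append(f"sample rate {rate} Hz is below 16 kHz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if not 5.0 <= duration <= 30.0:
        problems.append(f"duration {duration:.1f} s is outside 5-30 s")
    return problems
```

It cannot judge noise or clarity, so it complements listening to the sample rather than replacing it.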
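Generations stuck in the queue after a crash
Voicebox automatically recovers stale generations on startup. If the problem persists, retry the generation manually: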
curl -X POST http://localhost:17493/generations/{generation_id}/retry
Frontend integration (React example)
import { useState } from "react";
const VOICEBOX_URL = import.meta.env.VITE_VOICEBOX_URL ?? "http://localhost:17493";
export function VoiceGenerator({ profileId }: { profileId: string }) {
  const [text, setText] = useState("");
  const [audioUrl, setAudioUrl] = useState<string | null>(null);
  const [loading, setLoading] = useState(false);
  const handleGenerate = async () => {
    setLoading(true);
    try {
      const res = await fetch(`${VOICEBOX_URL}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, profile_id: profileId, language: "en" }),
      });
      if (!res.ok) {
        throw new Error(`Voicebox API error: ${res.status}`);
      }
      const { generation_id } = await res.json();
      // Poll once per second until the generation finishes
      let done = false;
      while (!done) {
        await new Promise((r) => setTimeout(r, 1000));
        const statusRes = await fetch(`${VOICEBOX_URL}/generations/${generation_id}`);
        const { status } = await statusRes.json();
        if (status === "complete") {
          setAudioUrl(`${VOICEBOX_URL}/generations/${generation_id}/audio`);
          done = true;
        } else if (status === "failed") {
          throw new Error("Generation failed");
        }
      }
    } finally {
      setLoading(false);
    }
  };
  return (
    <div>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button onClick={handleGenerate} disabled={loading}>
        {loading ? "Generating..." : "Generate speech"}
      </button>
      {audioUrl && <audio controls src={audioUrl} />}
    </div>
  );
}
📄 Original documentation
Full documentation (English):
https://skills.sh/aradotso/trending-skills/voicebox-voice-synthesis