🚀 Quick install

Copy and run the following command to install this Skill:

npx skills add https://skills.sh/aradotso/trending-skills/voicebox-voice-synthesis

💡 Note: requires Node.js and npm

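Voicebox Voice Synthesis Studio

Skill by ara.so, part of the Daily 2026 Skills series.

Voicebox is a local-first, open-source voice cloning and TTS studio: a self-hostable alternative to ElevenLabs. It runs entirely on your local machine (macOS MLX/Metal, Windows/Linux CUDA, CPU fallback), exposes a REST API on localhost:17493, and ships with 5 TTS engines, 23 languages, post-processing effects, and a multi-track story editor.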


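Installation

Prebuilt binaries (recommended)

| Platform | Download |
| --- | --- |
| macOS Apple Silicon | https://voicebox.sh/download/mac-arm |
| macOS Intel | https://voicebox.sh/download/mac-intel |
| Windows | https://voicebox.sh/download/windows |
| Docker | docker compose up |

Linux requires building from source: https://voicebox.sh/linux-install

Building from source

Prerequisites: Bun, Rust, Python 3.11+, and the Tauri prerequisites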

git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install the just task runner
brew install just        # macOS
cargo install just       # any platform

# Set up the Python venv and all dependencies
just setup

# Start the backend and desktop app in development mode
just dev

# List all available commands
just --list

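Architecture

| Layer | Technology |
| --- | --- |
| Desktop app | Tauri (Rust) |
| Frontend | React + TypeScript + Tailwind CSS |
| State management | Zustand + React Query |
| Backend | FastAPI (Python) on port 17493 |
| TTS engines | Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |

The Python FastAPI backend handles all machine-learning inference. The Tauri Rust shell wraps the frontend and manages the lifecycle of the backend process. The API stays directly reachable at http://localhost:17493 even while the desktop app is in use.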


REST API Reference

Base URL: http://localhost:17493
Interactive docs: http://localhost:17493/docs

Generating speech

# Basic generation
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a voice clone.",
    "profile_id": "abc123",
    "language": "en"
  }'

# With engine selection
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Speak slowly and with gravitas.",
    "profile_id": "abc123",
    "language": "en",
    "engine": "qwen3-tts"
  }'

# With paralinguistic tags (Chatterbox Turbo only)
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "That is absolutely hilarious! [laugh] I cannot believe it.",
    "profile_id": "abc123",
    "engine": "chatterbox-turbo",
    "language": "en"
  }'

Voice profiles

# List all profiles
curl http://localhost:17493/profiles

# Create a new profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Narrator",
    "language": "en",
    "description": "Deep narrative voice"
  }'

# Upload an audio sample to a profile
curl -X POST http://localhost:17493/profiles/{profile_id}/samples \
  -F "file=@/path/to/voice-sample.wav"

# Export a profile
curl http://localhost:17493/profiles/{profile_id}/export \
  --output narrator-profile.zip

# Import a profile
curl -X POST http://localhost:17493/profiles/import \
  -F "file=@narrator-profile.zip"

Generation queue and status

# Follow generation status (SSE stream)
curl -N http://localhost:17493/generate/{generation_id}/status

# List recent generations
curl http://localhost:17493/generations

# Retry a failed generation
curl -X POST http://localhost:17493/generations/{generation_id}/retry

# Download the generated audio
curl http://localhost:17493/generations/{generation_id}/audio \
  --output output.wav

Models

# List available models and their download status
curl http://localhost:17493/models

# Unload a model from GPU memory (does not delete it)
curl -X POST http://localhost:17493/models/{model_id}/unload

TypeScript/JavaScript integration

Basic TTS client

const VOICEBOX_URL = process.env.VOICEBOX_API_URL ?? "http://localhost:17493";

interface GenerateRequest {
  text: string;
  profile_id: string;
  language?: string;
  engine?: "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";
}

interface GenerateResponse {
  generation_id: string;
  status: "queued" | "processing" | "complete" | "failed";
  audio_url?: string;
}

async function generateSpeech(req: GenerateRequest): Promise<GenerateResponse> {
  const response = await fetch(`${VOICEBOX_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });

  if (!response.ok) {
    throw new Error(`Voicebox API error: ${response.status} ${await response.text()}`);
  }

  return response.json();
}

// Usage example
const result = await generateSpeech({
  text: "Welcome to our application.",
  profile_id: "abc123",
  language: "en",
  engine: "qwen3-tts",
});

console.log("Generation ID:", result.generation_id);

Polling for completion

async function waitForGeneration(
  generationId: string,
  timeoutMs = 60_000
): Promise<string> {
  const start = Date.now();

  while (Date.now() - start < timeoutMs) {
    const res = await fetch(`${VOICEBOX_URL}/generations/${generationId}`);
    // Fail fast on HTTP errors instead of parsing an error body as a status
    if (!res.ok) {
      throw new Error(`Status check failed: ${res.status}`);
    }
    const data = await res.json();

    if (data.status === "complete") {
      return `${VOICEBOX_URL}/generations/${generationId}/audio`;
    }
    if (data.status === "failed") {
      throw new Error(`Generation failed: ${data.error}`);
    }

    // Wait one second between polls
    await new Promise((r) => setTimeout(r, 1000));
  }

  throw new Error("Generation timed out");
}
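
The loop above polls at a fixed one-second interval. For long generations, a capped exponential backoff is a common alternative that reduces request volume. The sketch below only computes the delay schedule; `backoffDelays` and its default values are hypothetical and not part of Voicebox.

```typescript
// Hypothetical helper: capped exponential backoff schedule for status polling.
// Returns the delays (in ms) whose sum fits within timeoutMs.
function backoffDelays(timeoutMs: number, baseMs = 500, capMs = 5_000): number[] {
  if (baseMs <= 0 || capMs <= 0) {
    throw new RangeError("baseMs and capMs must be positive");
  }
  const delays: number[] = [];
  let elapsed = 0;
  let next = Math.min(baseMs, capMs);
  while (elapsed + next <= timeoutMs) {
    delays.push(next);
    elapsed += next;
    next = Math.min(next * 2, capMs); // double each time, never above the cap
  }
  return delays;
}
```

Each entry can replace the fixed `1000` passed to `setTimeout` on successive iterations of a polling loop.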

Streaming status with SSE

function streamGenerationStatus(
  generationId: string,
  onStatus: (status: string) => void
): () => void {
  const eventSource = new EventSource(
    `${VOICEBOX_URL}/generate/${generationId}/status`
  );

  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    onStatus(data.status);

    if (data.status === "complete" || data.status === "failed") {
      eventSource.close();
    }
  };

  eventSource.onerror = () => eventSource.close();

  // Return a cleanup function
  return () => eventSource.close();
}

// Usage example
const cleanup = streamGenerationStatus("gen_abc123", (status) => {
  console.log("Status update:", status);
});
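
The handler above calls `JSON.parse` directly, so a malformed event would throw inside `onmessage`. A defensive parser is sketched below. It assumes each event carries a JSON object with a `status` field holding one of the four values from `GenerateResponse`; `parseStatus` and `isTerminal` are hypothetical names.

```typescript
// Status values as declared in the GenerateResponse interface.
type GenerationStatus = "queued" | "processing" | "complete" | "failed";

const STATUSES: readonly string[] = ["queued", "processing", "complete", "failed"];

// Hypothetical helper: returns the status from an SSE data payload,
// or null when the payload is not valid JSON or has no known status.
function parseStatus(raw: string): GenerationStatus | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data === "object" && data !== null && "status" in data) {
      const status = (data as { status: unknown }).status;
      if (typeof status === "string" && STATUSES.indexOf(status) !== -1) {
        return status as GenerationStatus;
      }
    }
  } catch {
    // Malformed JSON falls through to the null return below.
  }
  return null;
}

// True once no further updates are expected for this generation.
function isTerminal(status: GenerationStatus): boolean {
  return status === "complete" || status === "failed";
}
```

Inside `onmessage`, the stream can be closed when `parseStatus(event.data)` returns a status for which `isTerminal` is true.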

Downloading audio as a Blob

async function downloadAudio(generationId: string): Promise<Blob> {
  const response = await fetch(
    `${VOICEBOX_URL}/generations/${generationId}/audio`
  );

  if (!response.ok) {
    throw new Error(`Audio download failed: ${response.status}`);
  }

  return response.blob();
}

// Play in the browser
async function playGeneratedAudio(generationId: string): Promise<void> {
  const blob = await downloadAudio(generationId);
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  // Register the cleanup handler before playback starts
  audio.onended = () => URL.revokeObjectURL(url);
  // play() returns a promise; awaiting it surfaces autoplay rejections
  await audio.play();
}

Python integration

import httpx
import asyncio

VOICEBOX_URL = "http://localhost:17493"

async def generate_speech(
    text: str,
    profile_id: str,
    language: str = "en",
    engine: str = "qwen3-tts"
) -> bytes:
    async with httpx.AsyncClient(timeout=120.0) as client:
        # Submit the generation job
        resp = await client.post(
            f"{VOICEBOX_URL}/generate",
            json={
                "text": text,
                "profile_id": profile_id,
                "language": language,
                "engine": engine,
            }
        )
        resp.raise_for_status()
        generation_id = resp.json()["generation_id"]

        # Poll once per second until the generation completes (about 120 seconds)
        for _ in range(120):
            status_resp = await client.get(
                f"{VOICEBOX_URL}/generations/{generation_id}"
            )
            status_resp.raise_for_status()
            status_data = status_resp.json()

            if status_data["status"] == "complete":
                audio_resp = await client.get(
                    f"{VOICEBOX_URL}/generations/{generation_id}/audio"
                )
                audio_resp.raise_for_status()
                return audio_resp.content

            if status_data["status"] == "failed":
                raise RuntimeError(f"Generation failed: {status_data.get('error')}")

            await asyncio.sleep(1.0)

        raise TimeoutError("Generation timed out after 120 seconds")


# Usage example
audio_bytes = asyncio.run(
    generate_speech(
        text="The quick brown fox jumps over the lazy dog.",
        profile_id="your-profile-id",
        language="en",
        engine="chatterbox",
    )
)

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

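TTS engine selection guide

| Engine | Best for | Languages | VRAM | Notes |
| --- | --- | --- | --- | --- |
| qwen3-tts (0.6B/1.7B) | Quality + instructions | 10 | Medium | Supports delivery instructions embedded in the text |
| luxtts | Fast CPU generation | English only | ~1GB | 150x real-time on CPU, 48kHz |
| chatterbox | Multilingual coverage | 23 | Medium | Arabic, Hindi, Swahili, CJK, and more |
| chatterbox-turbo | Expressive/emotional speech | English only | Low (350M) | Uses [laugh], [sigh], [gasp] tags |
| tada (1B/3B) | Long-form coherence | 10 | | 700+ seconds of audio, HumeAI model |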
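
The table above can be folded into a small selection helper. This is a sketch only: `EngineNeeds` and `pickEngine` are hypothetical, and the rules simply restate the table rows rather than any logic shipped in Voicebox. Because the table does not say which 10 languages qwen3-tts and tada cover, the sketch routes every non-English request to chatterbox.

```typescript
type Engine = "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";

// Hypothetical description of what a request needs.
interface EngineNeeds {
  language: string;          // language code, e.g. "en" or "ar"
  expressiveTags?: boolean;  // text contains [laugh]-style tags
  longForm?: boolean;        // long narration where coherence matters
  cpuOnly?: boolean;         // no GPU available
}

function pickEngine(needs: EngineNeeds): Engine {
  const english = needs.language === "en";
  if (needs.expressiveTags && english) return "chatterbox-turbo"; // tags are Turbo-only
  if (needs.cpuOnly && english) return "luxtts";                  // fast CPU generation
  if (!english) return "chatterbox";                              // widest language coverage
  if (needs.longForm) return "tada";                              // long-form coherence
  return "qwen3-tts";                                             // quality + instructions
}
```

The result can be passed as the `engine` field of a generate request.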

Delivery instructions (Qwen3-TTS)

Embed natural-language instructions directly in the text:

await generateSpeech({
  text: "(whisper) I have a secret to tell you.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});

await generateSpeech({
  text: "(speak slowly and clearly) Step one: open the application.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});
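
Both calls above follow the same shape: a parenthesised instruction, a space, then the spoken text. A tiny helper can keep that format consistent. `withDelivery` is a hypothetical name, and the assumption that an instruction must not contain its own parentheses is mine, not something the source states.

```typescript
// Hypothetical helper: prefixes text with a parenthesised delivery instruction.
function withDelivery(instruction: string, text: string): string {
  // Strip stray parentheses so the instruction is wrapped exactly once.
  const cleaned = instruction.replace(/[()]/g, "").trim();
  const spoken = text.trim();
  return cleaned.length > 0 ? `(${cleaned}) ${spoken}` : spoken;
}
```

For example, `withDelivery("whisper", "I have a secret to tell you.")` produces the `text` value used in the first call above.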

Paralinguistic tags (Chatterbox Turbo)

const tags = [
  "[laugh]", "[chuckle]", "[gasp]", "[cough]",
  "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]"
];

await generateSpeech({
  text: "Oh really? [gasp] I had no idea! [laugh] That's incredible.",
  profile_id: "abc123",
  engine: "chatterbox-turbo",
});
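
Since these tags only work with chatterbox-turbo, a pre-flight check can catch a mismatched engine or a misspelled tag before a request is sent. The sketch below is hypothetical (`checkTags` and `TagCheck` are not part of Voicebox) and treats the nine tags listed above as the complete set, which is an assumption.

```typescript
// The tags listed in this section; assumed here to be the full supported set.
const KNOWN_TAGS = new Set([
  "[laugh]", "[chuckle]", "[gasp]", "[cough]",
  "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]",
]);

interface TagCheck {
  used: string[];     // every bracketed tag found in the text, in order
  unknown: string[];  // tags that are not in KNOWN_TAGS
  engineOk: boolean;  // false when known tags are used with another engine
}

function checkTags(text: string, engine: string): TagCheck {
  const used: string[] = text.match(/\[[^\[\]]+\]/g) ?? [];
  const unknown = used.filter((t) => !KNOWN_TAGS.has(t.toLowerCase()));
  const hasKnown = used.length > unknown.length;
  return { used, unknown, engineOk: !hasKnown || engine === "chatterbox-turbo" };
}
```

A caller might refuse to submit when `engineOk` is false and log `unknown` as a warning.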

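Environment and configuration

# Custom model directory (set before launch)
export VOICEBOX_MODELS_DIR=/path/to/models

# For AMD ROCm GPUs (configured automatically, but can be overridden)
export HSA_OVERRIDE_GFX_VERSION=11.0.0

Docker configuration (docker-compose.yml override):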

services:
  voicebox:
    environment:
      - VOICEBOX_MODELS_DIR=/models
    volumes:
      - /host/models:/models
    ports:
      - "17493:17493"
    # For NVIDIA GPU passthrough:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Common patterns

Voice profile creation flow

// 1. Create the profile
const profile = await fetch(`${VOICEBOX_URL}/profiles`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ name: "My Voice", language: "en" }),
}).then((r) => r.json());

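// 2. Upload an audio sample (WAV/MP3, ideally 5–30 seconds of clean speech)
// audioBlob is assumed to be a Blob or File you already hold, e.g. from a file input
const formData = new FormData();
formData.append("file", audioBlob, "sample.wav");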

await fetch(`${VOICEBOX_URL}/profiles/${profile.id}/samples`, {
  method: "POST",
  body: formData,
});

// 3. Generate speech with the new profile
const gen = await generateSpeech({
  text: "Testing my cloned voice.",
  profile_id: profile.id,
});

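Batch generation with the queue

async function batchGenerate(
  items: Array<{ text: string; profileId: string }>,
  // Typed against GenerateRequest so the default is not widened to string
  engine: GenerateRequest["engine"] = "qwen3-tts"
): Promise<string[]> {
  // Submit every job up front; Voicebox processes them serially to avoid GPU contention
  const submissions = await Promise.all(
    items.map((item) =>
      generateSpeech({ text: item.text, profile_id: item.profileId, engine })
    )
  );

  // Wait for all jobs to finish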
  const audioUrls = await Promise.all(
    submissions.map((s) => waitForGeneration(s.generation_id))
  );

  return audioUrls;
}

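Long text (automatic chunking)

Voicebox chunks long input at sentence boundaries automatically, so you can send the full text in one request:

// Up to 50,000 characters are supported
const longScript = `
  Chapter one. The morning mist settled over the valley...
`;

await generateSpeech({
  text: longScript,
  profile_id: "narrator-profile-id",
  engine: "tada", // best suited to long-form coherence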
  language: "en",
});
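
To make "chunking at sentence boundaries" concrete, here is a minimal illustration of the general technique. It is not Voicebox's actual chunker: `chunkBySentence`, its `maxChars` parameter, and the punctuation rule are all assumptions made for the example.

```typescript
// Illustrative only: greedy sentence-boundary chunking, not Voicebox's implementation.
function chunkBySentence(text: string, maxChars: number): string[] {
  // A sentence is a run of text ending in ., !, or ? (or the unterminated tail).
  const matches: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence;
    if (current === "" || candidate.length <= maxChars) {
      current = candidate; // a single oversized sentence is kept whole
    } else {
      chunks.push(current);
      current = sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Sentences are never split mid-way, so a chunk can exceed `maxChars` only when one sentence alone is longer than the limit.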

Troubleshooting

API not responding

# Check whether the backend is running
curl http://localhost:17493/health

# Restart only the backend (development mode)
just backend

# View logs
just logs

GPU not detected

# Check which inference backend was detected
curl http://localhost:17493/system/info

# Force CPU mode (set before launch)
export VOICEBOX_FORCE_CPU=1

Model downloads fail or are slow

# Point to a custom model directory with enough free space
export VOICEBOX_MODELS_DIR=/path/with/space
just dev

# Cancel a stuck download through the API
curl -X DELETE http://localhost:17493/models/{model_id}/download

Out of VRAM: unload models

# List loaded models
curl http://localhost:17493/models | jq '.[] | select(.loaded == true)'

# Unload a specific model
curl -X POST http://localhost:17493/models/{model_id}/unload
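
The same two steps can be scripted. The sketch below mirrors the jq filter and builds the unload URLs without sending any requests. It assumes `GET /models` returns an array whose items carry a boolean `loaded` flag (as the jq filter implies) and an `id` field, which is a guess; `loadedModels` and `unloadUrls` are hypothetical helpers.

```typescript
// Assumed response item shape: `loaded` comes from the jq filter above,
// `id` is a guess at the identifier field name.
interface ModelEntry {
  id: string;
  loaded: boolean;
}

// Mirrors `jq '.[] | select(.loaded == true)'`.
function loadedModels(models: ModelEntry[]): ModelEntry[] {
  return models.filter((m) => m.loaded === true);
}

// Builds the POST target for each loaded model's unload endpoint.
function unloadUrls(baseUrl: string, models: ModelEntry[]): string[] {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return loadedModels(models).map(
    (m) => `${base}/models/${encodeURIComponent(m.id)}/unload`
  );
}
```

Each returned URL can then be called with `fetch(url, { method: "POST" })`.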

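Audio quality issues

  • Use 5–30 seconds of clean, noise-free audio for voice samples
  • Multiple samples improve clone quality; upload 3–5 different sentences
  • For multilingual cloning, use the chatterbox engine
  • For best results, sample audio should be mono WAV at 16kHz or higher
  • For English, luxtts gives the highest output quality (48kHz)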
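
The sample guidelines above can be checked before upload. The sketch below assumes you have already read the sample's duration, sample rate, and channel count by other means; `SampleInfo` and `checkSample` are hypothetical names, and the thresholds are taken from the list.

```typescript
// Hypothetical description of an audio sample's basic properties.
interface SampleInfo {
  durationSec: number;
  sampleRateHz: number;
  channels: number;
}

// Returns one warning per guideline the sample misses; an empty array means it passes.
function checkSample(info: SampleInfo): string[] {
  const warnings: string[] = [];
  if (info.durationSec < 5 || info.durationSec > 30) {
    warnings.push("duration should be between 5 and 30 seconds");
  }
  if (info.sampleRateHz < 16_000) {
    warnings.push("sample rate should be 16 kHz or higher");
  }
  if (info.channels !== 1) {
    warnings.push("audio should be mono");
  }
  return warnings;
}
```

These are recommendations rather than hard limits, so warnings are returned instead of thrown.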

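Generations stuck in the queue after a crash

Voicebox automatically recovers stale generations on startup. If the problem persists, retry the generation manually: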

curl -X POST http://localhost:17493/generations/{generation_id}/retry

Frontend integration (React example)

import { useState } from "react";

const VOICEBOX_URL = import.meta.env.VITE_VOICEBOX_URL ?? "http://localhost:17493";

export function VoiceGenerator({ profileId }: { profileId: string }) {
  const [text, setText] = useState("");
  const [audioUrl, setAudioUrl] = useState<string | null>(null);
  const [loading, setLoading] = useState(false);

  const handleGenerate = async () => {
    setLoading(true);
    try {
      const res = await fetch(`${VOICEBOX_URL}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, profile_id: profileId, language: "en" }),
      });
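      if (!res.ok) {
        throw new Error(`Voicebox API error: ${res.status}`);
      }
      const { generation_id } = await res.json();

      // Poll until the generation completes
      let done = false;
      while (!done) {
        await new Promise((r) => setTimeout(r, 1000));
        const statusRes = await fetch(`${VOICEBOX_URL}/generations/${generation_id}`);
        const { status } = await statusRes.json();
        if (status === "complete") {
          setAudioUrl(`${VOICEBOX_URL}/generations/${generation_id}/audio`);
          done = true;
        } else if (status === "failed") {
          throw new Error("Generation failed");
        }
      }
    } catch (err) {
      // Log the failure so the click handler does not leave an unhandled rejection
      console.error(err);
    } finally {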
      setLoading(false);
    }
  };

  return (
    <div>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button onClick={handleGenerate} disabled={loading}>
        {loading ? "Generating..." : "Generate speech"}
      </button>
      {audioUrl && <audio controls src={audioUrl} />}
    </div>
  );
}
