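🚀 Quick install

Copy and run the following command to install this skill:

npx skills add https://skills.sh/aradotso/trending-skills/voicebox-voice-synthesis

💡 Note: requires Node.js and npm.

Voicebox Voice Synthesis Studio

Skill by ara.so, part of the Daily 2026 Skills series.

Voicebox is a local-first, open-source voice cloning and TTS studio, a self-hostable alternative to ElevenLabs. It runs entirely on your own machine (MLX/Metal on macOS, CUDA on Windows/Linux, with CPU fallback), exposes a REST API on localhost:17493, and ships with 5 TTS engines, 23 languages, post-processing effects, and a multi-track story editor.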

Installation

Pre-built binaries (recommended)

Platform	Link
macOS Apple Silicon	https://voicebox.sh/download/mac-arm
macOS Intel	https://voicebox.sh/download/mac-intel
Windows	https://voicebox.sh/download/windows
Docker	`docker compose up`

Linux requires building from source: https://voicebox.sh/linux-install

Build from source

Prerequisites: Bun, Rust, Python 3.11+, and the Tauri prerequisites

git clone https://github.com/jamiepine/voicebox.git
cd voicebox

# Install the just task runner
brew install just        # macOS
cargo install just       # any platform

# Set up the Python venv and all dependencies
just setup

# Start the backend and the desktop app in dev mode
just dev

# List all available commands
just --list

Architecture

Layer	Technology
Desktop app	Tauri (Rust)
Frontend	React + TypeScript + Tailwind CSS
State management	Zustand + React Query
Backend	FastAPI (Python) on port 17493
TTS engines	Qwen3-TTS, LuxTTS, Chatterbox, Chatterbox Turbo, TADA
Effects	Pedalboard (Spotify)
Transcription	Whisper / Whisper Turbo
Inference	MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU)
Database	SQLite

The Python FastAPI backend handles all ML inference. The Tauri Rust shell wraps the frontend and manages the backend process lifecycle. The API is reachable directly at http://localhost:17493 even while the desktop app is in use.

REST API reference

Base URL: http://localhost:17493
Interactive docs: http://localhost:17493/docs

Generate speech

# Basic generation
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello world, this is a voice clone.",
    "profile_id": "abc123",
    "language": "en"
  }'

# With engine selection
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Speak slowly and with gravitas.",
    "profile_id": "abc123",
    "language": "en",
    "engine": "qwen3-tts"
  }'

# With paralinguistic tags (Chatterbox Turbo only)
curl -X POST http://localhost:17493/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "That is absolutely hilarious! [laugh] I cannot believe it.",
    "profile_id": "abc123",
    "engine": "chatterbox-turbo",
    "language": "en"
  }'
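The same request body can be assembled programmatically. The following is a minimal Python sketch that builds (but does not send) a POST /generate request, using only the fields and engine identifiers that appear in the curl examples above. The function name `build_generate_request` and the choice to omit optional fields rather than send them as null are assumptions made for illustration, not part of Voicebox.

```python
import json
import urllib.request

# Engine identifiers as they appear in this document's examples.
KNOWN_ENGINES = {"qwen3-tts", "luxtts", "chatterbox", "chatterbox-turbo", "tada"}


def build_generate_request(text, profile_id, language=None, engine=None,
                           base_url="http://localhost:17493"):
    """Build (but do not send) a POST /generate request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if engine is not None and engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")

    payload = {"text": text, "profile_id": profile_id}
    # Optional fields are left out instead of being sent as null (an assumption).
    if language is not None:
        payload["language"] = language
    if engine is not None:
        payload["engine"] = engine

    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a running backend, the request could then be sent with `urllib.request.urlopen(req)`. Validating the engine name locally catches typos before a request ever reaches the queue.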

Voice profiles

# List all profiles
curl http://localhost:17493/profiles

# Create a new profile
curl -X POST http://localhost:17493/profiles \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Narrator",
    "language": "en",
    "description": "Deep narrative voice"
  }'

# Upload an audio sample to a profile
curl -X POST http://localhost:17493/profiles/{profile_id}/samples \
  -F "file=@/path/to/voice-sample.wav"

# Export a profile
curl http://localhost:17493/profiles/{profile_id}/export \
  --output narrator-profile.zip

# Import a profile
curl -X POST http://localhost:17493/profiles/import \
  -F "file=@narrator-profile.zip"

Generation queue and status

# Get generation status (SSE stream)
curl -N http://localhost:17493/generate/{generation_id}/status

# List recent generations
curl http://localhost:17493/generations

# Retry a failed generation
curl -X POST http://localhost:17493/generations/{generation_id}/retry

# Download the generated audio
curl http://localhost:17493/generations/{generation_id}/audio \
  --output output.wav
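The status endpoint streams Server-Sent Events, so a client outside the browser has to parse the `data:` lines itself. Below is a minimal Python sketch of such a parser. It assumes each event carries a JSON object with a `status` field whose terminal values are "complete" and "failed", which matches the client examples in this document but is not verified against the server. The function names are hypothetical.

```python
import json


def iter_sse_data(lines):
    """Yield the decoded JSON payload of each SSE event in an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # In the SSE format, one optional space after the colon is stripped.
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is joined with "\n".
            yield json.loads("\n".join(buffer))
            buffer = []
        # Comment lines (":...") and other fields (event:, id:, retry:) are ignored.


def final_status(lines):
    """Return the first terminal status ("complete" or "failed"), or None."""
    for event in iter_sse_data(lines):
        if event.get("status") in ("complete", "failed"):
            return event["status"]
    return None
```

Any iterable of decoded text lines works as input, for example the line iterator of a streaming HTTP response.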

Models

# List available models and their download status
curl http://localhost:17493/models

# Unload a model from GPU memory (does not delete it)
curl -X POST http://localhost:17493/models/{model_id}/unload

TypeScript/JavaScript integration

Basic TTS client

const VOICEBOX_URL = process.env.VOICEBOX_API_URL ?? "http://localhost:17493";

interface GenerateRequest {
  text: string;
  profile_id: string;
  language?: string;
  engine?: "qwen3-tts" | "luxtts" | "chatterbox" | "chatterbox-turbo" | "tada";
}

interface GenerateResponse {
  generation_id: string;
  status: "queued" | "processing" | "complete" | "failed";
  audio_url?: string;
}

async function generateSpeech(req: GenerateRequest): Promise<GenerateResponse> {
  const response = await fetch(`${VOICEBOX_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });

  if (!response.ok) {
    throw new Error(`Voicebox API error: ${response.status} ${await response.text()}`);
  }

  return response.json();
}

// Usage example
const result = await generateSpeech({
  text: "Welcome to our application.",
  profile_id: "abc123",
  language: "en",
  engine: "qwen3-tts",
});

console.log("Generation ID:", result.generation_id);

Polling for completion

async function waitForGeneration(
  generationId: string,
  timeoutMs = 60_000
): Promise<string> {
  const start = Date.now();

  while (Date.now() - start < timeoutMs) {
    const res = await fetch(`${VOICEBOX_URL}/generations/${generationId}`);
    if (!res.ok) {
      throw new Error(`Status check failed: ${res.status}`);
    }
    const data = await res.json();

    if (data.status === "complete") {
      return `${VOICEBOX_URL}/generations/${generationId}/audio`;
    }
    if (data.status === "failed") {
      throw new Error(`Generation failed: ${data.error}`);
    }

    // Wait one second between polls
    await new Promise((r) => setTimeout(r, 1000));
  }

  throw new Error("Generation timed out");
}

Streaming status with SSE

function streamGenerationStatus(
  generationId: string,
  onStatus: (status: string) => void
): () => void {
  const eventSource = new EventSource(
    `${VOICEBOX_URL}/generate/${generationId}/status`
  );

  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    onStatus(data.status);

    if (data.status === "complete" || data.status === "failed") {
      eventSource.close();
    }
  };

  eventSource.onerror = () => eventSource.close();

  // Return a cleanup function
  return () => eventSource.close();
}

// Usage example
const cleanup = streamGenerationStatus("gen_abc123", (status) => {
  console.log("Status update:", status);
});

Download audio as a Blob

async function downloadAudio(generationId: string): Promise<Blob> {
  const response = await fetch(
    `${VOICEBOX_URL}/generations/${generationId}/audio`
  );

  if (!response.ok) {
    throw new Error(`Failed to download audio: ${response.status}`);
  }

  return response.blob();
}

// Play in the browser
async function playGeneratedAudio(generationId: string): Promise<void> {
  const blob = await downloadAudio(generationId);
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  // Release the object URL once playback finishes
  audio.onended = () => URL.revokeObjectURL(url);
  await audio.play();
}

Python integration

import httpx
import asyncio

VOICEBOX_URL = "http://localhost:17493"

async def generate_speech(
    text: str,
    profile_id: str,
    language: str = "en",
    engine: str = "qwen3-tts"
) -> bytes:
    async with httpx.AsyncClient(timeout=120.0) as client:
        # Submit the generation job
        resp = await client.post(
            f"{VOICEBOX_URL}/generate",
            json={
                "text": text,
                "profile_id": profile_id,
                "language": language,
                "engine": engine,
            }
        )
        resp.raise_for_status()
        generation_id = resp.json()["generation_id"]

        # Poll until complete (one request per second, up to 120 attempts)
        for _ in range(120):
            status_resp = await client.get(
                f"{VOICEBOX_URL}/generations/{generation_id}"
            )
            status_resp.raise_for_status()
            status_data = status_resp.json()

            if status_data["status"] == "complete":
                audio_resp = await client.get(
                    f"{VOICEBOX_URL}/generations/{generation_id}/audio"
                )
                audio_resp.raise_for_status()
                return audio_resp.content

            if status_data["status"] == "failed":
                raise RuntimeError(f"Generation failed: {status_data.get('error')}")

            await asyncio.sleep(1.0)

        raise TimeoutError("Generation timed out after 120 polling attempts")


# Usage example
audio_bytes = asyncio.run(
    generate_speech(
        text="The quick brown fox jumps over the lazy dog.",
        profile_id="your-profile-id",
        language="en",
        engine="chatterbox",
    )
)

with open("output.wav", "wb") as f:
    f.write(audio_bytes)

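TTS engine selection guide

Engine	Best for	Languages	VRAM	Notes
`qwen3-tts` (0.6B/1.7B)	Quality + instructions	10	Medium	Supports delivery instructions embedded in the text
`luxtts`	Fast CPU generation	English only	~1GB	150x real time on CPU, 48kHz
`chatterbox`	Multilingual coverage	23	Medium	Arabic, Hindi, Swahili, CJK, and more
`chatterbox-turbo`	Expressive/emotional speech	English only	Low (350M)	Uses `[laugh]`, `[sigh]`, `[gasp]` tags
`tada` (1B/3B)	Long-form coherence	10	High	700+ seconds of audio, HumeAI model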
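The table can be read as a small decision procedure. The sketch below encodes one possible reading in Python. The function `pick_engine`, its parameters, and the priority order between criteria are assumptions made for illustration; Voicebox itself does not expose such a helper.

```python
def pick_engine(language="en", expressive=False, long_form=False, cpu_only=False):
    """Suggest an engine id from the selection table (a heuristic, not a Voicebox API)."""
    english = language.lower().startswith("en")
    if english and expressive:
        return "chatterbox-turbo"   # English only, paralinguistic tags
    if english and cpu_only:
        return "luxtts"             # English only, fast on CPU
    if not english:
        # The table does not say which 10 languages qwen3-tts and tada cover,
        # so fall back to the engine with the widest coverage (23 languages).
        return "chatterbox"
    if long_form:
        return "tada"               # long-form coherence
    return "qwen3-tts"              # quality + instructions
```

For example, `pick_engine("sw")` suggests `chatterbox`, while `pick_engine("en", long_form=True)` suggests `tada`.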

Delivery instructions (Qwen3-TTS)

Embed natural-language instructions directly in the text:

await generateSpeech({
  text: "(whisper) I have a secret to tell you.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});

await generateSpeech({
  text: "(speak slowly and clearly) Step one: open the application.",
  profile_id: "abc123",
  engine: "qwen3-tts",
});

Paralinguistic tags (Chatterbox Turbo)

const tags = [
  "[laugh]", "[chuckle]", "[gasp]", "[cough]",
  "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]"
];

await generateSpeech({
  text: "Oh really? [gasp] I had no idea! [laugh] That's incredible.",
  profile_id: "abc123",
  engine: "chatterbox-turbo",
});
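Because the tags are plain bracketed text, a typo such as `[laughs]` is easy to miss. Here is a small Python sketch that checks a script against the tag list above and can strip tags before the text is reused with an engine that does not document support for them. The helper names are hypothetical, and how other engines actually treat bracketed tags is not specified here.

```python
import re

# Tags listed in this document for Chatterbox Turbo.
SUPPORTED_TAGS = {
    "[laugh]", "[chuckle]", "[gasp]", "[cough]",
    "[sigh]", "[groan]", "[sniff]", "[shush]", "[clear throat]",
}
TAG_PATTERN = re.compile(r"\[[a-z ]+\]")


def unsupported_tags(text):
    """Return bracketed tags in text that are not in the documented list."""
    return [t for t in TAG_PATTERN.findall(text) if t not in SUPPORTED_TAGS]


def strip_tags(text):
    """Remove all bracketed tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s{2,}", " ", TAG_PATTERN.sub("", text)).strip()
```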

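Environment and configuration

# Custom model directory (set before launch)
export VOICEBOX_MODELS_DIR=/path/to/models

# For AMD ROCm GPUs (configured automatically, but can be overridden)
export HSA_OVERRIDE_GFX_VERSION=11.0.0

Docker configuration (docker-compose.yml override):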

services:
  voicebox:
    environment:
      - VOICEBOX_MODELS_DIR=/models
    volumes:
      - /host/models:/models
    ports:
      - "17493:17493"
    # For NVIDIA GPU passthrough:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Common patterns

Voice profile creation flow

// 1. Create the profile
const profile = await fetch(`${VOICEBOX_URL}/profiles`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ name: "My Voice", language: "en" }),
}).then((r) => r.json());

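// 2. Upload an audio sample (WAV/MP3, ideally 5–30 seconds of clean speech).
//    audioBlob is assumed to be a Blob or File obtained elsewhere, e.g. from a file input.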
const formData = new FormData();
formData.append("file", audioBlob, "sample.wav");

await fetch(`${VOICEBOX_URL}/profiles/${profile.id}/samples`, {
  method: "POST",
  body: formData,
});

// 3. Generate speech with the new profile
const gen = await generateSpeech({
  text: "Testing my cloned voice.",
  profile_id: profile.id,
});

Batch generation with the queue

async function batchGenerate(
  items: Array<{ text: string; profileId: string }>,
  engine: GenerateRequest["engine"] = "qwen3-tts"
): Promise<string[]> {
  // Submit all jobs; Voicebox processes them serially to avoid GPU contention
  const submissions = await Promise.all(
    items.map((item) =>
      generateSpeech({ text: item.text, profile_id: item.profileId, engine })
    )
  );

  // Wait for all jobs to finish
  const audioUrls = await Promise.all(
    submissions.map((s) => waitForGeneration(s.generation_id))
  );

  return audioUrls;
}
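The version above rejects as soon as any single job fails, which discards the results of the jobs that succeeded. The Python sketch below shows one alternative: it caps the number of in-flight items and returns per-item results or exceptions in input order. The `submit` and `wait` callables, `max_in_flight`, and the `fake_*` stand-ins are all hypothetical; in practice they would wrap the HTTP calls shown earlier.

```python
import asyncio


async def batch_generate(items, submit, wait, max_in_flight=4):
    """Run submit -> wait for every item, returning results in input order.

    submit(item) returns a generation id and wait(generation_id) returns an
    audio URL; both are caller-supplied coroutines (hypothetical stand-ins
    for the HTTP calls).
    """
    gate = asyncio.Semaphore(max_in_flight)

    async def run_one(item):
        async with gate:  # at most max_in_flight items are in progress at once
            generation_id = await submit(item)
            return await wait(generation_id)

    # return_exceptions=True keeps one failed item from discarding the others.
    return await asyncio.gather(*(run_one(i) for i in items), return_exceptions=True)


# Stand-in coroutines so the sketch runs without a server.
async def fake_submit(text):
    if not text:
        raise ValueError("empty text")
    return f"gen_{len(text)}"


async def fake_wait(generation_id):
    return f"http://localhost:17493/generations/{generation_id}/audio"


results = asyncio.run(batch_generate(["hello", "", "hi"], fake_submit, fake_wait))
```

Here the second item fails, so `results` holds a URL, a `ValueError` instance, and another URL, in that order. Callers can then retry only the failed entries.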

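Long text (automatic chunking)

Voicebox chunks text automatically at sentence boundaries, so you can send the full text as is:

// Up to 50,000 characters are supported
const longScript = `
  Chapter one. The morning mist hung over the valley...
`;

await generateSpeech({
  text: longScript,
  profile_id: "narrator-profile-id",
  engine: "tada", // best for long-form coherence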
  language: "en",
});
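To make "chunking at sentence boundaries" concrete, here is a minimal Python sketch of the general technique: split on sentence-ending punctuation, then greedily pack whole sentences into chunks. It is an illustration only. Voicebox's actual chunking logic is not described in this document, and the `max_chars` value and the splitting rule are assumptions.

```python
import re


def chunk_by_sentence(text, max_chars=500):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after ., ! or ? followed by whitespace; the punctuation stays attached.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # A single sentence longer than max_chars becomes its own oversized chunk.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Since the server does this for you, client-side chunking like this is only useful when you want to control chunk boundaries yourself, for example to generate chapters as separate files.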

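Troubleshooting

API not responding

# Check whether the backend is running
curl http://localhost:17493/health

# Restart only the backend (dev mode)
just backend

# View logs
just logs

GPU not detected

# Check which backend was detected
curl http://localhost:17493/system/info

# Force CPU mode (set before launch)
export VOICEBOX_FORCE_CPU=1

Model download fails or is slow

# Point to a custom model directory with enough free space
export VOICEBOX_MODELS_DIR=/path/with/space
just dev

# Cancel a stuck download through the API
curl -X DELETE http://localhost:17493/models/{model_id}/download

Out of VRAM: unload models

# List loaded models
curl http://localhost:17493/models | jq '.[] | select(.loaded == true)'

# Unload a specific model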
curl -X POST http://localhost:17493/models/{model_id}/unload

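Audio quality issues

Use 5–30 seconds of clean, noise-free audio for voice samples.
Multiple samples improve cloning quality; upload 3–5 different sentences.
For multilingual cloning, use the chatterbox engine.
For best results, sample audio should be mono WAV at 16kHz or higher.
For English, luxtts gives the highest output quality (48kHz).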
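The sample recommendations above (mono, 16kHz or higher, 5–30 seconds) can be checked before upload. The following Python sketch uses the standard-library `wave` module. The function `check_voice_sample` and its thresholds simply mirror the guidance in this section and are not a Voicebox feature; `make_silent_wav` is a demo helper that builds a silent file in memory.

```python
import io
import wave


def check_voice_sample(wav_file, min_seconds=5.0, max_seconds=30.0, min_rate=16_000):
    """Return a list of problems found in a WAV sample (empty means it looks fine)."""
    problems = []
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, found {wav.getnchannels()} channels")
        if rate < min_rate:
            problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
        if not min_seconds <= duration <= max_seconds:
            problems.append(f"duration {duration:.1f}s is outside {min_seconds}-{max_seconds}s")
    return problems


def make_silent_wav(seconds, rate, channels):
    """Demo helper: build a silent 16-bit WAV in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * channels * int(seconds * rate))
    buf.seek(0)
    return buf
```

`check_voice_sample` accepts either a file path or a binary file object. It checks format properties only and cannot detect background noise, so listening to the sample is still worthwhile.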

Generation stuck in the queue after a crash

Voicebox automatically recovers stale generations on startup. If the problem persists:

curl -X POST http://localhost:17493/generations/{generation_id}/retry

Frontend integration (React example)

import { useState } from "react";

const VOICEBOX_URL = import.meta.env.VITE_VOICEBOX_URL ?? "http://localhost:17493";

export function VoiceGenerator({ profileId }: { profileId: string }) {
  const [text, setText] = useState("");
  const [audioUrl, setAudioUrl] = useState<string | null>(null);
  const [loading, setLoading] = useState(false);

  const handleGenerate = async () => {
    setLoading(true);
    try {
      const res = await fetch(`${VOICEBOX_URL}/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text, profile_id: profileId, language: "en" }),
      });
      const { generation_id } = await res.json();

      // Poll until complete
      let done = false;
      while (!done) {
        await new Promise((r) => setTimeout(r, 1000));
        const statusRes = await fetch(`${VOICEBOX_URL}/generations/${generation_id}`);
        const { status } = await statusRes.json();
        if (status === "complete") {
          setAudioUrl(`${VOICEBOX_URL}/generations/${generation_id}/audio`);
          done = true;
        } else if (status === "failed") {
          throw new Error("Generation failed");
        }
      }
    } finally {
      setLoading(false);
    }
  };

  return (
    <div>
      <textarea value={text} onChange={(e) => setText(e.target.value)} />
      <button onClick={handleGenerate} disabled={loading}>
        {loading ? "Generating..." : "Generate speech"}
      </button>
      {audioUrl && <audio controls src={audioUrl} />}
    </div>
  );
}

