
Audio pipelines

Transcribe speech with /v1/audio/transcriptions, reason with chat, speak back with /api/v1/tts/text_to_audio.

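Bastion's audio surface has two endpoints:

  • /v1/audio/transcriptions: speech-to-text, OpenAI-shaped multipart upload.
  • /api/v1/tts/text_to_audio: text-to-speech, called with a JSON body.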

This guide chains them: take a user's voice input → transcribe → run through an LLM → speak the answer back.

Step 1: transcribe

Multipart upload, OpenAI-shaped. The OpenAI SDK works directly:

import OpenAI from "openai";
import { createReadStream } from "node:fs";

const client = new OpenAI({
  baseURL: "https://api.qubittron.ai/v1",
  apiKey: process.env.QUBITTRON_API_KEY,
});

const transcription = await client.audio.transcriptions.create({
  model: "whisper-large-v3-turbo",
  file: createReadStream("./input.wav"),
});

console.log(transcription.text);

whisper-large-v3-turbo is the default; whisper-large-v3 is also available — see /v1/audio/transcriptions.

Caveats:

  • Max upload size is 25 MB. For longer audio, segment client-side and concatenate the transcripts.
  • Supported encodings depend on the upstream model; WAV and MP3 are the safe choices.
  • A 413 response means the upload exceeded the size cap; shrink or split.
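
For the "segment client-side" caveat, here is a minimal sketch of the bookkeeping, assuming you have already decoded the audio to headerless PCM. The names planSegments and joinTranscripts are hypothetical helpers, not part of Bastion or the OpenAI SDK, and the 2-byte frame size assumes 16-bit mono samples.

```typescript
interface ByteRange {
  start: number;
  end: number; // exclusive
}

// Split `totalBytes` of PCM into ranges no larger than `maxBytes`,
// aligned to `frameBytes` so no sample frame is cut in half.
// Assumption: frameBytes = 2 corresponds to 16-bit mono audio.
function planSegments(
  totalBytes: number,
  maxBytes: number,
  frameBytes = 2,
): ByteRange[] {
  const step = Math.floor(maxBytes / frameBytes) * frameBytes;
  if (step <= 0) {
    throw new Error("maxBytes must hold at least one frame");
  }
  const ranges: ByteRange[] = [];
  for (let start = 0; start < totalBytes; start += step) {
    ranges.push({ start, end: Math.min(start + step, totalBytes) });
  }
  return ranges;
}

// Join per-segment transcripts in order, dropping empty ones.
function joinTranscripts(parts: string[]): string {
  return parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" ");
}
```

Pick a maxBytes comfortably below the 25 MB cap so each segment still fits once it is re-wrapped in a container. Fixed-size cuts can land mid-word, so splitting on silence, where you can detect it, is likely to transcribe better.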

Step 2: reason

Plain chat completion against the transcript:

const answer = await client.chat.completions.create({
  model: "gpt-oss-120b",
  messages: [
    {
      role: "system",
      content:
        "You are a concise voice assistant. Reply in one or two short sentences suitable for being spoken aloud.",
    },
    { role: "user", content: transcription.text },
  ],
});

const reply = answer.choices[0]!.message.content ?? "";

The "suitable for being spoken aloud" instruction is doing real work — without it, the model emits bullet lists and markdown, which sound terrible through TTS.
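
Even with that instruction, a model can occasionally slip markdown into a reply. As a defensive fallback, a small sanitizer can run before the text reaches TTS. This is a rough sketch: stripMarkdownForSpeech is a hypothetical helper, and the regular expressions cover only common cases (list markers, headings, links, emphasis, inline code).

```typescript
// Remove common markdown markers so the text reads cleanly when spoken.
// Hypothetical helper; not part of Bastion or the OpenAI SDK.
function stripMarkdownForSpeech(text: string): string {
  return text
    .replace(/^\s{0,3}#{1,6}\s+/gm, "") // heading markers
    .replace(/^\s*(?:[-*+•]|\d+\.)\s+/gm, "") // bullet and numbered-list markers
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // [label](url) becomes label
    .replace(/[*`]+/g, "") // emphasis and inline-code markers
    .replace(/\s+/g, " ") // collapse newlines and repeated spaces
    .trim();
}
```

In the pipeline this would sit between Step 2 and Step 3, for example by passing stripMarkdownForSpeech(reply) as the TTS text.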

Step 3: speak

TTS lives at a different base path (/api/v1/tts/...) and the OpenAI SDK does not cover it. Call fetch directly:

const ttsRes = await fetch(
  "https://api.qubittron.ai/api/v1/tts/text_to_audio",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.QUBITTRON_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text: reply,
      language_code: "en-US",
      encoding: 1, // LINEAR_PCM
      sample_rate_hz: 22050,
      voice_name: "English-US.Female-1",
    }),
  },
);

if (!ttsRes.ok) {
  throw new Error(`TTS failed: ${ttsRes.status}`);
}

const audio = Buffer.from(await ttsRes.arrayBuffer());
// audio is raw LINEAR_PCM @ 22.05 kHz; wrap in a WAV header before playing
// in a browser, or stream straight to a speaker via ffmpeg / sox.

See the TTS reference for the full encoding / sample-rate / voice matrix.
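
The code comment in Step 3 mentions wrapping the raw PCM in a WAV header. A minimal sketch of that step follows, using the standard 44-byte RIFF/WAVE header. It assumes the response is headerless 16-bit little-endian mono PCM; the guide does not state the bit depth, so confirm it against the TTS reference. pcmToWav is a hypothetical helper name.

```typescript
// Prepend a 44-byte RIFF/WAVE header to raw PCM samples.
// Assumption: 16-bit little-endian mono unless told otherwise.
function pcmToWav(
  pcm: Uint8Array,
  sampleRateHz = 22050,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRateHz * blockAlign; // bytes per second of audio
  const wav = new Uint8Array(44 + pcm.byteLength);
  const view = new DataView(wav.buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.byteLength, true); // file size minus first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRateHz, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.byteLength, true); // size of the sample data
  wav.set(pcm, 44);
  return wav;
}
```

Because a Node Buffer is a Uint8Array, the audio value from Step 3 can be passed in directly, for example pcmToWav(audio, 22050), and the result written to a .wav file or handed to a browser as a Blob.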

Putting it together

A minimal end-to-end function:

async function voiceTurn(inputPath: string): Promise<Buffer> {
  const transcript = await client.audio.transcriptions.create({
    model: "whisper-large-v3-turbo",
    file: createReadStream(inputPath),
  });

  const completion = await client.chat.completions.create({
    model: "gpt-oss-120b",
    messages: [
      {
        role: "system",
        content:
          "You are a concise voice assistant. Reply in one or two short sentences suitable for being spoken aloud.",
      },
      { role: "user", content: transcript.text },
    ],
  });

  const reply = completion.choices[0]!.message.content ?? "";

  const ttsRes = await fetch(
    "https://api.qubittron.ai/api/v1/tts/text_to_audio",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.QUBITTRON_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: reply,
        language_code: "en-US",
        encoding: 1, // LINEAR_PCM
        sample_rate_hz: 22050,
        voice_name: "English-US.Female-1",
      }),
    },
  );

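  if (!ttsRes.ok) {
    throw new Error(`TTS failed: ${ttsRes.status}`);
  }

  // Raw LINEAR_PCM bytes; wrap in a WAV header before playback if needed.
  return Buffer.from(await ttsRes.arrayBuffer());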
}

Latency tips

End-to-end voice latency is dominated by the LLM step. To keep round-trips snappy:

  • Stream the chat completion and start synthesizing TTS chunks as soon as you have a sentence boundary, rather than waiting for the full response.
  • Pick the smallest model that gives acceptable quality for your use case — gpt-oss-20b and Llama-3.1-8B-Instruct are good first picks for voice.
  • Cap max_tokens aggressively. Voice replies should be ~30–60 tokens; nobody wants to hear a paragraph.
  • 16 kHz mono LINEAR_PCM is roughly 27% fewer bytes than 22.05 kHz at the same bit depth, and is plenty for speech.
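
The first tip, synthesizing at sentence boundaries, needs something to buffer streamed text and release it one sentence at a time. Below is a minimal sketch. SentenceChunker is a hypothetical helper, and its boundary rule (., ! or ? followed by whitespace) is deliberately naive, so abbreviations such as "e.g. " will split early.

```typescript
// Buffers streamed text deltas and returns complete sentences as they form.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.buffer += delta;
    const sentences: string[] = [];
    // Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    while (boundary.exec(this.buffer) !== null) {
      const end = boundary.lastIndex; // index just past the matched boundary
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence.length > 0) {
        sentences.push(sentence);
      }
      start = end;
    }
    // Keep the unfinished tail for the next delta.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call once the stream ends to get whatever text remains.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

With the OpenAI SDK, you would typically pass stream: true to chat.completions.create, feed each chunk's choices[0]?.delta?.content ?? "" into push, send every returned sentence to the TTS endpoint, and call flush after the loop to speak the final fragment.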

For retries, streaming, and production hardening, see the Streaming guide and the Errors and retries guide.
