Audio pipelines
Transcribe speech with /v1/audio/transcriptions, reason with chat, speak back with /api/v1/tts/text_to_audio.
Bastion's audio surface has two endpoints:
- POST /v1/audio/transcriptions: speech-to-text, OpenAI-compatible multipart upload.
- POST /api/v1/tts/text_to_audio: text-to-speech via NVIDIA Riva (note the different base path).
This guide chains them: take a user's voice input → transcribe → run through an LLM → speak the answer back.
Step 1: transcribe
Multipart upload, OpenAI-shaped. The OpenAI SDK works directly:
import OpenAI from "openai";
import { createReadStream } from "node:fs";
const client = new OpenAI({
  baseURL: "https://api.qubittron.ai/v1",
  apiKey: process.env.QUBITTRON_API_KEY,
});
const transcription = await client.audio.transcriptions.create({
  model: "whisper-large-v3-turbo",
  file: createReadStream("./input.wav"),
});
console.log(transcription.text);
whisper-large-v3-turbo is the default; whisper-large-v3 is also available. See /v1/audio/transcriptions.
Caveats:
- Max upload size is 25 MB. For longer audio, segment client-side and concatenate the transcripts.
- Supported encodings depend on the upstream model; WAV and MP3 are the safe choices.
- A 413 response means the upload exceeded the size cap; shrink or split the file.
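The size cap means long recordings have to be cut before upload. A minimal sketch of the planning step, assuming uncompressed PCM audio where a cut is safe on any frame boundary. `planSegments` and `joinTranscripts` are hypothetical helpers, not part of any SDK, and the byte budget you pass should sit somewhat below the cap to leave room for the WAV header and multipart framing:

```typescript
// Hypothetical helpers for illustration; not part of the OpenAI SDK or the API.

interface ByteRange {
  start: number; // inclusive offset into the raw PCM data
  end: number; // exclusive offset
}

// Split `totalBytes` of PCM into ranges no larger than `maxBytes`,
// keeping every cut on a frame boundary
// (`frameBytes` = channels * bytes per sample).
function planSegments(
  totalBytes: number,
  maxBytes: number,
  frameBytes: number,
): ByteRange[] {
  const step = Math.floor(maxBytes / frameBytes) * frameBytes;
  if (step <= 0) {
    throw new Error("maxBytes must hold at least one frame");
  }
  const ranges: ByteRange[] = [];
  for (let start = 0; start < totalBytes; start += step) {
    ranges.push({ start, end: Math.min(start + step, totalBytes) });
  }
  return ranges;
}

// Join per-segment transcripts in order, dropping empty ones.
function joinTranscripts(parts: string[]): string {
  return parts
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join(" ");
}
```

Each range would then be wrapped in its own container and uploaded as a separate transcription request. A cut at a fixed byte offset can land mid-word, so cutting at a pause, where one can be detected, is likely to transcribe more cleanly.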
Step 2: reason
Plain chat completion against the transcript:
const answer = await client.chat.completions.create({
  model: "gpt-oss-120b",
  messages: [
    {
      role: "system",
      content:
        "You are a concise voice assistant. Reply in one or two short sentences suitable for being spoken aloud.",
    },
    { role: "user", content: transcription.text },
  ],
});
const reply = answer.choices[0]!.message.content ?? "";
The "suitable for being spoken aloud" instruction is doing real work: without it, the model emits bullet lists and markdown, which sound terrible through TTS.
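The prompt does most of the work, but a model can still slip in list markers or emphasis. A defensive sketch that strips the most common markdown before the text reaches TTS. `toSpeakable` is a hypothetical helper, and its patterns are a small illustrative set rather than a full markdown parser:

```typescript
// Hypothetical helper; covers common markdown only, not the full syntax.
// Known limitation: a literal "*" used as a multiplication sign is also removed.
function toSpeakable(text: string): string {
  return text
    .replace(/`{3}[\s\S]*?`{3}/g, " ") // drop fenced code blocks entirely
    .replace(/`([^`]*)`/g, "$1") // inline code: keep the text
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // links: keep the label
    .replace(/^\s{0,3}#{1,6}\s+/gm, "") // heading markers
    .replace(/^\s*(?:[-*+]|\d+\.)\s+/gm, "") // bullet and numbered-list markers
    .replace(/(\*\*|__|\*)(.+?)\1/g, "$2") // bold and italic markers
    .replace(/\s+/g, " ") // collapse newlines and runs of spaces
    .trim();
}
```

Applied to the reply from the chat step, `toSpeakable(reply)` would be the string passed on as the TTS `text` field.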
Step 3: speak
TTS lives at a different base path (/api/v1/tts/...) and the OpenAI SDK does not cover it. Call fetch directly:
const ttsRes = await fetch(
  "https://api.qubittron.ai/api/v1/tts/text_to_audio",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.QUBITTRON_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text: reply,
      language_code: "en-US",
      encoding: 1, // LINEAR_PCM
      sample_rate_hz: 22050,
      voice_name: "English-US.Female-1",
    }),
  },
);
if (!ttsRes.ok) {
  throw new Error(`TTS failed: ${ttsRes.status}`);
}
const audio = Buffer.from(await ttsRes.arrayBuffer());
// audio is raw LINEAR_PCM @ 22.05 kHz; wrap in a WAV header before playing
// in a browser, or stream straight to a speaker via ffmpeg / sox.
See the TTS reference for the full encoding / sample-rate / voice matrix.
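The code comment above mentions wrapping the raw PCM in a WAV header. A minimal sketch of that step, assuming the response body is 16-bit signed little-endian mono samples. The bit depth and channel count are assumptions here, so check the TTS reference for what the endpoint actually returns. `pcmToWav` is a hypothetical helper:

```typescript
// Hypothetical helper: prepend a 44-byte canonical WAV (RIFF) header to raw PCM.
// The 16-bit mono defaults are assumptions, not documented endpoint behaviour.
function pcmToWav(
  pcm: Uint8Array,
  sampleRateHz: number,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per frame
  const byteRate = sampleRateHz * blockAlign; // bytes per second
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);

  // Write a four-character ASCII chunk tag at the given offset.
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) {
      view.setUint8(offset + i, tag.charCodeAt(i));
    }
  };

  writeTag(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size = file size - 8
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = integer PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRateHz, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeTag(36, "data");
  view.setUint32(40, pcm.length, true); // size of the sample data
  wav.set(pcm, 44);
  return wav;
}
```

With the `audio` buffer from the request above, `pcmToWav(audio, 22050)` would yield bytes that can be written to a .wav file or handed to a browser `Blob`. The sample rate passed here has to match the `sample_rate_hz` sent in the request, otherwise playback runs at the wrong speed.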
Putting it together
A minimal end-to-end function:
async function voiceTurn(inputPath: string): Promise<Buffer> {
  // Step 1: speech to text
  const transcript = await client.audio.transcriptions.create({
    model: "whisper-large-v3-turbo",
    file: createReadStream(inputPath),
  });
  // Step 2: generate a short, speakable reply
  const completion = await client.chat.completions.create({
    model: "gpt-oss-120b",
    messages: [
      {
        role: "system",
        content:
          "You are a concise voice assistant. Reply in one or two short sentences suitable for being spoken aloud.",
      },
      { role: "user", content: transcript.text },
    ],
  });
  const reply = completion.choices[0]!.message.content ?? "";
  // Step 3: text to speech
  const ttsRes = await fetch(
    "https://api.qubittron.ai/api/v1/tts/text_to_audio",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.QUBITTRON_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        text: reply,
        language_code: "en-US",
        encoding: 1, // LINEAR_PCM
        sample_rate_hz: 22050,
        voice_name: "English-US.Female-1",
      }),
    },
  );
  // Fail loudly rather than returning an error body as if it were audio
  if (!ttsRes.ok) {
    throw new Error(`TTS failed: ${ttsRes.status}`);
  }
  return Buffer.from(await ttsRes.arrayBuffer());
}
Latency tips
End-to-end voice latency is dominated by the LLM step. To keep round-trips snappy:
- Stream the chat completion and start synthesizing TTS chunks as soon as you have a sentence boundary, rather than waiting for the full response.
- Pick the smallest model that gives acceptable quality for your use case; gpt-oss-20b and Llama-3.1-8B-Instruct are good first picks for voice.
- Cap max_tokens aggressively. Voice replies should be ~30–60 tokens; nobody wants to hear a paragraph.
- 16 kHz mono LINEAR_PCM is about 27% fewer bytes than 22.05 kHz at the same bit depth and is plenty for speech.
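The first tip, synthesizing at sentence boundaries while the completion streams, needs a small buffer between the two stages. A minimal sketch, assuming text deltas arrive as plain strings. `SentenceBuffer` is a hypothetical class, and its boundary rule (terminal punctuation followed by whitespace) is a simple heuristic that will mis-split abbreviations such as "Dr. Smith":

```typescript
// Hypothetical helper: accumulate streamed text and release whole sentences.
class SentenceBuffer {
  private pending = "";

  // Feed one streamed delta; returns any sentences completed by it.
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    // A boundary is ., ! or ? (possibly repeated) followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let consumed = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.pending.slice(consumed, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      consumed = end;
    }
    // Keep the unfinished tail for the next delta.
    this.pending = this.pending.slice(consumed);
    return sentences;
  }

  // Call once the stream ends to release whatever text is left.
  flush(): string {
    const rest = this.pending.trim();
    this.pending = "";
    return rest;
  }
}
```

In the pipeline, each sentence returned by `push` would go to the TTS endpoint as its own request, with `flush()` called after the final chunk. With the OpenAI SDK, passing `stream: true` to `chat.completions.create` yields chunks whose `choices[0]?.delta?.content` carries the text delta to feed in.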
For retry / streaming / production hardening, see Streaming and Errors and retries.