RAG with embeddings

Build a minimal retrieval-augmented generation loop using Bastion embeddings and chat completions.

This guide walks through the smallest useful RAG loop:

  1. Split your documents into chunks.
  2. Embed each chunk via /v1/embeddings.
  3. At query time, embed the question, find the closest chunks by cosine similarity, and pass them as context to /v1/chat/completions.

The example uses an in-memory store to keep the moving parts visible. In production you'd swap that for a vector database (pgvector, Qdrant, Pinecone, etc.) — the shape of the code stays the same.

1. Chunk and embed

Pick a chunking strategy that matches your content. For prose, ~500–1,000 tokens per chunk with ~10% overlap is a sensible default. The example below assumes you've already chunked into strings:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.qubittron.ai/v1",
  apiKey: process.env.QUBITTRON_API_KEY,
});

type Embedded = { text: string; vector: number[] };

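// Embed every chunk in one request and pair each vector with its source text.
async function embedAll(chunks: string[]): Promise<Embedded[]> {
  const res = await client.embeddings.create({
    model: "bge-m3",
    input: chunks,
  });
  // Assumes res.data comes back in the same order as the input array.
  return res.data.map((d, i) => ({ text: chunks[i]!, vector: d.embedding }));
}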
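
embedAll expects chunks that already exist. If you still need to produce them, the sketch below shows one way to get fixed-size chunks with overlap. It is an illustration only: `chunkWords` is a hypothetical helper, the 750-word default is an arbitrary value, and whitespace-separated words are used as a rough stand-in for tokens, so real token counts will differ.

```typescript
// Hypothetical helper: fixed-size chunks with overlap.
// Assumption: whitespace-separated words roughly approximate tokens.
function chunkWords(text: string, size = 750, overlapRatio = 0.1): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return [];

  const overlap = Math.floor(size * overlapRatio);
  // Each window starts (size - overlap) words after the previous one.
  const step = Math.max(1, size - overlap);

  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}
```

For better fidelity to the model's limits, the word count could be replaced with a real tokenizer.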

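bge-m3 is the default embedding model; see /v1/embeddings for Qwen3-Embedding-8B and bge-multilingual-gemma2 if you need higher dimensions or different language coverage. Note that embedAll as written sends every chunk in a single request. For larger corpora, embed in batches of 50–100 strings per call: that amortizes per-request overhead while keeping each request small.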
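
The batching advice can be sketched as a wrapper that slices the input and embeds one slice at a time. `embedInBatches` and the `EmbedFn` type are hypothetical names; the embedding call is passed in as a function so the sketch does not depend on a particular client, and the default of 64 is just one value inside the suggested 50–100 range.

```typescript
// Same shape as the Embedded type above; repeated so the sketch stands alone.
type Embedded = { text: string; vector: number[] };

// Assumed contract: resolves to one vector per input string, in input order.
type EmbedFn = (batch: string[]) => Promise<number[][]>;

async function embedInBatches(
  chunks: string[],
  embed: EmbedFn,
  batchSize = 64,
): Promise<Embedded[]> {
  const out: Embedded[] = [];
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embed(batch);
    if (vectors.length !== batch.length) {
      throw new Error(`expected ${batch.length} vectors, got ${vectors.length}`);
    }
    batch.forEach((text, i) => out.push({ text, vector: vectors[i]! }));
  }
  return out;
}
```

With the client from this guide, `embed` could be `async (batch) => (await client.embeddings.create({ model: "bge-m3", input: batch })).data.map((d) => d.embedding)`.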

2. Cosine similarity

Cosine similarity is the standard scoring function for embeddings. The version below divides by both norms, so it also works when the vectors are not already normalized:

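function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i]! * b[i]!;
    normA += a[i]! * a[i]!;
    normB += b[i]! * b[i]!;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; score it 0 instead of returning NaN.
  return denom === 0 ? 0 : dot / denom;
}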

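// Score every chunk against the query and keep the k highest-scoring ones.
function topK(query: number[], corpus: Embedded[], k = 4): Embedded[] {
  return corpus
    .map((c) => ({ ...c, score: cosine(query, c.vector) })) // map returns a new array, so corpus is not mutated
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}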

For a few thousand chunks, this in-memory linear scan is fast enough. Beyond roughly 10k chunks, use a vector DB with an ANN index, because a linear scan scores every chunk on every query and its cost grows with the corpus.
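
Since the scan recomputes both norms for every comparison, one common shortcut is to normalize each vector once at indexing time and score with a plain dot product, which equals cosine similarity for unit-length vectors. A minimal sketch of that idea; `normalize` and `dot` are hypothetical helper names, not part of any SDK:

```typescript
// Scale a vector to unit length. A zero vector is returned as a copy,
// unchanged, to avoid dividing by zero.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Dot product; equals cosine similarity when both inputs are unit length.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i]! * b[i]!;
  return sum;
}

const u = normalize([3, 4]); // [0.6, 0.8]
const w = normalize([4, 3]); // [0.8, 0.6]
console.log(dot(u, w)); // approximately 0.96
```

Normalizing the stored vectors and the query vector up front means the per-query loop does one multiply-add per dimension instead of three.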

3. Retrieve, then answer

Embed the question, retrieve the top-k chunks, and inject them into the system message:

async function answer(question: string, corpus: Embedded[]): Promise<string> {
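  // Embed the question with the same model used for the corpus;
  // vectors from different models are not comparable.
  const res = await client.embeddings.create({
    model: "bge-m3",
    input: [question],
  });
  const qVec = res.data[0]!.embedding;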

  const context = topK(qVec, corpus)
    .map((c, i) => `[chunk ${i + 1}]\n${c.text}`)
    .join("\n\n");

  const completion = await client.chat.completions.create({
    model: "gpt-oss-120b",
    messages: [
      {
        role: "system",
        content:
          "Answer using only the provided context. If the context doesn't contain the answer, say you don't know.\n\nContext:\n" +
          context,
      },
      { role: "user", content: question },
    ],
  });

  return completion.choices[0]!.message.content ?? "";
}

The instruction to answer only from the context, and to say "I don't know" otherwise, is the most important line in the prompt. Without it, the model tends to fall back on its training data and confabulate, with nothing in the output to signal that it has done so.

4. What to improve next

Once the basic loop works, these are the high-leverage upgrades, in order:

  • Cite chunks. Have the model reference [chunk 1], [chunk 2] in its answer; expose those as clickable sources in your UI.
  • Better chunking. Recursive chunking that respects headings / paragraph boundaries usually beats naive character splits.
  • Hybrid search. Combine BM25 (keyword) and vector retrieval; rerank the union with a cross-encoder. Cheap quality wins.
  • Move to a real vector DB. Once you cross ~10k chunks or you need persistence, swap the in-memory corpus[] for pgvector or Qdrant. The retrieve-then-answer flow stays the same; only the topK lookup changes.
  • Cache embeddings. Embeddings of unchanged chunks don't need to be recomputed on every deploy. Persist them keyed by chunk hash.
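
The caching item can be sketched as a lookup keyed by a SHA-256 hash of the chunk text. The sketch uses Node's built-in `node:crypto` module; `chunkKey`, `embedWithCache`, the `Map`-based cache, and the injected `embed` function are illustrative assumptions. In a real deployment the map would be backed by something persistent, such as a file or a database table.

```typescript
import { createHash } from "node:crypto";

// Assumed contract: resolves to one vector per input string, in input order.
type EmbedFn = (batch: string[]) => Promise<number[][]>;

// Stable cache key: SHA-256 of the chunk text, hex-encoded.
function chunkKey(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

async function embedWithCache(
  chunks: string[],
  cache: Map<string, number[]>,
  embed: EmbedFn,
): Promise<number[][]> {
  // Distinct chunks that have no cached vector yet.
  const missing = Array.from(
    new Set(chunks.filter((c) => !cache.has(chunkKey(c)))),
  );
  if (missing.length > 0) {
    const vectors = await embed(missing);
    if (vectors.length !== missing.length) {
      throw new Error(`expected ${missing.length} vectors, got ${vectors.length}`);
    }
    missing.forEach((text, i) => cache.set(chunkKey(text), vectors[i]!));
  }
  // Every chunk now has an entry; return vectors in input order.
  return chunks.map((c) => cache.get(chunkKey(c))!);
}
```

Keying by content hash rather than by position or filename means an edited chunk gets a new key and is re-embedded, while untouched chunks are served from the cache.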

For production hardening — retries, rate limits, observability — see Errors and retries and the Production checklist.
