RAG with embeddings
Build a minimal retrieval-augmented generation loop using Bastion embeddings and chat completions.
This guide walks through the smallest useful RAG loop:
- Split your documents into chunks.
- Embed each chunk via /v1/embeddings.
- At query time, embed the question, find the closest chunks by cosine similarity, and pass them as context to /v1/chat/completions.
The example uses an in-memory store to keep the moving parts visible. In production you'd swap that for a vector database (pgvector, Qdrant, Pinecone, etc.) — the shape of the code stays the same.
1. Chunk and embed
Pick a chunking strategy that matches your content. For prose, ~500–1,000 tokens per chunk with ~10% overlap is a sensible default. The example below assumes you've already chunked into strings:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.qubittron.ai/v1",
apiKey: process.env.QUBITTRON_API_KEY,
});
type Embedded = { text: string; vector: number[] };
async function embedAll(chunks: string[]): Promise<Embedded[]> {
const res = await client.embeddings.create({
model: "bge-m3",
input: chunks,
});
return res.data.map((d, i) => ({ text: chunks[i]!, vector: d.embedding }));
}
bge-m3 is the default embedding model; see /v1/embeddings for Qwen3-Embedding-8B and bge-multilingual-gemma2 if you need higher dimensions or different language coverage. Embed in batches of 50–100 strings per call to amortize per-request overhead.
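The two preparation steps mentioned in this section, chunking with overlap and batching, can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the API: `chunkWords` and `toBatches` are hypothetical helper names, and whitespace-separated words stand in for tokens. Real token counts depend on the model's tokenizer, so treat the sizes as approximate.

```typescript
// Hypothetical helpers; names and defaults are illustrative assumptions.

// Split text into overlapping windows of `size` words.
// Words only approximate tokens; a real tokenizer will count differently.
function chunkWords(text: string, size = 500, overlapRatio = 0.1): string[] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const overlap = Math.floor(size * overlapRatio);
  const step = Math.max(1, size - overlap); // always advance by at least one word
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Group items into arrays of at most `batchSize`, preserving order.
function toBatches<T>(items: T[], batchSize = 64): T[][] {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With these in place, each batch from `toBatches(chunkWords(doc))` could be passed to `embedAll` in turn and the results concatenated in order.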
2. Cosine similarity
Cosine similarity is the standard scoring function for embeddings. It compares direction only, so vector length does not affect the score:
function cosine(a: number[], b: number[]): number {
let dot = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i]! * b[i]!;
normA += a[i]! * a[i]!;
normB += b[i]! * b[i]!;
}
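const denom = Math.sqrt(normA) * Math.sqrt(normB);
// A zero vector has no direction; return 0 instead of NaN so sorting stays well-defined.
return denom === 0 ? 0 : dot / denom;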
}
function topK(query: number[], corpus: Embedded[], k = 4): Embedded[] {
return [...corpus]
.map((c) => ({ ...c, score: cosine(query, c.vector) }))
.sort((a, b) => b.score - a.score)
.slice(0, k);
}
For a few thousand chunks this is fine in memory. Past that, use a vector DB with an ANN index: a linear scan compares the query against every chunk, so its cost grows with the corpus.
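While the corpus still fits in memory, one inexpensive tweak is to scale every vector to unit length once at ingest time. Cosine similarity between unit-length vectors equals their dot product, so query-time scoring no longer needs to recompute norms. The sketch below makes no assumption about whether the API already returns unit-length vectors; `normalize` and `dot` are hypothetical helper names.

```typescript
// Hypothetical helpers, shown as a sketch rather than a required step.

// Scale a vector to unit length. A zero vector is returned as a copy, unchanged.
function normalize(v: number[]): number[] {
  let sumSquares = 0;
  for (const x of v) sumSquares += x * x;
  const norm = Math.sqrt(sumSquares);
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Dot product; equals cosine similarity when both inputs are unit length.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i]! * b[i]!;
  }
  return sum;
}
```

Normalizing the stored vectors and the query vector the same way keeps the ranking identical to the `cosine` version above.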
3. Retrieve, then answer
Embed the question, retrieve the top-k chunks, and inject them as the system message:
async function answer(question: string, corpus: Embedded[]): Promise<string> {
const res = await client.embeddings.create({
model: "bge-m3",
input: [question],
});
// One input string yields one embedding; index it explicitly, as in embedAll.
const qVec = res.data[0]!.embedding;
const context = topK(qVec, corpus)
.map((c, i) => `[chunk ${i + 1}]\n${c.text}`)
.join("\n\n");
const completion = await client.chat.completions.create({
model: "gpt-oss-120b",
messages: [
{
role: "system",
content:
"Answer using only the provided context. If the context doesn't contain the answer, say you don't know.\n\nContext:\n" +
context,
},
{ role: "user", content: question },
],
});
return completion.choices[0]!.message.content ?? "";
}
The "answer using only the context, say you don't know otherwise" instruction is the most important line in the prompt. Without it, the model will fall back to its training data and silently confabulate.
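Because the context string labels each retrieved chunk as `[chunk N]`, any labels the model echoes in its answer can be mapped back to the chunks that were retrieved. The following is a minimal sketch of that mapping; `citedChunkIndexes` is a hypothetical helper name, and it assumes the model reproduces the label format exactly, which is not guaranteed.

```typescript
// Hypothetical helper: find which retrieved chunks an answer refers to.
// Returns zero-based indexes into the top-k array, sorted and de-duplicated.
// Labels outside 1..retrievedCount are ignored as likely model errors.
function citedChunkIndexes(answer: string, retrievedCount: number): number[] {
  const pattern = /\[chunk (\d+)\]/g;
  const indexes: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const n = Number(match[1]);
    const inRange = n >= 1 && n <= retrievedCount;
    if (inRange && indexes.indexOf(n - 1) === -1) {
      indexes.push(n - 1);
    }
  }
  return indexes.sort((a, b) => a - b);
}
```

To use it, `answer` would need to return the retrieved chunks alongside the text so the indexes can be resolved to sources.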
4. What to improve next
Once the basic loop works, the high-leverage upgrades, in order:
- Cite chunks. Have the model reference [chunk 1], [chunk 2] in its answer; expose those as clickable sources in your UI.
- Better chunking. Recursive chunking that respects headings / paragraph boundaries usually beats naive character splits.
- Hybrid search. Combine BM25 (keyword) and vector retrieval; rerank the union with a cross-encoder. Cheap quality wins.
- Move to a real vector DB. Once you cross ~10k chunks or you need persistence, swap the in-memory corpus[] for pgvector or Qdrant. The retrieve-then-answer code stays identical.
- Cache embeddings. Embeddings of unchanged chunks don't need to be recomputed every deploy. Persist them keyed by chunk hash.
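The caching idea from the list above can be sketched as a lookup that separates chunks with a stored vector from chunks that still need embedding. Everything here is an illustrative assumption: `chunkHash`, `partitionByCache`, and `rememberEmbeddings` are hypothetical names, the cache is a plain `Map` rather than persistent storage, and the 32-bit FNV-1a hash is a dependency-free stand-in for a collision-resistant content hash such as SHA-256.

```typescript
type Embedded = { text: string; vector: number[] };

// FNV-1a (32-bit) over UTF-16 code units, returned as hex.
// Stand-in only: 32 bits can collide on large corpora.
function chunkHash(text: string): string {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (h >>> 0).toString(16);
}

// Split chunks into those already cached (hits) and those to embed (misses).
function partitionByCache(
  chunks: string[],
  cache: Map<string, number[]>,
): { hits: Embedded[]; misses: string[] } {
  const hits: Embedded[] = [];
  const misses: string[] = [];
  for (const text of chunks) {
    const vector = cache.get(chunkHash(text));
    if (vector) hits.push({ text, vector });
    else misses.push(text);
  }
  return { hits, misses };
}

// Store freshly embedded chunks under their content hash.
function rememberEmbeddings(embedded: Embedded[], cache: Map<string, number[]>): void {
  for (const { text, vector } of embedded) {
    cache.set(chunkHash(text), vector);
  }
}
```

Only `misses` would then go through `embedAll`, with the results passed to `rememberEmbeddings`. Vectors from different embedding models are not comparable, so including the model name in the key is worth considering if you expect to switch models.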
For production hardening — retries, rate limits, observability — see Errors and retries and the Production checklist.