Traditional keyword search fails when users phrase queries differently from your content. For example, a user searches for "how to handle user login," but your article is titled "Authentication and Authorization in Strapi." Zero results. Frustrated user.
Retrieval-Augmented Generation (RAG) addresses this by converting your content into vector embeddings that capture meaning, not just keywords. When a user asks a question, the system finds semantically relevant content chunks, feeds them to a Large Language Model (LLM), and returns a grounded answer based on your actual documentation.
This guide walks you through building a complete RAG pipeline over your Strapi content using OpenAI embeddings, Qdrant as a vector database, and GPT-4o-mini for answer generation. By the end, you'll have a working /rag-search endpoint in your Strapi backend that answers natural-language questions using your own headless CMS content.
In brief:
- Extract and chunk Strapi content via the REST API, converting rich text to plain text for embedding.
- Generate vector embeddings with OpenAI's text-embedding-3-small model and store them in Qdrant.
- Build a custom Strapi controller that handles query embedding, similarity search, and LLM response generation.
- Automate embedding sync using Strapi lifecycle hooks so your search index stays current.
Prerequisites
Before starting, you need:
- A Strapi 5 project with at least one Collection Type containing content, for example, articles
- Node.js 18+ installed
- OpenAI API key from platform.openai.com
- Qdrant running locally via Docker (docker run -p 6333:6333 qdrant/qdrant) or a free cloud cluster from qdrant.tech
- Basic familiarity with the Strapi REST API and async JavaScript
How RAG Works with CMS Content
The pipeline has two phases. During ingestion, you pull content from Strapi, split it into chunks, generate embeddings, and store them in a vector database. At query time, the user's question gets embedded with the same model, the vector database returns the most semantically similar chunks, and those chunks get injected into a prompt sent to GPT-4o-mini.
Ingestion: Strapi Content → Chunks → Embeddings → Qdrant
Query: User Question → Embedding → Qdrant Search → Context + Prompt → GPT-4o-mini → Answer
Vector embeddings are arrays of floating-point numbers that represent text as points in high-dimensional space. The embedding model maps semantically similar phrases to nearby points, so "user login" and "authentication" can end up close together even though they share no keywords. This is what makes semantic search possible: you're comparing meaning, not strings.
When comparing these vectors, cosine similarity measures how closely aligned two vectors are, regardless of their magnitude. Higher scores indicate greater similarity, while lower scores indicate weaker similarity. Cosine similarity works well for text because it focuses on the orientation of the vector rather than its length, which makes it useful across varying document sizes.
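To make the math concrete, here is a minimal sketch of cosine similarity in plain JavaScript. Qdrant computes this for you server-side; this snippet exists only to illustrate what the score means:

```javascript
// Cosine similarity: the dot product of two vectors divided by the
// product of their magnitudes. For real-valued vectors the result
// ranges from -1 to 1; 1 means the vectors point in the same direction.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// [2, 4, 6] is just [1, 2, 3] scaled, so the score is 1 despite the
// different magnitudes -- this is why cosine works across document sizes.
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // 1
console.log(cosineSimilarity([1, 0], [0, 1]));       // 0
```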
You might wonder why not just fine-tune an LLM on your headless CMS content instead. The problem is that content changes constantly. Articles get published, updated, and archived daily. Fine-tuning is expensive, slow, and produces a static model that's immediately outdated when someone edits a page. RAG lets you update the knowledge base by re-embedding changed content, with no retraining required.
Grounding is central to why RAG reduces hallucination. When you ask an LLM a question directly, it generates answers from its training data, which may be outdated or wrong for your specific content. By retrieving actual chunks from your documentation and injecting them into the prompt, you constrain the model to answer based on evidence it can see. The LLM becomes a reasoning engine over your content rather than a guessing machine.
One critical constraint: use the same embedding model at both ingestion and query time. Vectors produced by different models live in different embedding spaces, so comparing them yields meaningless similarity scores.
Extract and Chunk Your Strapi Content
Start by pulling content from Strapi's REST API. You need a read-only API token for server-to-server extraction.
A common mistake is assuming relations are populated by default. Strapi returns only top-level fields unless you explicitly request relations with the populate parameter.
Install dependencies first:
npm install openai @qdrant/js-client-rest qs marked
Create a script called scripts/ingest.js to fetch and chunk your content:
const qs = require('qs');
const { marked } = require('marked');
const STRAPI_URL = process.env.STRAPI_URL || 'http://localhost:1337';
const API_TOKEN = process.env.STRAPI_API_TOKEN;
// Fetch all entries with pagination
async function fetchAllContent(contentType) {
let page = 1;
let allEntries = [];
let hasMore = true;
while (hasMore) {
const query = qs.stringify({
pagination: { page, pageSize: 25 },
populate: { blocks: true, author: true, categories: true },
sort: ['publishedAt:desc']
}, { encodeValuesOnly: true });
const response = await fetch(`${STRAPI_URL}/api/${contentType}?${query}`, {
headers: { 'Authorization': `Bearer ${API_TOKEN}` }
});
const data = await response.json();
allEntries = [...allEntries, ...data.data];
const { page: currentPage, pageCount } = data.meta.pagination;
hasMore = currentPage < pageCount;
page++;
}
return allEntries;
}
Using pagination during bulk extraction reduces load on your Strapi server and avoids retrieving the entire dataset in a single request.
Next, convert rich text to plain text. Strapi's Blocks editor stores content as structured JSON, while the legacy rich text editor uses Markdown. Neither format is ready for embedding as-is:
// For Markdown content (legacy editor)
function markdownToPlainText(markdown) {
const html = marked.parse(markdown);
return html.replace(/<[^>]*>/g, '').trim();
}
// For JSON Block content (new Block Editor)
function blocksToPlainText(blocks) {
if (!blocks || !Array.isArray(blocks)) return '';
return blocks
.filter(block => block.type === 'paragraph' || block.type === 'heading')
.map(block => block.children?.map(child => child.text).join('') || '')
.join('\n\n');
}
This plain text conversion step helps embedding quality. Embedding models work best with clean, readable text. HTML tags, Markdown syntax characters, and JSON structure add noise that can affect the resulting vectors.
If you embed raw Markdown like ## Authentication, the model spends part of its capacity encoding the ## characters instead of just the word "Authentication." Stripping formatting helps the vector focus on the semantic meaning of the words rather than structural markup, which can improve similarity matches at query time.
Now chunk the content. For RAG, a practical starting point is chunks of roughly 200 to 800 tokens. Split by semantic boundaries first, then by paragraphs:
function chunkContent(plainText, metadata, maxChars = 3000) {
const sections = plainText.split(/\n#{1,3}\s/);
const chunks = [];
for (const section of sections) {
if (section.length <= maxChars) {
chunks.push({ content: section.trim(), metadata });
} else {
const paragraphs = section.split(/\n\n+/);
let currentChunk = '';
for (const para of paragraphs) {
if ((currentChunk + para).length > maxChars) {
if (currentChunk) chunks.push({ content: currentChunk.trim(), metadata });
currentChunk = para;
} else {
currentChunk += '\n\n' + para;
}
}
if (currentChunk) chunks.push({ content: currentChunk.trim(), metadata });
}
}
return chunks;
}
That chunk range is a trade-off. A few patterns are worth keeping in mind:
- Chunks that are too small can lose the surrounding context that makes them meaningful. The LLM may not know what "it" refers to.
- Chunks that are too large can dilute the relevance signal. If only one paragraph out of 10 is relevant to the query, the other nine take up tokens in the LLM's context window and may confuse the answer.
- Splitting on headings first, rather than using fixed character counts, preserves semantic coherence. Each chunk tends to cover one topic or idea, which makes retrieved content more self-contained and useful.
- Some implementations add overlap between chunks by repeating the last 100 to 200 characters of one chunk at the start of the next. This guide skips overlap for simplicity, but it's worth testing if your search results miss relevant content that spans chunk boundaries.
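If you do decide to test overlap, a small variation on the paragraph loop carries a tail of the previous chunk forward. This is a sketch, not part of the pipeline above; the 150-character default is an arbitrary starting value to tune:

```javascript
// Split text into chunks of up to maxChars, repeating the last `overlap`
// characters of each chunk at the start of the next one so that content
// spanning a chunk boundary is retrievable from either side.
function chunkWithOverlap(plainText, maxChars = 3000, overlap = 150) {
  const paragraphs = plainText.split(/\n\n+/);
  const chunks = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && (current + '\n\n' + para).length > maxChars) {
      chunks.push(current.trim());
      // Seed the next chunk with the tail of the previous one
      current = current.slice(-overlap) + '\n\n' + para;
    } else {
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```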
The metadata attached to each chunk (title, URL, and documentId) is what enables source citation in the final answer. Without it, you'd have no way to link the user back to the original article.
Note that Strapi 5 uses a flattened format where attributes sit directly on the data object, unlike v4's nested attributes key. The extraction code below accounts for this:
async function extractAndChunkContent() {
const articles = await fetchAllContent('articles');
const chunks = articles.flatMap(article => {
const content = article.attributes || article; // v4 vs v5
const plainText = blocksToPlainText(content.blocks)
|| markdownToPlainText(content.body || '');
return chunkContent(plainText, {
documentId: article.documentId || article.id,
title: content.title,
url: `/articles/${content.slug}`,
updatedAt: content.updatedAt,
contentType: 'article'
});
});
return chunks;
}
Generate Embeddings and Store Them in Qdrant
With chunks ready, generate embeddings using OpenAI's text-embedding-3-small model and store them in Qdrant. This model outputs 1,536-dimensional vectors by default, but you can reduce dimensions to save storage:
const OpenAI = require('openai');
const { QdrantClient } = require('@qdrant/js-client-rest');
const openai = new OpenAI(); // reads OPENAI_API_KEY from env
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL || 'http://localhost:6333' });
const COLLECTION_NAME = 'strapi-content';
async function createCollection() {
await qdrant.createCollection(COLLECTION_NAME, {
vectors: { size: 1536, distance: 'Cosine' }
});
}
Why text-embedding-3-small over text-embedding-3-large? The small model uses fewer dimensions, and for many headless CMS search use cases, it's a practical default. The large model outputs 3,072-dimensional vectors, which increases storage requirements and can make similarity search heavier.
The large model is worth considering if your content is highly technical or domain-specific, where finer semantic distinctions may matter. For general articles, blog posts, and documentation, the small model is a sensible starting point.
The Embeddings API accepts arrays of strings for batch processing, which is more efficient than sending single calls one at a time. Each input has an 8,192 token limit, but if you're chunking to about 3,000 characters, you'll stay well under that.
const { createHash } = require('crypto');
// Qdrant point IDs must be unsigned integers or UUIDs, so derive a
// deterministic UUID from each chunk ID instead of using the raw string
function chunkIdToUuid(chunkId) {
const h = createHash('md5').update(chunkId).digest('hex');
return `${h.slice(0, 8)}-${h.slice(8, 12)}-${h.slice(12, 16)}-${h.slice(16, 20)}-${h.slice(20, 32)}`;
}
async function generateAndStoreEmbeddings(chunks) {
// Process in batches during ingestion
const batchSize = 20;
for (let i = 0; i < chunks.length; i += batchSize) {
const batch = chunks.slice(i, i + batchSize);
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: batch.map(c => c.content.replace(/\n/g, ' ').trim()),
});
const points = batch.map((chunk, idx) => {
const chunkId = `${chunk.metadata.documentId}-${i + idx}`;
return {
id: chunkIdToUuid(chunkId),
vector: embeddingResponse.data[idx].embedding,
payload: {
chunkId,
text: chunk.content.substring(0, 1000),
title: chunk.metadata.title,
url: chunk.metadata.url,
updatedAt: chunk.metadata.updatedAt,
}
};
});
await qdrant.upsert(COLLECTION_NAME, { points });
console.log(`Indexed batch ${i / batchSize + 1}`);
}
}
Add a retry wrapper for rate limits. HTTP 429 errors are what you'll see when you hit limits. Swap it in for the direct embeddings.create call above:
async function createEmbeddingWithRetry(input, model = "text-embedding-3-small", maxRetries = 3) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
return await openai.embeddings.create({ model, input });
} catch (error) {
if (error.status === 429 && attempt < maxRetries - 1) {
const delay = 1000 * Math.pow(2, attempt);
console.log(`Rate limited. Retrying in ${delay}ms...`);
await new Promise(resolve => setTimeout(resolve, delay));
} else {
throw error;
}
}
}
}
Tie it all together and run the ingestion:
async function ingest() {
await createCollection();
const chunks = await extractAndChunkContent();
console.log(`Extracted ${chunks.length} chunks. Generating embeddings...`);
await generateAndStoreEmbeddings(chunks);
console.log('Ingestion complete.');
}
ingest().catch(console.error);
Run it with:
STRAPI_API_TOKEN=your-token OPENAI_API_KEY=sk-... node scripts/ingest.js
Build the RAG Search Endpoint in Strapi
Now build the search endpoint inside Strapi using a custom controller. This follows Strapi's standard createCoreController pattern.
First, configure your API keys using Strapi environment configuration:
# .env
OPENAI_API_KEY=sk-...
QDRANT_URL=http://localhost:6333
Create the route file:
// src/api/rag-search/routes/rag-search.js
module.exports = {
routes: [
{
method: 'POST',
path: '/rag-search',
handler: 'rag-search.search',
config: {
// 'global::is-authenticated' must exist under src/policies; use [] to leave the route open
policies: ['global::is-authenticated'],
middlewares: [],
},
},
],
};
Strapi organizes custom APIs under src/api/[api-name]/ with separate directories for routes, controllers, and services. The route file maps HTTP methods and paths to controller actions. Here, a POST to /rag-search calls the search method on the rag-search controller. The config.policies array lets you attach authentication or authorization checks that run before the handler executes.
For production use, consider hardening the policy chain and adding rate limiting via the Upstash rate limit plugin, though the plugin is currently marked as experimental.
Now the controller. This is where the RAG pipeline comes together:
// src/api/rag-search/controllers/rag-search.js
const { createCoreController } = require('@strapi/strapi').factories;
const OpenAI = require('openai');
const { QdrantClient } = require('@qdrant/js-client-rest');
// Note: createCoreController expects a content type with this UID. If your
// rag-search API has no content type behind it, export a plain object with
// the same `search` method instead.
module.exports = createCoreController('api::rag-search.rag-search', ({ strapi }) => ({
async search(ctx) {
try {
const { query } = ctx.request.body;
if (!query || typeof query !== 'string') {
return ctx.throw(400, 'Valid query string is required');
}
if (query.length > 1000) {
return ctx.throw(400, 'Query exceeds maximum length of 1000 characters');
}
// Instantiated per request for simplicity; consider hoisting these
// clients to module scope so they aren't rebuilt on every call
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });
// 1. Embed the query
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query.replace(/\n/g, ' ').trim(),
});
const queryEmbedding = embeddingResponse.data[0].embedding;
// 2. Search for relevant chunks
const searchResults = await qdrant.search('strapi-content', {
vector: queryEmbedding,
limit: 5,
with_payload: true,
});
if (!searchResults.length) {
return ctx.send({
data: { answer: 'No relevant content found.', sources: [] }
});
}
// 3. Build context from retrieved chunks
const context = searchResults
.map((m, i) => `[Source ${i + 1}: ${m.payload.title}]\n${m.payload.text}`)
.join('\n\n---\n\n');
// 4. Generate grounded answer
const completion = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [
{
role: 'system',
content: `You are a helpful assistant that answers questions based on the provided documentation.
IMPORTANT RULES:
1. Answer ONLY using information from the context below
2. If the context doesn't contain enough information, say so
3. Cite which source(s) you used by referencing [Source N]
4. Do not make up facts not present in the context
Context:
${context}`,
},
{ role: 'user', content: query },
],
temperature: 0.1,
max_tokens: 800,
});
return ctx.send({
data: {
answer: completion.choices[0].message.content,
sources: searchResults.map(m => ({
score: m.score,
title: m.payload.title,
url: m.payload.url,
excerpt: m.payload.text?.substring(0, 200) + '...',
})),
},
});
} catch (error) {
if (error.status === 429) {
return ctx.throw(429, 'AI service rate limit exceeded. Please retry shortly.');
}
strapi.log.error('RAG search error:', error);
return ctx.throw(500, 'An error occurred during search');
}
},
}));
The limit: 5 parameter in the Qdrant search controls how many chunks are retrieved as context for the LLM. Five chunks is a useful starting point for many implementations because it gives the model multiple relevant passages without making the prompt unnecessarily large. Depending on your chunk size and content density, you can experiment with nearby values. If your chunks are small, you may need more. If they're larger, fewer may be enough.
In production, it helps to filter out low-relevance results before feeding them to the LLM. Each result from Qdrant includes a score field, and lower-scoring results may be noise rather than signal. Adding a score filter can help keep the context cleaner:
const relevant = searchResults.filter(r => r.score > 0.7);
Treat 0.7 as an example starting point and adjust based on your content. If you're getting too few results, lower it. If answers seem off-topic, raise it.
The prompt engineering here is deliberate. Setting temperature: 0.1 keeps responses more constrained, which is useful for retrieval-based answers. Temperature affects how variable the model's output is. For RAG, where you want answers grounded in retrieved content, a lower temperature helps reduce improvisation beyond what the context supports. The system prompt's explicit rules may help reduce hallucination, but their effect is limited and not reliably established. The [Source N] citation pattern lets your frontend link back to original content.
Test it with:
curl -X POST http://localhost:1337/api/rag-search \
-H "Content-Type: application/json" \
-d '{"query": "How do I authenticate users?"}'Production Optimizations
A few things are worth considering before shipping this:
- Reduce dimensions. Pass dimensions: 512 to openai.embeddings.create() with text-embedding-3-small to cut vector storage by 66%. Update your Qdrant collection's size accordingly.
- Use encoding_format: "base64". This can reduce API response payload size for embedding requests.
- Rate limit your endpoint. The express-rate-limit package or Upstash rate limit plugin can prevent abuse and control OpenAI costs.
- Handle error codes gracefully. Common API errors should be logged server-side, while clients get generic errors.
- Add streaming. The streaming API lets you pipe tokens to the frontend as they're generated, so users aren't staring at a spinner.
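For the streaming point above, the shape of the consumption loop matters more than the transport. Here is a sketch of a helper that forwards tokens from an OpenAI streaming completion to a callback; it assumes the chunk shape of the chat completions streaming API (choices[0].delta.content), and the function name is illustrative:

```javascript
// Forward tokens from a chat-completions stream to a callback and return
// the assembled answer. `stream` is the async iterable returned by
// openai.chat.completions.create({ ..., stream: true }).
async function forwardTokens(stream, onToken) {
  let answer = '';
  for await (const chunk of stream) {
    const token = chunk.choices?.[0]?.delta?.content || '';
    if (token) {
      answer += token;
      onToken(token); // e.g. write to an SSE or chunked HTTP response
    }
  }
  return answer;
}
```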
For existing Strapi plugins in this space, the Strapi blog post on building a semantic search plugin with OpenAI is worth reading if you want to package your implementation as a reusable Strapi plugin.
Keep Embeddings in Sync with Lifecycle Hooks
Your vector index goes stale whenever someone publishes or updates content in the Admin Panel unless you synchronize it automatically. Lifecycle hooks can trigger embedding updates on content create, update, and delete events. Note that in Strapi v5, lifecycle hooks fire based on Document Service API methods, and Strapi now generally recommends document service middleware for most use cases.
Register hooks in the bootstrap function:
// src/index.js
const OpenAI = require('openai');
const { QdrantClient } = require('@qdrant/js-client-rest');
module.exports = {
async bootstrap({ strapi }) {
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const qdrant = new QdrantClient({ url: process.env.QDRANT_URL });
strapi.db.lifecycles.subscribe({
models: ['api::article.article'],
async afterCreate(event) {
const { result } = event;
await upsertEmbedding(openai, qdrant, result);
},
async afterUpdate(event) {
const { result } = event;
await upsertEmbedding(openai, qdrant, result);
},
async beforeDelete(event) {
const { params } = event;
// Note: params.where.id is the internal database id. If your chunk IDs
// were built from documentId, fetch the entry first and match on that.
const id = params.where.id.toString();
// Page through the collection and collect the Qdrant point IDs whose
// chunkId payload belongs to this entry
let offset;
const pointIds = [];
do {
const res = await qdrant.scroll('strapi-content', {
limit: 100,
with_payload: true,
offset,
});
for (const point of res.points) {
if (point.payload?.chunkId?.startsWith(`${id}-`)) pointIds.push(point.id);
}
offset = res.next_page_offset;
} while (offset !== null && offset !== undefined);
if (pointIds.length) {
await qdrant.delete('strapi-content', { points: pointIds });
}
strapi.log.info(`Deleted embeddings for article ${id}`);
},
});
},
};
// Qdrant point IDs must be unsigned integers or UUIDs, so derive a
// deterministic UUID from the chunk ID
const { createHash } = require('crypto');
function chunkIdToUuid(chunkId) {
const h = createHash('md5').update(chunkId).digest('hex');
return `${h.slice(0, 8)}-${h.slice(8, 12)}-${h.slice(12, 16)}-${h.slice(16, 20)}-${h.slice(20, 32)}`;
}
async function upsertEmbedding(openai, qdrant, entry) {
try {
// blocksToPlainText is the helper from the ingestion script; copy it into
// this file or extract it into a shared module
const textToEmbed = [entry.title, entry.description, entry.body || blocksToPlainText(entry.blocks)]
.filter(Boolean)
.join('\n\n');
if (!textToEmbed.trim()) return;
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: textToEmbed.replace(/\n/g, ' ').trim(),
});
const chunkId = `${entry.documentId || entry.id.toString()}-0`;
await qdrant.upsert('strapi-content', {
points: [{
id: chunkIdToUuid(chunkId),
vector: embeddingResponse.data[0].embedding,
payload: {
chunkId,
text: textToEmbed.substring(0, 1000),
title: entry.title,
url: `/articles/${entry.slug}`,
updatedAt: entry.updatedAt,
},
}]
});
strapi.log.info(`Upserted embedding for: ${entry.title}`);
} catch (error) {
strapi.log.error('Embedding upsert failed:', error.message);
}
}
The try/catch here matters: it swallows a failed embedding call so that saving content in the Admin Panel still succeeds even when the embedding API is down (unless the error is re-thrown). Handle that trade-off intentionally. For some teams, failing silently is acceptable. Others might prefer a webhook approach that processes embeddings asynchronously in a separate service.
The beforeDelete hook uses the entry's ID to remove the corresponding vectors from Qdrant, keeping the index clean. Without it, deleted articles would keep appearing in search results, which is a confusing experience for users.
Be aware of bulk operations. In Strapi v5, bulk actions like createMany, updateMany, and deleteMany do not trigger lifecycles at all when called through the Document Service API, so bulk imports using these methods will silently skip your embedding sync. And even when hooks do fire per record, a large import means one OpenAI call per entry, which can quickly hit rate limits. For bulk imports, re-run the batched ingestion script or use a dedicated import workflow designed for bulk operations.
If you're on Strapi 5, also review the Document Service Middleware docs. Lifecycle hooks still work, but Document Service Middleware may be a better fit depending on your architecture.
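To give a feel for the middleware alternative, here is a sketch of the wiring. It assumes Strapi 5's documented strapi.documents.use() registration point and the context fields (uid, action) described in the Document Service Middleware docs; dependencies are injected so the logic is testable outside Strapi:

```javascript
// Register a document service middleware that re-embeds an article after
// any create/update/publish completes. `documents` is strapi.documents,
// `onArticleWrite` is your embedding upsert, `log` is strapi.log.
function registerEmbeddingSync(documents, onArticleWrite, log) {
  documents.use(async (context, next) => {
    const result = await next(); // let the write complete first
    if (
      context.uid === 'api::article.article' &&
      ['create', 'update', 'publish'].includes(context.action)
    ) {
      // Fire-and-forget so indexing latency never blocks the editor
      Promise.resolve()
        .then(() => onArticleWrite(result))
        .catch(err => log.error('Embedding sync failed:', err.message));
    }
    return result;
  });
}

// In bootstrap, something like:
// registerEmbeddingSync(strapi.documents,
//   entry => upsertEmbedding(openai, qdrant, entry), strapi.log);
```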
Troubleshoot Common Issues
Empty search results. Check that your Qdrant collection name matches between ingestion and query. It should be strapi-content in both the ingestion script and the controller. Verify embeddings were actually stored by hitting Qdrant's REST API directly: GET http://localhost:6333/collections/strapi-content should return a points_count greater than zero. If the count is zero, your ingestion script didn't complete successfully.
Irrelevant results returned. Your chunks may be too large or contain too much boilerplate. Review the plain text output of your blocksToPlainText or markdownToPlainText functions and ensure they strip navigation elements, footers, and repeated content. Try reducing maxChars in the chunking function from 3000 to 1500 and re-running ingestion.
OpenAI 401 errors. Your OPENAI_API_KEY environment variable is missing or invalid. Verify it's set in your .env file and that Strapi is loading it. Restart the server after any .env changes. You can test the key independently with a simple curl call to the OpenAI API.
Lifecycle hooks not firing. In Strapi 5, ensure you're subscribing to the correct model UID format (api::article.article). Check the Strapi server logs on startup for any errors in the bootstrap function. If you don't see your strapi.log.info messages after creating or updating content, the subscription may not have registered.
Qdrant connection refused. Make sure your Qdrant Docker container is running (docker ps) and the port mapping is correct (-p 6333:6333). If you're using Qdrant Cloud, verify the QDRANT_URL includes the full URL with protocol and that any API key is configured correctly.
Slow search responses. Most latency comes from the two OpenAI API calls, embedding generation and chat completion. To reduce perceived latency, consider caching frequent query embeddings so repeated or similar questions skip the embedding step. You can also use streaming responses so users see the answer forming in real time rather than waiting for the full completion to finish before displaying anything.
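The caching idea can be as simple as memoizing the embed call on the normalized query string. A sketch with an unbounded in-memory Map (swap in an LRU or Redis for production); embedFn stands in for the OpenAI call and is an assumption of this snippet:

```javascript
// Cache query embeddings so repeated or identically-worded questions
// skip the OpenAI embedding call entirely.
function createEmbeddingCache(embedFn) {
  const cache = new Map(); // normalized query -> embedding vector
  return async function cachedEmbed(query) {
    const key = query.replace(/\s+/g, ' ').trim().toLowerCase();
    if (cache.has(key)) return cache.get(key);
    const vector = await embedFn(key);
    cache.set(key, vector);
    return vector;
  };
}
```

In the controller, you might wrap the call as createEmbeddingCache(async q => (await openai.embeddings.create({ model: 'text-embedding-3-small', input: q })).data[0].embedding).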
Stale results after content updates. If lifecycle hooks are configured but search results still show old content, check that the documentId used during ingestion matches the ID used in lifecycle hook upserts. A mismatch means updates create new vectors instead of overwriting existing ones, leaving outdated vectors in the index alongside the new ones.
What You've Built and Where to Go Next
You now have a working RAG pipeline that turns your Strapi content into an intelligent search system. The pieces are straightforward: extract content via the REST API, embed it with OpenAI, store vectors in Qdrant, and query the whole thing through a custom Strapi endpoint. Content create, update, and delete events can be used to keep external search indexes in sync without full manual re-indexing, though in Strapi 5 the recommended approach for most cases is document service middleware rather than lifecycle hooks.
From here, you could extend this with hybrid search, combining keyword and semantic matching, multi-language content support, or a frontend chat interface using the Vercel AI SDK. The core architecture stays the same.
If you're evaluating how Strapi fits into your AI content stack, the integrations page covers how it connects with frontends like Next.js, and the SDK comparison on the Strapi blog can help you choose the right abstraction layer for your frontend and headless CMS workflow.
Get Started in Minutes
Run npx create-strapi-app@latest in your terminal and follow our Quick Start Guide to build your first Strapi project.