The Different RAG Approaches in 2026
In this first blog article of mine, I want to talk about RAG: what it is, why it's still relevant, and everything you have to think about if you want to set one up today.
Everybody uses ChatGPT, Claude or whatever, but these models are of limited help if they don't have access to your specific business data. So the idea of RAG (Retrieval-Augmented Generation) is to give your LLM access to an external knowledge base, so it can fetch relevant documents and use them as context to answer questions.
In practice, this means you take a user's question, turn it into a vector embedding, search a database of pre-indexed document chunks for the most similar ones, and feed those to an LLM alongside the question. That's the baseline.
But I've been building RAG applications for the better part of the last two or three years now, and there's a lot more to it than just that.

"Wait... but isn't RAG dead?"
LLMs evolve rapidly, and it's easy to be fooled about which tech is just "hype" and which is actually here to stay: first we had LLMs, then we figured out prompt engineering, then came RAG, function calling, MCP servers, and now AI agents with massive context windows. I actually read a cool article arguing that this whole retrieval thing is just one of the many tools you should give an AI agent, and let it figure out when to use it (what people now call "context engineering", the latest buzzword), evaluate results, iterate, etc.
So yeah, even though the original RAG (single vector search, top-k results stuffed into a prompt) might be limited, you will still need to index your business data for retrieval. That's why I think it's important to understand how basic RAG works: its core principles (chunking, embedding, retrieval, reranking) are the same whether you're doing a simple RAG setup or a more complex agentic architecture. And of course, you don't want to dump everything into the context window: it's slow, expensive, and a privacy nightmare.
So RAG isn't dead, it has become infrastructure. Like databases. Nobody writes blog posts titled "Are Databases Dead?" because that would be absurd, and I feel like RAG is on the same trajectory here.
The challenges of building a RAG
There's a huge gap between knowing what RAG is and actually getting it to work well. Basically, the challenges of building a RAG fall into two phases:
- indexation (preparing your data for retrieval)
- query time (actually retrieving and generating)
Indexation pipeline
This pipeline, which processes your documents and prepares them for retrieval, is probably the more important of the two phases. It happens offline, before a user ever asks a question. Typically, you would run it as a cron job if your corpus changes over time, or as a one-time batch job if your documents are static.
Getting a clean corpus
Your RAG can only be as good as the data you feed it. If your source corpus is a mess of PDFs with weird or inconsistent formatting, or scanned documents with no OCR, then your chunks will be garbage, your embeddings will be garbage, and your answers will be garbage.
What you want is a clean, structured text format: Markdown. That's because LLMs already speak Markdown natively: ChatGPT actually answers you with # headings, **bold**, `code blocks`, bullet lists etc. that are simply rendered in a more human-friendly way in the UI. If you feed your RAG pipeline Markdown documents, the query retrieval and generation will benefit from that structural clarity.

So this whole data cleaning step is very important. I recommend you check out MarkItDown: it's a Python tool from Microsoft that converts pretty much any document (PDFs, Docs, HTML, etc.) into clean Markdown, and you can even pass it your LLM API key to have it describe images. I hear Docling is good too, but MarkItDown has always been enough for the documents I've worked with.
So now, you have nice Markdown. But you will also need to clean it up: remove navigation elements, footers, really anything that isn't semantically relevant. This is where you find out that having documents that are all structured in a consistent way is a huge help. Otherwise, you will need to say "apply this regex to remove the footer, but only on documents XYZ...". Meh.
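For that regex-style cleanup, here's a minimal sketch. The footer and navigation patterns are made up for illustration; yours will depend entirely on how your corpus is structured:

```python
import re

# Hypothetical patterns for one family of documents: a repeated page
# footer and markdown navigation links. Adjust these per corpus.
FOOTER_RE = re.compile(r"^Page \d+ of \d+\s*$", re.MULTILINE)
NAV_RE = re.compile(r"^\[(?:Home|Back to top)\]\(.*?\)\s*$", re.MULTILINE)

def clean_markdown(text: str) -> str:
    text = FOOTER_RE.sub("", text)
    text = NAV_RE.sub("", text)
    # Collapse the blank lines left behind by the removals
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```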
Chunking strategy
Once you have clean text, you need to split it into chunks: pieces of text small enough to be useful context, but large enough to preserve meaning. There are several strategies for this that I know of, and the one to choose will depend on how well your Markdown is structured and how long your documents are.
Regardless of which strategy you pick, I found this one trick that consistently improves retrieval quality: prepend a document context to each chunk. The idea is simple: before embedding a chunk, you add a header to its content that tells the model where this chunk came from.
```
[Source: HR Policies > Remote Work > 2026 Guidelines.pdf]

Employees working remotely must ensure a stable internet
connection and a dedicated workspace. Requests for remote
work equipment must be submitted through the IT portal
at least two weeks in advance...
```

Why does adding this context help?
Because when you embed a chunk like this, the vector captures not only the content but also the origin. So, a query like "What are the remote work equipment policies?" will now match better: the embedding already "knows" that this chunk lives in an HR policy document about remote work.
Without that prefix, the chunk is just a paragraph about internet connections and equipment requests, with no indication it came from an official policy. So that's a small change that makes retrieval way more precise, especially when you have documents with overlapping topics across different departments.
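In code, this trick is tiny. A sketch (the function name is mine):

```python
def contextualize(chunk_text: str, source_path: str) -> str:
    """Prepend the chunk's origin so the embedding captures it too."""
    return f"[Source: {source_path}]\n\n{chunk_text}"

# At indexing time, embed the contextualized text:
# vector = embedding_model.encode(contextualize(chunk, doc.path))
```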
Choosing the right embedding model
Next, choosing the right embedding model is important because it determines how your chunks are represented in vector space. For most use cases, a general-purpose model like text-embedding-3-large or an open-source option from the MTEB leaderboard will work just fine.
But sometimes it won't: I remember working with a team that was building a RAG over internal French technical docs, and they found that these "off-the-shelf" models kept confusing company-specific jargon with public-domain meanings. So they ended up fine-tuning their own model from CamemBERT. The way you do this is by creating a dataset of triples: (question, relevant chunk, irrelevant chunk). For example:
```
Question: How does a transformer work in Natural Language Processing?
Relevant chunk: A transformer uses attention mechanisms to weigh the importance of different words in a sentence.
Irrelevant chunk: The traditional apple pie recipe involves a flaky crust and a cinnamon-spiced filling.
```

And apparently, they got pretty good results from this. So if your domain has its own vocabulary, it's worth benchmarking seriously before assuming a general model will cut it.
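If you go down that road with a library like sentence-transformers, the training data is just a list of such triples. A minimal sketch of the dataset shape (the examples are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Triple:
    question: str
    positive: str  # a chunk that answers the question
    negative: str  # a chunk that does not

training_set = [
    Triple(
        question="How does a transformer work in Natural Language Processing?",
        positive="A transformer uses attention mechanisms to weigh the importance of different words in a sentence.",
        negative="The traditional apple pie recipe involves a flaky crust and a cinnamon-spiced filling.",
    ),
    # ... ideally hundreds more, mined from your own docs and query logs
]
```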
Picking a vector database
I've used Qdrant extensively for local development (also a bit of Milvus which is good too) because it's easy to spin up with Docker. In production, I've been using Azure AI Search, and also a little bit of Pinecone (SaaS version) during an internship in Montreal because it's really easy to use for prototyping.
So which one to choose? Well, each of them does the basics (vector CRUD, k-NN search), but you will need to ask yourself if you need more advanced features:
- do I need to be able to filter vectors by metadata during retrieval?
- do I need hybrid search (combining vector similarity with keyword matching)?
- is cosine similarity enough for my use case or do I need another similarity metric?
Also, some use cases I stumbled upon involve having a custom scoring function during retrieval: for example, you might want to get vectors that are relevant to the question but also more recent, or more important based on some business logic. I'll show a concrete example of this later but for now, just know that the vector database you choose will constrain what retrieval strategies you can perform, so think about it early.
Now let's look at how the actual indexation pipeline is structured. There are a few main approaches.
Standard indexation
This is the straightforward way: you take your documents, clean them, chunk them, embed each chunk, and store them in your vector database.
Standard indexation pipeline
In pseudocode:
```python
for doc in documents:
    text = markdown_and_clean(doc)
    chunks = chunk_text(text, size=512, overlap=50)
    for chunk in chunks:
        vector = embedding_model.encode(chunk)
        vector_db.insert(vector=vector, content=chunk, metadata=doc.metadata)
```

This is a good starting point, and it works. But it may be limited for your use case, because at query time you will be searching for chunks that are *similar to* the question, not chunks that *answer* it.
Question-based indexation
So another approach is: instead of indexing the raw text, you generate synthetic questions for each chunk and index those. You then embed these generated questions and store them in the vector database, but the content of each chunk is still the original text. It's a bit more expensive, but, at query time, you will retrieve chunks that are semantically closer to the user's question, which will fit better.
Question-based indexation
The generation prompt you would use to generate the questions is something like:
```
Given the following text, generate 3-5 questions that this text answers.
Only generate questions that are directly answered by the text.

Text:
{chunk_content}

Questions:
```

Of course, you will want to use a structured output format for this kind of prompt, to ensure you always parse the questions consistently. I've never actually implemented question-based indexation myself. I first heard about it at DevFest Nantes 2025 and thought it was worth sharing here.
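Since I haven't built this one, here's only a rough sketch of what the indexing loop might look like. All the callables are stand-ins for your own LLM client, embedder, and database insert:

```python
import json

QUESTION_PROMPT = (
    "Given the following text, generate 3-5 questions that this text answers.\n"
    "Only generate questions that are directly answered by the text.\n"
    'Respond in JSON: {"questions": ["..."]}\n'
    "\nText:\n{chunk}"
)

def index_chunk_by_questions(chunk, llm_invoke, embed, insert):
    """Embed the generated questions, but store the ORIGINAL chunk as content."""
    raw = llm_invoke(QUESTION_PROMPT.replace("{chunk}", chunk))
    questions = json.loads(raw)["questions"]
    for q in questions:
        insert(vector=embed(q), content=chunk)
    return questions
```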
Summary-based indexation
Another thing you can do is generate summaries for each chunk and index those instead, while keeping the original texts in the vector database. This helps with more general queries like "What's the company's approach to data privacy?", as opposed to specific queries like "What's the retention period for customer data in the EU?".
However, this might not be suitable if you're working with very precise vocabulary, like legal documents or medical records: for these, you might lose critical nuance in the summary.
Query time
So now you have nice indexed chunks. But query time is where everything comes together: it's what happens when a user actually asks something.
Query reformulation
Users suck at writing queries. They're vague, ambiguous, or they use completely different words than what's in your documents. So query reformulation is really important before doing the vector search. There are several ways you can go about this:
Query rewriting: you ask the LLM to rephrase the user question into a more precise query. If you're building a conversational bot, this is also where you account for previous messages. For example the user might just say "what about the second one?" and the rewriter needs to resolve that into a self-contained query from the conversation history.
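A minimal sketch of building such a rewrite prompt (the wording is my own; in a real bot you'd send this to your LLM and use its output as the search query):

```python
REWRITE_PROMPT = """Given the conversation history and the user's latest message,
rewrite the message as a single self-contained search query.

History:
{history}

Latest message: {message}

Rewritten query:"""

def build_rewrite_prompt(history: list, message: str) -> str:
    # history: list of "Speaker: text" strings from previous turns
    return REWRITE_PROMPT.format(history="\n".join(history), message=message)
```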
HyDE (Hypothetical Document Embeddings): it's a bit like that "question-based approach" I mentioned earlier. You ask the LLM to write a hypothetical answer to the user question, and do your retrieval with it. The idea is that a hypothetical answer is closer in embedding space to real answers than the question itself. I also heard about this one at DevFest but haven't tried it yet, so I'm not sure how well it works in practice.
Retrieval and reranking
Then, you need to ask yourself: how many chunks should I retrieve?
- Too few, and you might miss relevant context that the model needs to answer the question.
- Too many, and you might pollute your final prompt with garbage or irrelevant information.
A common pattern is to over-retrieve then rerank: for example, fetch the top 20-50 chunks by vector similarity, then use a reranker model (like Cohere's Rerank models) to re-score them with much higher accuracy. Then, you can keep the top 5-10 after reranking, or just use a relevance threshold.
At query time, using reranking is a bit slower, but way more accurate so don't forget about it. Also, if your vector database supports hybrid search, you can combine vector similarity with keyword matching (take a look at BM25) to get better results.
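The over-retrieve-then-rerank pattern looks roughly like this (`search` and `rerank` are stand-ins for your vector store and reranker; the default numbers are just reasonable starting points):

```python
def over_retrieve_then_rerank(query, search, rerank,
                              k_retrieve=50, k_keep=5, threshold=0.3):
    """Fetch a wide candidate set cheaply, then re-score it accurately."""
    candidates = search(query, top_k=k_retrieve)   # fast vector search
    scored = rerank(query, candidates)             # list of (chunk, score)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep the best few, dropping anything below the relevance threshold
    return [chunk for chunk, score in scored[:k_keep] if score >= threshold]
```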
Prompt structure
So now you have your relevant chunks, but how you present them to the LLM matters too. Typically, a good structure is something like this:
```
Use the following context to answer the user's question.
If the context doesn't contain enough information, say so.
Do not make up information that isn't in the context.

Context:
---
[Chunk 1: {title} - {source}]
{content}

[Chunk 2: {title} - {source}]
{content}
---

Question: {user_question}
```

Of course, this is a very basic structure. You might want to give your model:
- a specific identity,
- instructions on how to format the answer,
- the length or tone of the answer,
- examples of good answers, etc.
As you can see, the chunks are numbered and include source metadata. These can help the LLM keep track of where information is coming from and to better cite sources in its answer.
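Assembling that prompt is straightforward. A sketch, assuming your chunks are dicts with title/source/content keys:

```python
def build_prompt(chunks, user_question):
    # chunks: list of dicts with "title", "source", "content" (assumed shape)
    blocks = [
        f"[Chunk {i}: {c['title']} - {c['source']}]\n{c['content']}"
        for i, c in enumerate(chunks, start=1)
    ]
    return (
        "Use the following context to answer the user's question.\n"
        "If the context doesn't contain enough information, say so.\n"
        "Do not make up information that isn't in the context.\n\n"
        "Context:\n---\n" + "\n\n".join(blocks) + "\n---\n\n"
        f"Question: {user_question}"
    )
```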
So now, just like indexation before, let's go over some of the different architectural approaches to the query side.
Standard RAG
The standard RAG approach is simple: you embed the query, retrieve relevant chunks, stuff them in context, and generate the answer.
Standard RAG query
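The whole path fits in a few lines. Every callable here is a placeholder for your own embedder, vector store, and LLM:

```python
def answer(user_question, embed, search, generate):
    """Minimal single-shot RAG: embed, retrieve, stuff, generate."""
    query_vector = embed(user_question)
    chunks = search(query_vector, top_k=5)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {user_question}"
    return generate(prompt)
```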
This may work fine for single-step factual questions. But this "single retrieval" approach will likely fail for anything that requires reasoning across different domains of your documentation, or for questions that are ambiguous and require follow-up.
You're limited by what you retrieve in the first place
For example, if a user asks "What are the guidelines for remote work equipment?", the model might retrieve a chunk from the HR policy that talks about remote work, but it might miss another relevant chunk from the IT policy that lists the specific equipment allowed. So the model will generate an answer based on incomplete information.
RAG with "sub-queries"
I don't know if there's a name for this, but that's what I call it. Basically, from the user question (and maybe the conversation history), you ask the LLM to generate multiple sub-queries that cover different aspects of the question. For example, for the same question about remote work equipment, the LLM might generate sub-queries like:
- "What are the remote work policies in the HR guidelines?"
- "What equipment is allowed for remote work in the IT policy?"
- "Are there any specific requirements for remote work equipment in the EU regulations?"
And then, you proceed to retrieval for each of them in parallel.
RAG with sub-queries
Then, you take all the retrieved chunks from the different sub-queries, deduplicate them, rerank them together, and feed the top ones to the LLM for generation just like before. This approach works great for me, and that's how most of the RAG applications I've built are structured.
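The fan-out, deduplication, and merged reranking can be sketched like this (`generate_subqueries`, `search`, and `rerank` are stand-ins for your own components; I've kept the loop sequential for clarity, but the searches can run in parallel):

```python
def subquery_rag(user_question, generate_subqueries, search, rerank, top_n=5):
    """Fan retrieval out over LLM-generated sub-queries, then merge."""
    subqueries = generate_subqueries(user_question)
    seen, candidates = set(), []
    for sq in subqueries:
        for chunk in search(sq):
            if chunk not in seen:   # deduplicate across sub-queries
                seen.add(chunk)
                candidates.append(chunk)
    # Rerank everything against the ORIGINAL question, not the sub-queries
    scored = rerank(user_question, candidates)
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in scored[:top_n]]
```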
Agentic RAG
Now, agentic RAG is about as good as it gets. Instead of having a fixed retrieval strategy, you give the model the ability to control the retrieval loop itself. It decides what to search for, evaluates the results, and either generates a response or refines its query and tries again.
Agentic RAG
In code, your agent loop looks roughly something like this:
```python
context = []
query = user_question

for step in range(max_iterations):
    results = retriever.search(query, top_k=10)
    context.extend(results)

    evaluation = llm.evaluate(
        question=user_question,
        context=context,
        prompt="Is this context sufficient to answer the question? "
               "Respond with SUFFICIENT or NEED_MORE: <search query>",
    )

    if evaluation.startswith("SUFFICIENT"):
        break
    query = evaluation.replace("NEED_MORE:", "").strip()

answer = llm.generate(question=user_question, context=context)
```

However, in practice, you probably want to use an agent framework like LangGraph (from LangChain) to do this, because it:
- handles most of the orchestration for you (keeping track of the context, managing iterations, etc.)
- lets you use observability tools like LangSmith to see exactly what the agent is doing at each step (more on that later).
- provides a lot of flexibility (handles streaming responses, makes it easy to add tools or sub-agents, etc.)
So clearly, agentic RAG is more powerful, but it may also be slower and more expensive because of the multiple retrieval iterations. Also, you can't predict exactly what the agent will search for, so you have to make sure your prompts guide it properly.
Testing and observability
As we saw throughout the article, building a RAG is a process of trial and error that involves many moving parts: sometimes you tweak the chunking, then swap the embedding model, change the reranking threshold, the prompts, etc. So you definitely need to know whether each change actually made things better or worse.
Benchmarking
I mostly use RAG benchmarks as non-regression tests, that is, to make sure a change didn't kill performance on questions that were working before. What you want is a set of question/answer pairs where you know the answer is correct (even better if you also have the relevant chunks).
Building this benchmark is a lot of work, but it pays off: without it, you have no idea whether your changes are actually improving the system. Start with 20-30 questions and make sure they are balanced to match your documentation: if your corpus is 30% HR policies, 30% IT policies, and 40% legal documents, then your benchmark should roughly reflect that distribution.
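A sketch of what such a benchmark might look like, with a quick sanity check on its category balance (the entries and categories are illustrative):

```python
from collections import Counter

# Illustrative entries; a real benchmark would have 20-30+ with real answers
benchmark = [
    {"question": "What is the remote work policy?", "answer": "...", "category": "hr"},
    {"question": "How do I request a laptop?", "answer": "...", "category": "it"},
    {"question": "What is the EU data retention period?", "answer": "...", "category": "legal"},
]

def category_distribution(benchmark):
    """Share of questions per category, to compare against your corpus."""
    counts = Counter(item["category"] for item in benchmark)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}
```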
LLM-as-a-judge
Once you have that benchmark, you can test your RAG system against it every time you make a significant change. In practice, you feed an LLM the question, the generated answer, and the ground truth answer from your benchmark, and ask it to score the generated answer on dimensions like:
- Correctness: Does the generated answer match the ground truth?
- Completeness: Does it cover all the key points from the ground truth?
- Faithfulness: Does it only use information from the provided context, without hallucinating?
In practice, it looks something like this:
```python
import json

def llm_judge(question: str, answer: str, ground_truth: str) -> dict:
    prompt = f"""You are evaluating an AI-generated answer.

Question: {question}
Generated Answer: {answer}
Ground Truth: {ground_truth}

Score the answer on each dimension (1-5):
- Correctness: Does it match the ground truth?
- Completeness: Does it cover all key points?
- Faithfulness: Does it only use information from the provided context?

Respond in JSON: {{"correctness": int, "completeness": int, "faithfulness": int}}"""

    return json.loads(llm.invoke(prompt))
```

You can also use frameworks like Ragas that automate this process and provide nice dashboards to track your RAG performance over time.
Observability
If you use such agentic RAG approaches, you will need to see what's actually happening inside the pipeline in production. Tools like LangSmith and Langfuse let you trace everything:
- what query came in,
- what intermediate steps the agent took,
- what chunks were retrieved,
- what prompt was constructed,
- how long each step took (useful for finding bottlenecks),
- what the model generated, etc.
Langfuse is open-source and self-hostable, which is nice for production if you don't want to send data to third parties (even though it's pretty heavy to run). LangSmith is good too and has tighter integration with LangChain if you're already in that ecosystem.
Beyond search: dynamic vector databases for AI agents
Before we finish, I want to say that RAG is not limited to this whole "retrieve documents to answer questions" thing. You can also use it to give AI agents access to dynamic, real-time memories.
This is actually a real thing from the 2023 Stanford/Google paper "Generative Agents: Interactive Simulacra of Human Behavior", where 25 AI agents lived in a virtual town, talked to each other, formed relationships, made plans, and were able to recall relevant memories to guide their behavior. For this, each memory was stored as a vector in the vector database, which was therefore dynamically updated as the agents lived their lives.
Their memory database looked something like this:
| Field | Type | Description |
|---|---|---|
| vector | float[] | The embedding of the memory content |
| content | string | The memory itself (e.g., "Had a conversation with Alice about the election") |
| importance | int (1-10) | How significant this memory is |
| recency | float | Exponentially decaying score based on time since creation |
That importance score was typically LLM-generated with a prompt like:
```
On the scale of 1 to 10, where 1 is purely mundane
(e.g., brushing teeth, making bed) and 10 is
extremely poignant (e.g., a break up, college
acceptance), rate the likely poignancy of the
following piece of memory.
Memory: buying groceries at The Willows Market
and Pharmacy
Rating: <fill in>
```

So when an agent needed to take action (decide what to say, where to go, how to react), it would recall memories that were not just semantically similar to the current situation, but also recent and important. Pretty much like we do in real life, actually. The final retrieval score was a weighted combination of all three.
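If I remember correctly, the paper normalizes each signal to [0, 1] and weights the three equally; a sketch of that combination (the normalization assumption is mine):

```python
def memory_score(relevance: float, importance: float, recency: float,
                 w_rel: float = 1.0, w_imp: float = 1.0, w_rec: float = 1.0) -> float:
    """Weighted sum of the three signals, each assumed pre-normalized to [0, 1]."""
    return w_rel * relevance + w_imp * importance + w_rec * recency
```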
This is exactly the kind of use case I mentioned earlier, where you need custom server-side scoring functions. As far as I know, most vector databases can't do this natively. But I know that OpenSearch/Elasticsearch can, with their script_score queries:
```json
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "(cosineSimilarity(params.query_vector, 'vector') + 1.0) * doc['importance'].value * decayDateExp(params.origin, params.scale, params.offset, params.decay, doc['created_at'].value)",
        "params": {
          "query_vector": [0.12, -0.34, 0.56, "..."],
          "origin": "now",
          "scale": "7d",
          "offset": "0",
          "decay": 0.5
        }
      }
    }
  }
}
```

This computes the final score of each vector as a product of three signals:
- cosine similarity to the query vector (shifted by +1.0 to keep it positive),
- the `importance` value,
- an exponential time decay from `created_at`.
So a highly important, recent, and relevant memory surfaces to the top. The takeaway here is that RAG is really just a pattern for giving AI systems access to structured memory, whether that memory is a knowledge base, a conversation history, or whatever else.
Conclusion
In the end, RAG is not going anywhere. Sure, the original "embed and search" approach might be limited for most use cases, but the core idea of RAG is here to stay and can be used in many different ways like we saw.
So if you're building a RAG system, start simple and work your way up. And don't forget that the best RAG is the one you actually ship and that answers questions correctly.
