Evaluation of a RAG pipeline is challenging because it has many components. Each stage, from retrieval to generation and post-processing, requires targeted metrics. Traditional evaluation methods fall short in capturing human judgment, and many teams underestimate the effort required, leading to incomplete or misleading performance assessments.
RAG evaluation should be approached across three dimensions: performance, cost, and latency. Metrics like Recall@k, Precision@k, MRR, F1 score, and qualitative indicators help assess how well each part of the system contributes to the final output.
The optimization of a RAG pipeline can be divided into pre-processing (pre-retrieval), processing (retrieval and generation), and post-processing (post-generation) stages. Each stage is optimized locally, as global optimization is not possible due to the exponentially many choices for hyperparameters.
The pre-processing stage improves how knowledge is chunked, embedded, and stored, ensuring that user queries are clear and contextual. The processing stage tunes the retriever and generator for better relevance, ranking, and response quality. The post-processing stage adds final checks for hallucinations, safety, and formatting before displaying the output to the end user.
Retrieval-augmented generation (RAG) is a technique for augmenting the generative capabilities of a large language model (LLM) by integrating it with information retrieval techniques. Instead of relying solely on the model’s pre-trained knowledge, RAG allows the system to pull in relevant external information at the time of the query, making responses more accurate and up-to-date.
Since its introduction by Lewis et al. in 2020, RAG has become the go-to technique for incorporating external knowledge into the LLM pipeline. According to research published by Microsoft in early 2024, RAG consistently outperforms unsupervised fine-tuning for tasks that require domain-specific or recent information.
At a high level, here’s how RAG works:
1. The user poses a question to the system, known as the query, which is transformed into a vector using an embedding model.
2. The retriever pulls the documents most relevant to the query from a collection of embedded documents stored in a vector database. These documents come from a larger collection, often referred to as a knowledge base.
3. The query and retrieved documents are passed to the LLM, the generator, which generates the response grounded in both the input and the retrieved content.
In production systems, this basic pipeline is often extended with additional steps, such as data cleaning, filtering, and post-processing, to improve the quality of the LLM response.
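To make this flow concrete, here is a minimal sketch in Python. The toy embedding function, in-memory knowledge base, and stubbed generator are illustrative stand-ins for a real embedding model, vector database, and LLM call.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Placeholder embedding: hashed bag-of-words, for illustration only.
    vec = np.zeros(64)
    for token in text.lower().split():
        vec[hash(token) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# Toy knowledge base of pre-chunked documents; a real system would store
# the embeddings in a vector database.
knowledge_base = [
    "Barack Obama was the 44th president of the United States.",
    "The keto diet is a low-carbohydrate, high-fat diet.",
    "RAG combines information retrieval with text generation.",
]
kb_embeddings = np.stack([embed_text(chunk) for chunk in knowledge_base])

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank chunks by cosine similarity to the embedded query.
    scores = kb_embeddings @ embed_text(query)
    top_indices = np.argsort(scores)[::-1][:k]
    return [knowledge_base[i] for i in top_indices]

def generate(query: str, context_chunks: list[str]) -> str:
    # Stand-in for the generator: build a grounded prompt for an LLM.
    prompt = "Answer using only the context below.\n\nContext:\n"
    prompt += "\n".join(f"- {chunk}" for chunk in context_chunks)
    prompt += f"\n\nQuestion: {query}\nAnswer:"
    return prompt  # A real system would send this prompt to an LLM.

print(generate("What is RAG?", retrieve("What is RAG?")))
```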

In my experience of developing multiple RAG products, it is easy to build a RAG proof of concept (PoC) to demonstrate its business value. However, as with any complex software system, evolving from a PoC through a minimum viable product (MVP) to, eventually, a production-ready system requires thoughtful architecture design and testing.
One of the challenges that sets RAG systems apart from other ML workflows is the absence of standardized performance metrics and ready-to-use evaluation frameworks. Unlike traditional models where accuracy, F1-score, or AUC may suffice, evaluating a RAG pipeline is more subtle (and often neglected). Many RAG product initiatives stall after the PoC stage because the teams involved underestimate the complexity and importance of evaluation.
In this article, I share practical guidance based on my experience and recent research for planning and executing effective RAG evaluations. We’ll cover:
- Dimensions for evaluating a RAG pipeline.
- Common challenges in the evaluation process.
- Metrics that help track and improve performance.
- Strategies to iterate and refine RAG pipelines.
Dimensions of RAG evaluation
Evaluating a RAG pipeline means assessing its behavior across three dimensions:
1. Performance: At its core, performance is the ability of the retriever to retrieve documents relevant to the user query and the generator’s ability to craft an appropriate response using those documents.
2. Cost: A RAG system incurs set-up and operational costs. The setup costs include hardware or cloud services, data acquisition and collection, security and compliance, and licensing. Day-to-day, a RAG system incurs costs for maintaining and updating the knowledge base as well as querying LLM APIs or hosting an LLM locally.
3. Latency: Latency measures how long the system takes to respond to a user query. The main drivers are typically embedding the user query, retrieving relevant documents, and generating the response. Preprocessing and postprocessing steps that are frequently necessary to ensure reliable and consistent responses also contribute to latency.
Why is the evaluation of a RAG pipeline challenging?
The evaluation of a RAG pipeline is challenging for several reasons:
1. RAG systems can consist of many components.
What starts as a simple retriever-generator setup often evolves into a pipeline with multiple components: query rewriting, entity recognition, re-ranking, content filtering, and more.
Each addition introduces a variable that affects performance, costs, and latency, and they must be evaluated both separately and in the context of the overall pipeline.
2. Evaluation metrics fail to fully capture human preferences.
Automatic evaluation metrics continue to improve, but they often miss the mark when compared to human judgment.
For example, the tone of the response (e.g., professional, casual, helpful, or direct) is an important evaluation criterion. Consistently hitting the right tone can make or break a product such as a chatbot. However, tonal nuances are hard to capture with a simple quantitative metric: an LLM might score high on factuality but still feel off-brand or unconvincing in tone, and such judgments are inherently subjective.
Thus, we’ll have to rely on human feedback to assess whether a RAG pipeline meets the expectations of product owners, subject matter experts, and, ultimately, the end customers.
3. Human evaluation is expensive and time-consuming.
While human feedback remains the gold standard, it’s labor-intensive and expensive. Because RAG pipelines are sensitive to even minor tweaks, you’ll often need to re-evaluate after every iteration, which quickly multiplies the time and cost involved.
How to evaluate a RAG pipeline
If you cannot measure it, you cannot improve it.
Peter Drucker
In one of my earlier RAG projects, our team relied heavily on “eyeballing” outputs, that is, spot-checking a few responses to assess quality. While useful for early debugging, this approach quickly breaks down as the system grows. It’s susceptible to recency bias and leads to optimizing for a handful of recent queries instead of robust, production-scale performance.
This leads to overfitting and a misleading impression of the system’s production readiness. Therefore, RAG systems need structured evaluation processes that address all three dimensions (performance, cost, and latency) over a representative and diverse set of queries.
While assessing costs and latency is relatively straightforward and can draw on decades of experience operating traditional software systems, the lack of standardized quantitative metrics and the subjective nature of judging responses make performance evaluation a messy process. However, this is all the more reason why an evaluation process must be put in place and iteratively evolved over the product’s lifetime.
The evaluation of the RAG pipeline is a multi-step process, starting with creating an evaluation dataset, then evaluating the individual components (retriever, generator, etc.), and performing end-to-end evaluation of the full pipeline. In the following sections, I will discuss the creation of an evaluation dataset, metrics for evaluation, and optimization of the performance of the pipeline.
Curating an evaluation dataset
The first step in the RAG evaluation process is the creation of a ground truth dataset. This dataset consists of queries, chunks relevant to the queries, and associated responses. It can either be human-labeled, created synthetically, or a combination of both.
Here are some points to consider:
- The queries can either be written by subject matter experts (SMEs) or generated via an LLM, with SMEs then selecting the useful questions. In my experience, LLMs tend to generate simplistic questions based on exact sentences in the documents.
For example, if a document contains the sentence “Barack Obama was the 44th president of the United States.”, the chances of generating the question “Who was the 44th president of the United States?” are high. However, such simplistic questions are not useful for evaluation. That’s why I recommend that SMEs select questions from those generated by the LLM.
- Make sure your evaluation queries reflect the conditions expected in production in topic, style, and complexity. Otherwise, your pipeline might perform well on test data but fail in practice.
- While creating a synthetic dataset, first calculate the mean number of chunks needed to answer a query based on a sampled set of queries. Then, retrieve a few more chunks than that per query using the retriever you plan to use in production.
- Once you retrieve candidate documents for each query (using your production retriever), you can label them as relevant or irrelevant (0/1 binary labeling) or give a score between 1 to n for relevance. This helps build fine-grained retrieval metrics and identify failure points in document selection.
- For a human-labeled dataset, SMEs can provide high-quality “gold” responses per query. For a synthetic dataset, you can generate several candidate responses and score them across relevant generation metrics.

To generate a human-labeled dataset, use a simple retriever like BM25 to identify a few chunks per query (5-10 is generally sufficient) and let subject-matter experts (SMEs) label these chunks as relevant or non-relevant. Then, have the SMEs write sample responses without directly utilizing the chunks.
To generate a synthetic dataset, first identify the mean number of chunks needed to answer the queries in the evaluation dataset. Then, use the RAG system’s retriever to identify a few more than k chunks per query (k is the average number of chunks typically required to answer a query). Then, use the same generator LLM used in the RAG system to generate the responses. Finally, have SMEs evaluate those responses based on use-case-specific criteria. | Source: Author
Evaluation of the retriever
Retrievers typically pull chunks from the vector database and rank them based on similarity to the query using methods like cosine similarity, keyword overlap, or a hybrid approach. To evaluate the retriever’s performance, we evaluate both what it retrieves and where those relevant chunks appear in the ranked list.
The presence of the relevant chunks is measured by non-rank-based metrics, and presence and rank are measured collectively by rank-based metrics.
Non-rank based metrics
These metrics check whether relevant chunks are present in the retrieved set, regardless of their order.
1. Recall@k measures the fraction of all relevant chunks for a query that appear among the top-k retrieved chunks.
For example, if a query has eight relevant chunks and the retriever retrieves k = 10 chunks per query, and five out of the eight relevant chunks are present among the top 10 ranked chunks, Recall@10 = 5/8 = 62.5%.

In the example on the left, there are 5 out of 8 relevant chunks within the cutoff k = 10, and in the example on the right, there are 3 out of 8 relevant chunks within the cutoff k = 5. As k increases, more relevant chunks are retrieved, resulting in higher recall but potentially more noise. | Modified based on: source
The recall for the evaluation dataset is the mean of the recall for all individual queries.
Recall@k increases with an increase in k. While a higher value of k means that – on average – more relevant chunks reach the generator, it generally also means that more irrelevant chunks (noise) are passed on.
2. Precision@k measures the fraction of the top-k retrieved chunks that are relevant.
For example, if a query has seven relevant chunks and the retriever retrieves k = 10 chunks per query, and six out of seven relevant chunks are present among the 10 chunks, Precision@10 = 6/10 = 60%.

At k = 5, 4 out of 5 retrieved chunks are relevant, resulting in a high Precision@5 of ⅘ = 0.8. At k = 10, 6 out of 10 retrieved chunks are relevant, so the Precision@10 is 6/10 = 0.6. This figure highlights the precision-recall trade-off: increasing k often retrieves more relevant chunks (higher recall) but also introduces more irrelevant ones, which lowers precision. | Modified based on: source
The highly relevant chunks are typically present among the first few retrieved chunks. Thus, lower values of k tend to lead to higher precision. As k increases, more irrelevant chunks are retrieved, leading to a decrease in Precision@k.
The fact that precision and recall tend to move in opposite directions as k varies is known as the precision-recall trade-off. It’s vital to balance both metrics to achieve optimal RAG performance and not overly focus on just one of them.
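As a reference for these definitions, here is a minimal sketch of Recall@k and Precision@k over lists of chunk IDs; the IDs and the example query are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant chunks that appear in the top-k results.
    top_k = retrieved[:k]
    return len(relevant.intersection(top_k)) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are relevant.
    top_k = retrieved[:k]
    return len(relevant.intersection(top_k)) / len(top_k)

# Toy example: 8 relevant chunks, 5 of them among the top 10 retrieved.
relevant = {f"c{i}" for i in range(1, 9)}
retrieved = ["c1", "x1", "c2", "x2", "c3", "x3", "c4", "x4", "c5", "x5"]
print(recall_at_k(retrieved, relevant, 10))     # 0.625, i.e., 5/8
print(precision_at_k(retrieved, relevant, 10))  # 0.5, i.e., 5/10
```

Dataset-level values are simply the mean of these per-query scores.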
Rank-based metrics
These metrics take the chunk’s rank into account, helping assess how well the retriever ranks relevant information.
1. Mean reciprocal rank (MRR) looks at the position of the first relevant chunk. The earlier it appears, the better.
If the first relevant chunk out of the top-k retrieved chunks is present at rank i, then the reciprocal rank for the query is equal to 1/i. The mean reciprocal rank is the mean of reciprocal ranks over the evaluation dataset.
MRR ranges from 0 to 1, where MRR = 0 means no relevant chunk is present among retrieved chunks, and MRR = 1 means that the first retrieved chunk is always relevant.
However, note that MRR only considers the first relevant chunk, disregarding the presence and ranks of all other relevant chunks retrieved. Thus, MRR is best suited for cases where a single chunk is enough to answer the query.
2. Mean average precision (MAP) averages, for each query, the Precision@i values at the ranks i where a relevant chunk appears (the query’s average precision), and then takes the mean over all queries. Thus, MAP considers both the presence and ranks of all the relevant chunks.
MAP ranges from 0 to 1, where MAP = 0 means that no relevant chunk was retrieved for any query in the dataset, and MAP = 1 means that all relevant chunks were retrieved and placed before any irrelevant chunk for every query.
MAP considers both the presence and rank of relevant chunks but treats all relevant chunks as equally important. Since some chunks in the knowledge base may be more relevant to answering the query than others, the order in which the most relevant chunks are ranked also matters, a factor that MAP does not account for. Due to this limitation, the metric is good for evaluating comprehensive retrieval but limited when some chunks are more critical than others.
3. Normalized Discounted Cumulative Gain (NDCG) evaluates not just whether relevant chunks are retrieved but how well they’re ranked by relevance. It compares actual chunk ordering to the ideal one and is normalized between 0 and 1.
To calculate it, we first compute the Discounted Cumulative Gain (DCG@k), which rewards relevant chunks more when they appear higher in the list: the higher the rank, the smaller the reward (users usually care more about top results).
Next, we compute the Ideal DCG (IDCG@k), the DCG we would get if all relevant chunks were perfectly ordered from most to least relevant. IDCG@k serves as the upper bound, representing the best possible ranking.
The Normalized DCG is then:

NDCG@k = DCG@k / IDCG@k
NDCG values range from 0 to 1:
- 1 indicates a perfect ranking (relevant chunks appear in the best possible order)
- 0 means no relevant chunks appear among the top-k results
To evaluate across a dataset, simply average the NDCG@k scores for all queries. NDCG is often considered the most comprehensive metric for retriever evaluation because it considers the presence, position, and relative importance of relevant chunks.
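The rank-based metrics can likewise be computed in a few lines. The sketch below assumes binary relevance labels ordered by retrieval rank; graded relevance scores would plug into the same DCG formula.

```python
import math

def reciprocal_rank(relevance: list[int]) -> float:
    # relevance[i] is 1 if the chunk at rank i+1 is relevant, else 0.
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def average_precision(relevance: list[int]) -> float:
    # Mean of Precision@i over the ranks i where a relevant chunk appears.
    hits, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def dcg(relevance: list[int], k: int) -> float:
    # DCG@k = sum of rel_i / log2(i + 1): relevant chunks ranked early earn more.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

def ndcg(relevance: list[int], k: int) -> float:
    ideal = sorted(relevance, reverse=True)  # best possible ordering
    idcg = dcg(ideal, k)
    return dcg(relevance, k) / idcg if idcg > 0 else 0.0

ranked = [0, 1, 1, 0, 1]  # relevance of the top-5 retrieved chunks
print(reciprocal_rank(ranked))    # 0.5 (first relevant chunk at rank 2)
print(average_precision(ranked))  # mean of 1/2, 2/3, 3/5 ≈ 0.589
print(ndcg(ranked, k=5))
```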
Evaluation of the generator
The generator’s role in a RAG pipeline is to synthesize a final response using the user query, the retrieved document chunks, and any prompt instructions. However, not all retrieved chunks are equally relevant and sometimes, the most relevant chunks might not be retrieved at all. This means the generator needs to decide which chunks to actually use to generate its answer. The chunks the generator actually utilizes are referred to as “cited chunks” or “citations.”
To make this process interpretable and evaluable, we typically design the generator prompt to request explicit citations of sources. There are two common ways to do this in the model’s output:
- Inline references like [1], [2] at the end of sentences
- A “Sources” section at the end of the answer, where the model identifies which input chunks were used.
Consider the following real prompt and generated output:

This response correctly synthesizes the retrieved facts and transparently cites which chunks were used in forming the answer. Including the citations in the output serves two purposes:
- It builds user trust in the generated response, showing exactly where the facts came from
- It enables the evaluation, letting us measure how well the generator used the retrieved content
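A minimal sketch of a prompt template that requests explicit citations might look like the following; the wording and chunk-numbering scheme are illustrative choices, not the exact prompt used above.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    # Number each chunk so the model can reference it as [1], [2], ...
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the numbered context chunks below.\n"
        "Cite the chunks you use with inline references like [1], and list the\n"
        "chunk numbers you relied on in a final 'Sources:' line.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )
```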
However, the quality of the answer isn’t solely determined by retrieval; the LLM utilized in the generator may not be able to synthesize and contextualize the retrieved information effectively. This can lead to the generated response being incoherent, incomplete, or including hallucinations.
Accordingly, the generator in a RAG pipeline has to be evaluated in two dimensions:
- The ability of the LLM to identify and utilize relevant chunks among the retrieved chunks. This is measured using two citation-based metrics, Recall@k and Precision@k.
- The quality of the synthesized response. This is measured using a response-based metric (F1 score at the token level) and qualitative indicators for completeness, relevancy, harmfulness, and consistency.
Citation-based metrics
- Recall@k is defined as the proportion of relevant chunks that were cited compared to the total number of relevant chunks in the knowledge base for the query.
It is an indicator of the joint performance of the retriever and the generator. For the retriever, it indicates the ability to rank relevant chunks higher. For the generator, it measures whether the relevant chunks are chosen to generate the response.
- Precision@k is defined as the proportion of cited chunks that are actually relevant (the number of cited relevant chunks compared to the total number of cited chunks).
It is an indicator of the generator’s ability to identify relevant chunks from those provided by the retriever.
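Given the set of chunks the generator cited (e.g., parsed from the “Sources” section) and the set of relevant chunks from the evaluation dataset, these citation metrics reduce to simple set operations, as in this sketch over chunk IDs.

```python
def citation_recall(cited: set[str], relevant: set[str]) -> float:
    # Share of all relevant chunks for the query that the generator cited.
    return len(cited & relevant) / len(relevant) if relevant else 0.0

def citation_precision(cited: set[str], relevant: set[str]) -> float:
    # Share of cited chunks that are actually relevant.
    return len(cited & relevant) / len(cited) if cited else 0.0
```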
Response-based metrics
While citation metrics assess whether a generator selects the right chunks, we also need to evaluate the quality of the generated response itself. One widely used method is the F1 score at the token level, which measures how closely the generated answer matches a human-written ground truth.
F1 score at token level
The F1 score combines precision (how much of the generated text is correct) and recall (how much of the correct answer is included) into a single value. It’s calculated by comparing the overlap of tokens (typically words) between the generated response and the ground truth sample. Token overlap can be measured as the overlap of individual tokens, bi-grams, trigrams, or n-grams.
The F1 score at the level of individual tokens is calculated as follows:
- Tokenize the ground truth and the generated responses. Let’s see an example:
- Ground truth response: He eats an apple. → Tokens: he, eats, an, apple
- Generated response: He ate an apple. → Tokens: he, ate, an, apple
- Count the true positive, false positive, and false negative tokens in the generated response. In the previous example, we count:
- True positive tokens (correctly matched tokens): 3 (he, an, apple)
- False positive tokens (extra tokens in the generated response): 1 (ate)
- False negative tokens (missing tokens from the ground truth): 1 (eats)
- Calculate precision and recall. In the given example:
- Recall = TP/(TP+FN) = 3/(3+1) = 0.75
- Precision = TP/(TP+FP) = 3/(3+1) = 0.75
- Calculate the F1 score. Let’s see how:
F1 Score = 2 * Recall * Precision / (Precision + Recall) = 2 * 0.75 * 0.75 / (0.75 + 0.75) = 0.75
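Here is the same calculation as a small function, using whitespace tokenization and a token-count (multiset) overlap; it reproduces the 0.75 from the worked example.

```python
from collections import Counter

def token_f1(generated: str, ground_truth: str) -> float:
    gen_tokens = generated.lower().replace(".", "").split()
    gt_tokens = ground_truth.lower().replace(".", "").split()
    # Count shared tokens, respecting how often each token appears.
    overlap = sum((Counter(gen_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("He ate an apple.", "He eats an apple."))  # 0.75
```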
This approach is simple and effective when evaluating short, factual responses. However, the longer the generated and ground truth responses are, the more diverse they tend to become (e.g., due to the use of synonyms and the ability to reflect tone in the response). Hence, even responses that convey the same information in a similar style generally don’t have a high token-level similarity.
Metrics like BLEU and ROUGE, commonly used in text summarization or translation, can also be applied to evaluate LLM-generated responses. However, they assume a fixed reference response and thus penalize valid generations that use different phrasing or structure. This makes them less suitable for tasks where semantic equivalence matters more than exact wording.
That said, BLEU, ROUGE, and similar metrics can be helpful in some contexts—particularly for summarization or template-based responses. Choosing the right evaluation metric depends on the task, the output length, and the degree of linguistic flexibility allowed.
Qualitative indicators
Not all aspects of response quality can be captured by numerical metrics. In practice, qualitative evaluation plays an important role in assessing how useful, safe, and trustworthy a response feels—especially in user-facing applications.
The quality dimensions that matter the most depend on the use case and can either be assessed by subject matter experts, other annotators, or by using an LLM as a judge (which is increasingly common in automated evaluation pipelines).
Some of the common quality indicators in the context of RAG pipelines are:
- Completeness: Does the response answer the query fully?
Completeness is an indirect measure of how well the prompt is written and how informative the retrieved chunks are.
- Relevancy: Is the generated answer relevant to the query?
Relevancy is an indirect measure of the ability of the retriever and generator to identify relevant chunks.
- Harmfulness: Does the generated response have the potential to cause harm to the user or others?
Harmfulness is an indirect measure of hallucination, factual errors (e.g., getting a math calculation wrong), or oversimplifying the content of the chunks to give a succinct answer, leading to loss of essential information.
- Consistency: Is the generated answer in sync with the chunks provided to the generator?
A key signal for hallucination detection in the generator’s output—if the model makes unsupported claims, consistency is compromised.
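One way to automate these checks is the LLM-as-a-judge pattern mentioned above. The sketch below uses the OpenAI Python client as an example provider; the judge model, rubric wording, and 1-to-5 scale are assumptions to adapt to your own setup.

```python
import json
from openai import OpenAI  # assumes the openai package is installed and configured

client = OpenAI()

JUDGE_PROMPT = """Rate the answer to the query on a scale of 1-5 for each criterion:
completeness, relevancy, harmfulness, and consistency with the provided chunks.
Return JSON like {{"completeness": 4, "relevancy": 5, "harmfulness": 1, "consistency": 4}}.

Query: {query}
Retrieved chunks:
{chunks}
Answer: {answer}"""

def judge(query: str, chunks: list[str], answer: str) -> dict:
    prompt = JUDGE_PROMPT.format(query=query, chunks="\n".join(chunks), answer=answer)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; use your preferred judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the judge returns plain JSON; add parsing guards in production.
    return json.loads(response.choices[0].message.content)
```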
End-to-end evaluation
In an ideal world, we’d be able to summarize the effectiveness of a RAG pipeline with a single, reliable metric that fully reflects how well all the components work together. If that metric crossed a certain threshold, we’d know the system was production-ready. Unfortunately, that’s not realistic.
RAG pipelines are multi-stage systems, and each stage can introduce variability. On top of that, there’s no universal way to measure whether a response aligns with human preferences, a problem only exacerbated by the subjectiveness with which humans judge textual responses.
Additionally, the performance of a downstream component depends on the quality of upstream components. No matter how good your generator prompt is, it will perform poorly if the retriever fails to identify relevant documents – and if there are no relevant documents in the knowledge base, optimizing the retriever will not help.
In my experience, it’s helpful to approach the end-to-end evaluation of RAG pipelines from the end user’s perspective. The end user asks a question and gets a response. They do not care about the internal workings of the system. Thus, only the quality of the generated responses and overall latency matter.
That’s why, in most cases, we use generator-focused metrics like the F1 score or human-judged quality as a proxy for end-to-end performance. Component-level metrics (for retrievers, rankers, etc.) are still valuable, but mostly as diagnostic tools to determine which components are the most promising starting points for improvement efforts.
Optimizing the performance of a RAG pipeline
The first step toward a production-ready RAG pipeline is to establish a baseline. This typically involves setting up a naive RAG pipeline using the simplest available options for each component: a basic embedding model, a straightforward retriever, and a general-purpose LLM.
Once this baseline is implemented, we use the evaluation framework discussed earlier to assess the system’s initial performance. This includes:
- Retriever metrics, such as Recall@k, Precision@k, MRR, and NDCG.
- Generator metrics, including citation precision and recall, token-level F1 score, and qualitative indicators such as completeness and consistency.
- Operational metrics, such as latency and cost.
Once we’ve collected baseline values across key evaluation metrics, the real work begins: systematic optimization. From my experience, it’s most effective to break this process into three stages: pre-processing, processing, and post-processing.
Each stage builds on the previous one, and changes in upstream components often impact downstream behavior. For example, improvement in the performance of the retriever via query enhancement techniques affects the quality of generated responses.
However, the reverse is not true: if the performance of the generator is improved through better prompts, it does not affect the performance of the retriever. This unidirectional impact of changes in the RAG pipeline suggests a natural framework for optimization: evaluate and optimize each stage sequentially, focusing only on the components from the current stage onward.

Stage 1: Pre-processing
This phase focuses on everything that happens before retrieval. Optimization efforts here include:
- Refining the chunking strategy
- Improving the document indexing
- Utilizing metadata to filter or group content
- Applying query rewriting, query expansion, and routing
- Performing entity extraction to sharpen the query intent
Optimizing the knowledge base (KB)
When Recall@k is low (suggesting the retriever is not surfacing relevant content) or citation precision is low (indicating many irrelevant chunks are being passed to the generator), it’s often a sign that relevant content isn’t being found or used effectively. This points to potential problems in how documents are stored and chunked. By optimizing the knowledge base along the following dimensions, these problems can be mitigated:
1. Chunking Strategy
There are several reasons why documents must be split into chunks:
- Context window limitations: A single document may be too large to fit into the context of the LLM. Splitting it allows only relevant segments to be passed into the model.
- Partial relevance: Multiple documents or different parts of a single document may contain useful information for answering a query.
- Improved embeddings: Smaller chunks tend to produce higher-quality embeddings because fewer unrelated tokens are projected into the same vector space.
Poor chunking can lead to decreased retrieval precision and recall, resulting in downstream issues like irrelevant citations, incomplete answers, or hallucinated responses. The right chunking strategy depends on the type of documents being dealt with.
- Naive chunking: For plain text or unstructured documents (e.g., novels, transcripts), use a simple fixed-size token-based approach. This ensures uniformity but may break semantic boundaries, leading to noisier retrieval.
- Logical chunking: For structured content (e.g., manuals, policy documents, HTML or JSON files), divide the document semantically using sections, subsections, headers, or markup tags. This retains meaningful context within each chunk and allows the retriever to distinguish content more effectively.
Logical chunking typically results in better-separated embeddings in the vector space, improving both retriever recall (due to easier identification of relevant content) and retriever precision (by reducing overlap between semantically distinct chunks). These improvements are often reflected in higher citation recall and more grounded, complete generated responses.
2. Chunk Size
Chunk size impacts embedding quality, retriever latency, and response diversity. Very small chunks can lead to fragmentation and noise, while excessively large chunks may reduce embedding effectiveness and cause context window inefficiencies.
A good strategy I utilize in my projects is to perform logical chunking with the maximum possible chunk size (say a few hundred to a couple of thousand tokens). If the size of the section/subsection goes beyond the maximum token size, it is divided into two or more chunks. This strategy gives longer chunks that are semantically and structurally logical, leading to improved retrieval metrics and more complete, diverse responses without significant latency trade-offs.
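A minimal sketch of this strategy: split on logical section boundaries first, and fall back to fixed-size splitting only for sections that exceed the maximum chunk size. The blank-line section delimiter and the word-count proxy for tokens are simplifications.

```python
def logical_chunks(document: str, max_tokens: int = 500) -> list[str]:
    chunks = []
    # Assume sections are separated by blank lines (headings, paragraphs, etc.).
    for section in document.split("\n\n"):
        tokens = section.split()  # crude proxy for a real tokenizer
        if len(tokens) <= max_tokens:
            if section.strip():
                chunks.append(section.strip())
        else:
            # Oversized section: fall back to fixed-size splits within it.
            for start in range(0, len(tokens), max_tokens):
                chunks.append(" ".join(tokens[start:start + max_tokens]))
    return chunks
```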
3. Metadata
Metadata filtering allows the retriever to narrow its search to more relevant subsets of the knowledge base. When Precision@k is low or the retriever is overwhelmed with irrelevant matches, adding metadata (e.g., document type, department, language) can significantly improve retrieval precision and reduce latency.
Optimizing the user query
Poor query formulation can significantly degrade retriever and generator performance even with a well-structured knowledge base. For example, consider the query: “Why is a keto diet the best form of diet for weight loss?”.
This question contains a built-in assumption—that the keto diet is the best—which biases the generator into affirming that claim, even if the supporting documents present a more balanced or contrary view. While relevant articles may still be retrieved, the framing of the response will likely reinforce the incorrect assumption, leading to a biased, potentially harmful, and factually incorrect output.
If the evaluation surfaces issues like low Recall@k, low Precision@k (especially for vague, overly short, or overly long queries), irrelevant or biased answers (especially when queries contain assumptions), or poor completeness scores, the user query may be the root cause. To improve the response quality, we can apply these query preprocessing strategies:
Query rewriting
Short or ambiguous queries like “RAG metrics” or “health insurance” lack context and intent, resulting in low recall and ranking precision. A simple rewriting step using an LLM, guided by in-context examples developed with SMEs, can make them more meaningful:
- From “RAG metrics” → “What are the metrics that can be used to measure the performance of a RAG system?”
- From “Health insurance” → “Can you tell me about my health insurance plan?”
This improves retrieval accuracy and boosts downstream F1 scores and qualitative ratings (e.g., completeness or relevance).
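A rewriting step can be a single LLM call with a couple of in-context examples. The sketch below uses the OpenAI Python client as one possible provider; the examples and model name are placeholders to replace with SME-curated ones.

```python
from openai import OpenAI  # assumes the openai package is installed and configured

client = OpenAI()

REWRITE_PROMPT = """Rewrite the user query into a clear, self-contained question.

Examples:
- "RAG metrics" -> "What are the metrics that can be used to measure the performance of a RAG system?"
- "Health insurance" -> "Can you tell me about my health insurance plan?"

User query: {query}
Rewritten query:"""

def rewrite_query(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": REWRITE_PROMPT.format(query=query)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```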
Adding context to the query
A vice president working in the London office of a company types “What is my sabbatical policy?”. Because the query doesn’t mention their role or location, the retriever surfaces general or US-based policies instead of the relevant UK-specific document. This results in an inaccurate or hallucinated response based on an incomplete or non-applicable context.
Instead, if the VP types “What is the sabbatical policy for a vice president of [company] in the London office?” the retriever can more accurately identify relevant documents, improving retrieval precision and reducing ambiguity in the answer. Injecting structured user metadata into the query helps guide the retriever toward more relevant documents, improving both Precision@k and the factual consistency of the final response.
Simplifying overly long queries
A user submits the following query covering multiple subtopics or priorities: “I’ve been exploring different retirement investment options in the UK, and I’m particularly interested in understanding how pension tax relief works for self-employed individuals, especially if I plan to retire abroad. Can you also tell me how it compares to other retirement products like ISAs or annuities?”
This query includes multiple subtopics (pension tax relief, retirement abroad, product comparison), making it difficult for the retriever to identify the primary intent and return a coherent set of documents. The generator will likely respond vaguely or focus only on one part of the question, ignoring or guessing the rest.
If the user focuses the query on a single intent instead, asking “How does pension tax relief work for self-employed individuals in the UK?”, retrieval quality improves (higher Recall@k and Precision@k), and the generator is more likely to produce a complete, accurate output.
To support this, a helpful mitigation strategy is to implement a token-length threshold: if a user query exceeds a set number of tokens, it is rewritten (manually or via an LLM) to be more concise and focused. This threshold is determined by looking at the distribution of the size of the user requests for the specific use case.
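Checking the query length before retrieval takes only a few lines; the sketch below uses tiktoken, and the 60-token threshold is an arbitrary example value to replace with one derived from your query-length distribution.

```python
import tiktoken  # assumes tiktoken is installed

encoder = tiktoken.get_encoding("cl100k_base")

def needs_simplification(query: str, max_tokens: int = 60) -> bool:
    # Flag overly long queries for rewriting (manually or via an LLM).
    return len(encoder.encode(query)) > max_tokens
```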
Query routing
If your RAG system serves multiple domains or departments, misrouted queries can lead to high latency and irrelevant retrievals. Using intent classification or domain-specific rules can direct queries to the correct vector database or serve cached responses for frequently asked questions. This improves latency and consistency, particularly in multi-tenant or enterprise environments.
Optimizing the vector database
The vector database is central to retrieval performance in a RAG pipeline. Once documents in the knowledge base are chunked, they are passed through an embedding model to generate high-dimensional vector representations. These vector embeddings are then stored in a vector database, where they can be efficiently searched and ranked based on similarity to an embedded user query.
If your evaluation reveals low Recall@k despite the presence of relevant content, poor ranking metrics such as MRR or NDCG, or high retrieval latency (particularly as your knowledge base scales), these symptoms often point to inefficiencies in how vector embeddings are stored, indexed, or retrieved. For example, the system may retrieve relevant content too slowly, rank it poorly, or generate generic chunks that don’t align with the user’s query context (leading to off-topic outputs from the generator).
To address it, we need to select the appropriate vector database technology and configure the embedding model to match the use case in terms of domain relevance and vector size.
Choosing the right vector database
Dedicated vector databases (e.g., Pinecone, Weaviate, OpenSearch) are designed for fast, scalable retrieval in high-dimensional spaces. They typically offer better indexing, retrieval speed, metadata filtering, and native support for change data capture. These are important as your knowledge base grows.
In contrast, extensions to relational databases (such as pgvector in PostgreSQL) may suffice for small-scale or low-latency applications but often lack some other advanced features.
I recommend using a dedicated vector database for most RAG systems, as they are highly optimized for storage, indexing, and similarity search at scale. Their advanced capabilities tend to significantly improve both retriever accuracy and generator quality, especially in complex or high-volume use cases.
Embedding model selection
Embedding quality directly impacts the semantic accuracy of retrieval. There are two factors to consider here:
- Domain relevance: Use a domain-specific embedding model (e.g., BioBERT for medical text) for specialized use cases. For general applications, high-quality open embeddings like OpenAI’s models usually suffice.
- Vector size: Larger embedding vectors capture the nuances in the chunks better but increase storage and computation costs. For smaller knowledge bases, a more compact embedding model is often sufficient and keeps cost and latency down.
Stage 2: Processing
This is where the core RAG mechanics happen: retrieval and generation. The decisions for the retriever include choosing the optimal retrieval algorithm (dense retrieval, hybrid algorithms, etc.), type of retrieval (exact vs approximate), and reranking of the retrieved chunks. For the generator, these decisions pertain to choosing the LLM, refining the prompt, and setting the temperature.
At this stage of the pipeline, evaluation results often reveal whether the retriever and generator are working well together. You might see issues like low Recall@k or Precision@k, weak citation recall or F1 scores, hallucinated responses, or high end-to-end latency. When these show up, it’s usually a sign that something’s off in either the retriever or the generator, both of which are key areas to focus on for improvement.
Optimizing the retriever
If the retriever performs poorly (low recall, precision, MRR, or NDCG), the generator will receive irrelevant documents. It will then produce factually incorrect and hallucinated responses as it tries to fill the gaps in the retrieved content from its internal knowledge.
The mitigation strategies for poor retrieval include the following:
Ensuring data quality in the knowledge base
The retriever’s quality is constrained by the quality of the documents in the knowledge base. If the documents in the knowledge base are unstructured or poorly maintained, they may result in overlapping or ambiguous vector embeddings. This makes it harder for the retriever to distinguish between relevant and irrelevant content. Clean, logically chunked documents improve both retrieval recall and precision, as covered in the pre-processing stage.
Choose the optimal retrieval algorithm
Retrieval algorithms fall into two categories:
- Sparse retrievers (e.g., BM25) rely on keyword overlap. They are fast, explainable, and can embed long documents with ease, but they struggle with semantic matching. They are exact match algorithms as they identify relevant chunks for a query based on an exact match of keywords. Because of this feature, they generally perform poorly at tasks that involve semantic similarity search such as question answering or text summarization.
- Dense retrievers embed queries and chunks in a continuous vector space and identify relevant chunks based on similarity scores. They generally offer better performance (higher recall) due to semantic matching but are slower than sparse retrievers. In practice, though, dense retrieval is still fast and is rarely the source of high latency. Therefore, whenever possible, I recommend using either a dense retrieval algorithm or a hybrid of sparse and dense retrieval, e.g., reciprocal rank fusion (see the sketch below). A hybrid approach leverages the precision of sparse algorithms and the flexibility of dense embeddings.
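Reciprocal rank fusion combines sparse and dense rankings without having to calibrate their score scales. The sketch below assumes each retriever returns an ordered list of chunk IDs and uses the commonly cited smoothing constant k = 60.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is an ordered list of chunk IDs from one retriever.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Chunks ranked highly by multiple retrievers accumulate the largest scores.
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["c3", "c1", "c7"]  # e.g., BM25 ranking
dense = ["c1", "c2", "c3"]   # e.g., embedding-based ranking
print(reciprocal_rank_fusion([sparse, dense]))  # c1 and c3 rise to the top
```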
Apply re-ranking
Even when the retriever pulls the right chunks, they don’t always show up at the top of the list. That means the generator might miss the most useful context. A simple way to fix this is by adding a re-ranking step—using a dense model or a lightweight LLM—to reshuffle the results based on deeper semantic understanding. This can make a big difference, especially when you’re working with large knowledge bases where the chunks retrieved in the first pass all have very high and similar similarity scores. Re-ranking helps bring the most relevant information to the top, improving how well the generator performs and boosting metrics like MRR, NDCG, and overall response quality.
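One way to add such a re-ranking step is with a cross-encoder. The sketch below uses the sentence-transformers library with a publicly available MS MARCO cross-encoder as an example choice.

```python
from sentence_transformers import CrossEncoder  # assumes sentence-transformers is installed

# A lightweight cross-encoder trained for passage re-ranking (example choice).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Score each (query, chunk) pair jointly, then keep the top_n chunks.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```

Because the cross-encoder reads the query and chunk together, it can separate chunks whose first-pass similarity scores are nearly identical.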
Optimizing the generator
The generator is responsible for synthesizing a response based on the chunks retrieved from the retriever. It is the biggest source of latency in the RAG pipeline and also where a lot of quality issues tend to surface, especially if the inputs are noisy or the prompt isn’t well-structured.
You might notice slow responses, low F1 scores, or inconsistent tone and structure from one answer to the next. All of these are signs that the generator needs tuning. Here, we can tune two components for optimal performance: the large language model (LLM), and the prompt.
Large language model (LLM)
With the wide variety of LLMs available today, it is important to select the right one for the generator in your use case. The performance of the LLM depends on the following factors:
- Size of the LLM: In general, larger models (e.g., GPT-4, Llama) perform better than smaller ones in synthesizing a response from multiple chunks. However, they are also more expensive and have higher latency. Model sizing is an evolving research area, with OpenAI, Meta, Anthropic, and others releasing smaller models that perform on par with larger ones. I tend to run ablation studies on a few LLMs before settling on the one that gives the best combination of generator metrics for my use case.
- Context size: Although modern LLMs support large context windows (up to 100k tokens), this doesn’t mean all available space should be used. In my experience, given the huge context size that current state-of-the-art LLMs provide, the primary deciding factor is the number of chunks that should be passed instead of the maximum number of chunks that can be passed. This is because models exhibit a “lost-in-the-middle” issue, favoring content at the beginning and end of the context window. Passing too many chunks can dilute attention and degrade the generator metrics. It’s better to pass a smaller, high-quality subset of chunks, ranked and filtered for relevance.
- Temperature: Setting an optimum temperature (t) strikes the right balance between determinism and randomness of the next token during answer generation. If the use case requires deterministic responses, setting t=0 will increase the reproducibility of the responses. Note that t=0 does not mean a completely deterministic answer; it just means that it narrows the probability distribution of likely next tokens, which can improve consistency across responses.
Design better prompts
Depending on who you talk to, prompting tends to be either overhyped or undervalued: overhyped because even with good prompts, the other components of RAG contribute significantly to the performance, and undervalued because well-structured prompts can take you quite close to ideal responses. The truth, in my experience, lies somewhere in between. A well-structured prompt won’t fix a broken pipeline, but it can take a solid setup and make it meaningfully better.
A teammate of mine, a senior engineer, once told me to think of prompts like code. That idea stuck with me. Just like clean code, a good prompt should be easy to read, focused, and follow the “single responsibility” principle. In practice, that means keeping prompts simple and asking them to do one or two things really well. Adding in-context examples—realistic query–response pairs from your production data—can also go a long way in improving response quality.
There’s also a lot of talk in the literature about Chain of Thought prompting, where you ask the model to reason step by step. While that can work well for complex reasoning tasks, I haven’t seen it add much value in my day-to-day use cases—like chatbots or agent workflows. In fact, it often increases latency and hallucination risk. So unless your use case truly benefits from reasoning out loud, I’d recommend keeping prompts clean, focused, and purpose-driven.
Stage 3: Post-processing
Even with a strong retriever and a well-tuned generator, I found that the output of a RAG pipeline may still need a final layer of quality control checks around hallucinations and harmfulness before it’s shown to users.
This is because no matter how high-quality the prompt is, it cannot fully prevent the generator from producing responses that are hallucinated, overly confident, or even harmful, especially when dealing with sensitive or high-stakes content. In other cases, the response might be technically correct but need polishing: adjusting the tone, adding context, personalizing for the end user, or including disclaimers.
This is where post-processing comes in. While optional, this stage acts as a safeguard, ensuring that responses meet quality, safety, and formatting standards before reaching the end user.
The checks for hallucination and harmfulness can either be integrated into the LLM call of the generator (e.g., OpenAI returns harmfulness, toxicity, and bias scores for each response) or performed via a separate LLM call once the generator has synthesized the response. In the latter case, I recommend using a stronger model than the one used for generation if latency and cost allow. The second model evaluates the generated response in the context of the original query and the retrieved chunks, flagging potential risks or inconsistencies.
When the goal is to rephrase, format, or lightly enhance a response rather than evaluate it for safety, I’ve found that a smaller LLM performs well enough. Because this model only needs to clean or refine the text, it can handle the task effectively without driving up latency or cost.
Post-processing doesn’t need to be complicated, but it can have a big impact on the reliability and user experience of a RAG system. When used thoughtfully, it adds an extra layer of confidence and polish that’s hard to achieve through generation alone.
Final thoughts
Evaluating a RAG pipeline isn’t something you do once and forget about; it’s a continuous process that plays a big role in whether your system actually works well in the real world. RAG systems are powerful, but they’re also complex. With so many moving parts, it’s easy to miss what’s actually going wrong or where the biggest improvements could come from.
The best way to make sense of this complexity is to break things down. Throughout this article, we looked at how to evaluate and optimize RAG pipelines in three stages: pre-processing, processing, and post-processing. This structure helps you focus on what matters at each step, from chunking and embedding to tuning your retriever and generator to applying final quality checks before showing an answer to the user.
If you’re building a RAG system, the best next step is to get a simple version up and running, then start measuring. Use the metrics and framework we’ve covered to figure out where things are working well and where they’re falling short. From there, you can start making small, focused improvements, whether that’s rewriting queries, tweaking your prompts, or switching out your retriever. If you already have a system in production, it’s worth stepping back and asking: Are we still optimizing based on what really matters to our users?
There’s no single metric that tells you everything is fine. But by combining evaluation metrics with user feedback and iterating stage by stage, you can build something that’s not just functional but also reliable and useful.