Since 2023, AI products have rapidly evolved from simple “GPT-wrappers” around LLMs—where a single prompt was the main interface—to far more complex, compound AI systems.
These include everything from RAG pipelines to multi-agent workflows. In these modern systems, a single output is often the result of multiple, interdependent steps. This shift means that evaluating only the final output, or just a single prompt, is no longer sufficient. Errors can cascade through the pipeline, making it crucial to evaluate each component individually. This is where tracing becomes essential: by breaking down the workflow into discrete steps, tracing allows us to monitor, diagnose, and improve each part of the system, leading to more robust and reliable AI products.
This post explores why assessing distinct parts of a multi-step LLM application is crucial for building reliable AI products. We'll discuss the challenges of evaluating these workflows and demonstrate how to implement component-level metrics using a practical RAG pipeline example involving MongoDB, OpenAI, and HoneyHive for evaluation.
For the full code and setup instructions, check out the HoneyHive cookbooks repository!
Challenges of Evaluating Multi-Step Workflows

Multi-step workflows, such as those in Retrieval-Augmented Generation (RAG) pipelines, present unique challenges when it comes to evaluation. These systems rely on multiple interconnected components—like document retrieval, ranking, and answer generation—each of which can influence the final output. Here are some of the key challenges:
- Identifying Bottlenecks: Difficulty pinpointing which specific step causes poor overall performance without component-level evaluation.
- Hidden Intermediate Errors: Seemingly correct final outputs can mask underlying flaws in earlier steps, risking failures in different scenarios.
- Error Propagation: Mistakes in early stages can compound, significantly degrading the quality of the final result.
- Interdependencies: Components rely heavily on outputs from previous steps, making it hard to assess their individual performance in isolation.
By breaking down the evaluation process and assessing each component individually, developers can gain deeper insights into their systems, identify bottlenecks, and make targeted improvements. This approach is especially important for building reliable and scalable AI systems that perform well across a variety of use cases.
Component-Level Evaluation with HoneyHive

Addressing this challenge requires tools designed for the intricacies of multi-step workflows. HoneyHive provides the necessary framework by capturing detailed traces of your application's execution. Each trace breaks down the workflow into individual spans, which represent specific operations like LLM calls, function executions, or tool usage.
This granular view is key. HoneyHive allows you to attach custom metrics and metadata at different levels of this hierarchy:
- Span-Level Metrics (enrich_span): You can calculate and log metrics specific to a single step or component directly within its corresponding span. For instance, in a RAG pipeline, you could calculate and log retrieval_relevance within the get_relevant_docs span, or answer_faithfulness within the generate_response span. This isolates the performance measurement to the exact point of execution.
- Session-Level Metrics (enrich_session): You can also log metrics that summarize the entire end-to-end execution or relate to the interaction as a whole. Examples include overall latency, total token count, or metrics comparing the final output to the initial input.
This flexible approach allows you to define evaluators that operate on the final output (like checking consistency against ground truth) while simultaneously capturing performance indicators for every intermediate step. You gain both a high-level view and the deep, component-specific insights needed for effective debugging and optimization.
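As a quick sketch of where each call sits, here is a minimal example; the function names, metric values, and import path below are illustrative placeholders rather than the cookbook's code (the full RAG example later in this post follows the same pattern):

from honeyhive import trace, enrich_span, enrich_session  # adjust to your SDK version if the import path differs

@trace
def retrieve(query: str) -> list:
    """Hypothetical retrieval step with placeholder logic."""
    docs = ["placeholder document"]
    # Span-level metric: recorded on this step's span only
    enrich_span(metrics={"retrieval_relevance": 0.87})
    return docs

def answer(query: str) -> str:
    """Hypothetical end-to-end call."""
    docs = retrieve(query)
    response = "placeholder answer"
    # Session-level metric: summarizes the whole run
    enrich_session(metrics={"num_retrieved_docs": len(docs)})
    return response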
Case Study: Evaluating a RAG Pipeline Step-by-Step

Let's illustrate component-level evaluation with a practical example: a Retrieval-Augmented Generation (RAG) pipeline designed for medical/health question answering. This pipeline uses MongoDB Atlas for vector search to retrieve relevant medical articles and OpenAI's API to generate answers based on those articles.
On the evaluation side, we'll demonstrate how to run an offline experiment using HoneyHive's evaluate function. This involves:
- Defining a curated dataset with input queries and ground truth answers.
- Running our RAG application logic against each dataset entry.
- Using evaluators to compute metrics for both individual components and the overall pipeline output.
This experimental setup allows us to systematically track metrics and compare performance across different pipeline versions or configurations.
You can check the complete code and setup instructions for this case study in HoneyHive's cookbook repository.
1. Overview

Let's go through the main components of this example by splitting it into two parts: the RAG pipeline we wish to evaluate and the evaluators used to assess its performance.
RAG Pipeline

The pipeline consists of the following steps:
- Document Retrieval: Using MongoDB's vector search capabilities, we retrieve the most relevant documents for a given query.
- Response Generation: Using OpenAI's API, we generate a response based on the retrieved documents and the query.
Evaluators

- Retrieval Evaluator: This evaluator assesses the retrieval process by measuring the semantic similarity between the query and the retrieved documents.
- Response Evaluator: This evaluator measures the semantic similarity between the model's final response and the provided ground truth for each query.
- Pipeline Evaluator: This evaluator generates basic metrics related to the overall pipeline, such as the number of retrieved documents and the query length.

[Figure: Overview of the pipeline to be evaluated.]
2. Running the Experiment

To evaluate the RAG pipeline, we leverage HoneyHive's evaluate function, which is designed to execute the pipeline for each entry in the dataset while capturing detailed execution traces and spans. By providing the function to be evaluated (rag_pipeline), the dataset, and the primary evaluator (consistency_evaluator), we can assess the similarity between the model's output and the ground truth.
Additionally, since we are tracing and enriching the pipeline's intermediate steps (as detailed in the following section), the evaluation process automatically logs custom metrics for these components, offering deeper insights into the pipeline's performance.
if __name__ == "__main__":
    # Setup MongoDB with sample data
    setup_mongodb()

    # Run experiment
    evaluate(
        function=rag_pipeline,
        hh_api_key=os.getenv('HONEYHIVE_API_KEY'),
        hh_project=os.getenv('HONEYHIVE_PROJECT'),
        name='MongoDB RAG Pipeline Evaluation',
        dataset=dataset,
        evaluators=[consistency_evaluator],
    )
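The consistency_evaluator passed above is defined in the cookbook. As a rough sketch of what such an evaluator can look like, the version below assumes that each evaluator receives the pipeline output together with the dataset entry's inputs and ground truths, and that semantic similarity is scored with a sentence-transformers model (all-MiniLM-L6-v2 is an arbitrary choice here); refer to the cookbook for the actual implementation.

from sentence_transformers import SentenceTransformer, util

# Assumption: the same kind of embedding model used elsewhere in the pipeline
model = SentenceTransformer("all-MiniLM-L6-v2")

def consistency_evaluator(outputs, inputs, ground_truths):
    """Scores semantic similarity between the model's answer and the ground truth (sketch)."""
    answer_embedding = model.encode(outputs)
    truth_embedding = model.encode(ground_truths["response"])
    # Cosine similarity is in [-1, 1]; clamp to [0, 1] for a friendlier metric
    similarity = float(util.cos_sim(answer_embedding, truth_embedding))
    return max(0.0, similarity)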
3. Implementing the RAG Pipeline

The RAG pipeline is designed to handle the entire process of retrieving relevant documents and generating a response. Session-level metrics are captured using the enrich_session function, which provides basic insights such as the number of retrieved documents and the query length.
def rag_pipeline(inputs: Dict, ground_truths: Dict) -> str:
    """Complete RAG pipeline that retrieves docs and generates response"""
    query = inputs["query"]
    docs = get_relevant_docs(query)
    response = generate_response(docs, query)

    enrich_session(metrics={
        "rag_pipeline": {
            "num_retrieved_docs": len(docs),
            "query_length": len(query.split())
        }
    })
    return response
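The generate_response function is also part of the cookbook. A minimal sketch, assuming the official OpenAI Python client and a gpt-4o-mini chat model (both assumptions; the cookbook may use a different model and prompt) and reusing the @trace decorator from the HoneyHive SDK shown below, could look like this:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

@trace
def generate_response(docs: list, query: str) -> str:
    """Generates an answer grounded in the retrieved documents (sketch)."""
    context = "\n\n".join(docs)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat-capable model works here
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content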
To ensure thorough evaluation, retrieval-specific metrics are enriched directly within the get_relevant_docs function using the enrich_span method. This allows for detailed assessment of the retrieval process at each step.
@trace
def get_relevant_docs(query: str, top_k: int = 2):
    """Retrieves relevant documents from MongoDB using semantic search"""
    # Compute query embedding
    query_embedding = model.encode(query).tolist()

    retrieved_docs = []
    retrieved_embeddings = []
    try:
        # Search for similar documents using vector similarity
        pipeline = [
            {
                "$vectorSearch": {
                    "index": "vector_index",
                    "path": "embedding",
                    "queryVector": query_embedding,
                    "numCandidates": top_k * 2,  # Search through more candidates for better results
                    "limit": top_k
                }
            }
        ]
        results = list(collection.aggregate(pipeline))
        retrieved_docs = [doc["content"] for doc in results]
        retrieved_embeddings = [doc["embedding"] for doc in results]
    except Exception as e:
        print(f"Vector search error: {e}")
        # Fallback to basic find if vector search fails
        results = list(collection.find().limit(top_k))
        retrieved_docs = [doc["content"] for doc in results]
        retrieved_embeddings = [doc["embedding"] for doc in results]

    # Calculate and record metrics regardless of which path was taken
    if retrieved_embeddings:
        retrieval_relevance = retrieval_relevance_evaluator(query_embedding, retrieved_embeddings)
    else:
        retrieval_relevance = 0.0

    enrich_span(metrics={
        "retrieval_relevance": retrieval_relevance,
        "num_docs_retrieved": len(retrieved_docs)
    })
    return retrieved_docs
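The retrieval_relevance_evaluator called inside get_relevant_docs is likewise defined in the cookbook. One straightforward way to implement it, sketched here under the assumption that relevance is the mean cosine similarity between the query embedding and each retrieved document embedding:

import numpy as np

def retrieval_relevance_evaluator(query_embedding, retrieved_embeddings):
    """Average cosine similarity between the query and the retrieved documents (sketch)."""
    query_vec = np.array(query_embedding)
    scores = []
    for doc_embedding in retrieved_embeddings:
        doc_vec = np.array(doc_embedding)
        cosine = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
        scores.append(cosine)
    return float(np.mean(scores))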
4. Defining the Dataset

We create a small evaluation dataset containing input queries and their corresponding ground-truth answers.
dataset = [
    {
        "inputs": {"query": "How does exercise affect diabetes?"},
        "ground_truths": {"response": "Regular exercise reduces diabetes risk..."}
    },
    # ... more examples ...
    {
        "inputs": {"query": "How do sleep patterns affect mental health?"},
        "ground_truths": {"response": "Sleep patterns significantly impact mental well-being..."}
    }
]
5. Analyzing Results

After running the experiment, you can view the results in the Experiments page in HoneyHive:
[Figure: The Experiments Dashboard.]

For the retrieval step, we observe that some queries resulted in low retrieval relevance. Examining the Evaluation Summary on the left, we also notice that the average response consistency (0.73) is higher than the average retrieval relevance (0.41). Let's take a closer look at the distribution of these metrics:
[Figure: Response Consistency - Distribution.]
[Figure: Retrieval Relevance - Distribution.]
This suggests that while the model's responses are generally on-topic, they may not always be grounded in the source of truth—particularly for the two examples with retrieval relevance scores below 0.25.
Let's drill down into one of these examples:
[Figure: Low retrieval relevance data point.]

It's clear that the retrieved documents are not relevant to the user's query. With this insight, we can narrow our investigation to the retrieval system as the likely source of the issue. In this case, we discover that our vector database lacks sufficient documents on specific topics, such as stress and sleep disorders, which explains the low relevance scores.
Next Steps

Once you've identified issues—such as low retrieval relevance—in your pipeline, the next step is to experiment with targeted improvements. Improve the coverage of your vector database, try different retrieval strategies, compare the results, and iteratively optimize your metrics. Additionally, incorporate human evaluations using tools like "Review Mode" to ensure that your automated metrics align with the judgments of your domain experts.
As your product and user behaviors evolve, make it a habit to regularly update and expand your evaluation dataset to ensure your benchmarks remain relevant. This combination of metric-driven and human-in-the-loop evaluation will help you build more effective and reliable multi-step LLM workflows.