At HoneyHive, our mission is to empower developers to build reliable and high-performing GenAI applications. Offline Evaluations play a crucial role in this process by allowing developers to test the quality of their applications against datasets of inputs (and optionally, ground truth labels) before deploying to production. This enables teams to compare results across experiments, identify failure modes across thousands of test cases, and automate testing workflows to ensure performance and reliability of their GenAI apps.
Today, we’re excited to introduce powerful new updates to Offline Evaluations. We purposefully rebuilt our abstractions to support the countless ways developers must test their GenAI applications today — from evaluating prompts, models, and hyperparameter settings, to benchmarking RAG pipelines, tool-use, vector databases, GPU performance, and more.
To our customers and partners — thank you for all your feedback over these past few months. We’re excited to see what you evaluate with HoneyHive!
What’s New
Evaluation Reports
We've completely redesigned Evaluation Reports to support complex LLM pipelines, multiple evaluators across different spans, and deep data analytics capabilities.
New features like filters, groups, and Pass/Fail Percentage Summaries allow you to quickly understand how your application performs across different dimensions. For example, you can filter results by a specific input field, group related test cases together, and see at a glance what percentage of your test cases are passing or failing based on custom evaluators you define.
You can easily view aggregations and distributions of scores across your experiment.
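To make the idea concrete, here is a minimal sketch in plain Python (not the HoneyHive SDK) of how a Pass/Fail Percentage Summary could be computed over evaluator results, grouped by an input field. The record fields, the `faithfulness` evaluator, and the 0.8 pass threshold are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical evaluation results: one record per test case, with its inputs
# and an evaluator score. Field names and threshold are illustrative only.
results = [
    {"inputs": {"product": "checkout"}, "faithfulness": 0.91},
    {"inputs": {"product": "checkout"}, "faithfulness": 0.64},
    {"inputs": {"product": "search"},   "faithfulness": 0.88},
]

PASS_THRESHOLD = 0.8  # assumed cutoff for a "pass"

def pass_fail_summary(records, group_by):
    """Group test cases by an input field and compute pass percentages."""
    groups = defaultdict(list)
    for r in records:
        groups[r["inputs"][group_by]].append(r["faithfulness"] >= PASS_THRESHOLD)
    return {key: 100.0 * sum(passes) / len(passes) for key, passes in groups.items()}

print(pass_fail_summary(results, group_by="product"))
# e.g. {'checkout': 50.0, 'search': 100.0}
```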
Regression Testing
Regression testing allows you to compare two experiments side-by-side to track improvements and catch regressions across different configurations of your application as you iterate.
For instance, suppose you've made changes to your prompt templates or are considering switching to an open-source model, and want to verify that the changes improve performance without introducing new issues. With regression testing, you can compare the results of your new prompts against a previous test run. Surfacing regressions is as simple as clicking the regressions button next to each evaluator in the summary.
It's easy to catch regressions across experiments.
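Conceptually, surfacing a regression amounts to joining two experiment runs on test case ID and flagging cases where an evaluator score dropped. Here is a rough sketch of that idea with made-up scores; it is not how HoneyHive computes regressions internally.

```python
# Illustrative evaluator scores from two experiment runs, keyed by test case ID.
# The data and the "regression = any score drop" rule are assumptions for the example.
baseline  = {"case-1": 0.92, "case-2": 0.71, "case-3": 0.85}
candidate = {"case-1": 0.94, "case-2": 0.55, "case-3": 0.85}

def find_regressions(old, new):
    """Return test cases whose evaluator score decreased between runs."""
    return {
        case_id: (old[case_id], new[case_id])
        for case_id in old.keys() & new.keys()
        if new[case_id] < old[case_id]
    }

print(find_regressions(baseline, candidate))
# {'case-2': (0.71, 0.55)}
```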
Comparing Output Diffs
To help you identify fine-grained differences between test runs, we've added the ability to compare outputs for each test case side-by-side with line-level and word-level diffs.
You can compare output diffs side-by-side to catch fine-grained errors.
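For intuition, word-level diffs of this kind can be produced with Python's standard difflib module. The snippet below is only a sketch of the underlying idea, with hypothetical outputs; it is not how the Evaluation Report computes its diffs.

```python
import difflib

# Two hypothetical model outputs for the same test case.
old_output = "The refund will be processed within 5 business days."
new_output = "The refund will be issued within 7 business days."

# Compare at the word level; '-' marks removed words, '+' marks added words.
diff = difflib.ndiff(old_output.split(), new_output.split())
print("\n".join(d for d in diff if d.startswith(("-", "+"))))
# - processed
# + issued
# - 5
# + 7
```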
Human Evaluations
In addition to automated evaluators, you can now invite domain experts on your team to provide human evaluation of your model's outputs directly within the Evaluation Report interface.
Domain experts can quickly traverse test case outputs using keyboard shortcuts and provide their feedback as numerical ratings or free-form comments. These human evaluation results are automatically incorporated into the Evaluation Summary as your experts work through the outputs. Developers and data scientists can then visualize this feedback alongside automated metrics in the built-in charting interface.
Domain experts can analyze and score outputs, and navigate across sessions using keyboard shortcuts.
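One way to picture the result is human ratings stored alongside automated evaluator scores for each test case and rolled up into the same summary. The record structure and field names below are purely illustrative, not the HoneyHive schema.

```python
# Illustrative records mixing an automated evaluator score with a human
# rating and a free-form comment for each test case.
records = [
    {"case": "case-1", "faithfulness": 0.91, "human_rating": 5, "comment": "Accurate and concise."},
    {"case": "case-2", "faithfulness": 0.64, "human_rating": 2, "comment": "Misses the refund policy."},
]

def average(metric):
    """Average a numeric metric across all test cases that have it."""
    values = [r[metric] for r in records if metric in r]
    return sum(values) / len(values)

print({"faithfulness": average("faithfulness"), "human_rating": average("human_rating")})
```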
GitHub Actions Integration
We now support integration with GitHub Actions, which allows developers to automate CI testing workflows and track progress against their main branch. This is useful for tracking improvements and catching regressions as you iterate, and for testing performance periodically to detect prompt and model drift.
Our customers are already using this feature to automate daily and weekly tests for their agents, set up nightly runs, and more. Over the coming months, we plan to introduce integrations with more CI providers.
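As an illustration of the CI pattern, a GitHub Actions step could run a script like the one below and fail the build when the pass rate drops below a threshold. The `run_offline_evaluation` placeholder and the 90% quality gate are assumptions for the example, not HoneyHive's actual SDK interface.

```python
import sys

# Placeholder: in a real workflow this would run your offline evaluation
# (for example via the HoneyHive SDK) and return per-test-case pass/fail results.
def run_offline_evaluation():
    return [True, True, False, True, True, True, True, True, True, True]

MIN_PASS_RATE = 0.9  # assumed quality gate for the main branch

def main():
    results = run_offline_evaluation()
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.0%}")
    # A non-zero exit code fails the GitHub Actions job, blocking the merge.
    if pass_rate < MIN_PASS_RATE:
        sys.exit(1)

if __name__ == "__main__":
    main()
```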
Get started today
We're excited to see how these new Offline Evaluations features accelerate your GenAI testing and development workflows. With these updates, we can confidently say HoneyHive offers the most comprehensive and powerful testing and evaluation framework currently on the market.
To learn more about HoneyHive and get a guided tour of our platform, please schedule a free 30-minute consultation with our team.