Evaluation Framework
ZenSearch includes a built-in evaluation framework for measuring search quality. It uses standardized retrieval metrics so you can benchmark performance, track improvements over time, and validate changes to your search configuration.
Overview
The evaluation framework lets you:
- Define test sets with queries and expected results
- Run evaluations against your live search pipeline
- Measure performance using standard information retrieval metrics
- Compare results across configuration changes
Metrics
NDCG@K (Normalized Discounted Cumulative Gain)
Measures ranking quality by accounting for the position of relevant results. Higher scores mean relevant documents appear closer to the top.
- NDCG@5: How good is the top-5 ranking?
- NDCG@10: How good is the top-10 ranking?
- Range: 0.0 (worst) to 1.0 (perfect ranking)
NDCG is the primary metric for evaluating search quality because it rewards correct ordering, not just inclusion.
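ZenSearch computes NDCG for you, but a minimal sketch of the underlying math helps when interpreting scores. Here, `relevances` is the list of relevance grades (0-3) of the returned documents, in ranked order:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each grade is discounted by log2 of its rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that already lists grades in descending order (e.g. `[3, 2, 1]`) scores 1.0; pushing a grade-3 document down the list lowers the score, which is why NDCG rewards ordering and not just inclusion.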
MRR (Mean Reciprocal Rank)
Measures how quickly the first relevant result appears. An MRR of 1.0 means the correct answer is always the first result.
- Best for: Single-answer queries ("What is our refund policy?")
- Range: 0.0 to 1.0
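The computation is straightforward: for each query, take the reciprocal of the rank of the first relevant result, then average across queries. A sketch, where each query's ranking is a list of 0/1 relevance flags in ranked order:

```python
def mrr(rankings):
    """Mean reciprocal rank over a list of per-query rankings (0/1 flags)."""
    total = 0.0
    for ranking in rankings:
        for i, rel in enumerate(ranking):
            if rel:
                total += 1.0 / (i + 1)  # reciprocal of the first relevant rank
                break                   # only the first relevant result counts
    return total / len(rankings)
```

If one query's answer is ranked first (1.0) and another's is ranked second (0.5), the MRR is 0.75.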
MAP (Mean Average Precision)
Measures precision at each relevant result position, then averages. Rewards systems that return relevant documents early and consistently.
- Best for: Multi-answer queries where several documents are relevant
- Range: 0.0 to 1.0
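A sketch of the calculation: for each query, take the precision at every position where a relevant document appears and average those values, then average across queries. (This version divides by the number of relevant documents retrieved, a common simplification that assumes all relevant documents appear somewhere in the ranking.)

```python
def average_precision(ranking):
    """AP for one query: mean of precision@i at each relevant position i."""
    hits = 0
    total = 0.0
    for i, rel in enumerate(ranking):
        if rel:
            hits += 1
            total += hits / (i + 1)  # precision at this relevant position
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: average of per-query AP scores."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```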
Precision@K
The fraction of top-K results that are relevant.
- Precision@5: Of the top 5 results, how many are relevant?
- Range: 0.0 to 1.0
Recall@K
The fraction of all relevant documents that appear in the top-K results.
- Recall@10: Of all relevant documents, how many are in the top 10?
- Range: 0.0 to 1.0
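Precision@K and Recall@K differ only in the denominator: precision divides by K, recall divides by the total number of relevant documents. A sketch, with `retrieved` as a ranked list of document IDs and `relevant` as the set of all relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```

If 2 of the top 5 results are relevant out of 3 relevant documents total, Precision@5 is 0.4 and Recall@5 is about 0.67.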
Test Sets
What Is a Test Set?
A test set is a collection of queries paired with their expected results. Each entry specifies:
| Field | Description |
|---|---|
| Query | The search query to evaluate |
| Expected Documents | Documents that should appear in results |
| Relevance Grades | How relevant each expected document is (0-3 scale) |
Relevance Scale
| Grade | Meaning |
|---|---|
| 3 | Perfectly relevant — directly answers the query |
| 2 | Highly relevant — contains substantial useful information |
| 1 | Partially relevant — contains some related content |
| 0 | Not relevant — should not appear in results |
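Putting the fields and grades together, a single test-set entry might look like the following sketch (the field names are illustrative, not ZenSearch's actual schema):

```python
# Hypothetical test-set entry; field names and paths are examples only.
entry = {
    "query": "What is our refund policy?",
    "expected_documents": [
        {"document": "policies/refunds.md", "grade": 3},        # directly answers
        {"document": "faq/billing.md", "grade": 2},             # substantial useful info
        {"document": "handbook/customer-care.md", "grade": 1},  # some related content
    ],
}
```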
Creating a Test Set
- Navigate to Settings > Evaluation
- Click Create Test Set
- Add query-document pairs with relevance grades
- Save the test set
Building Good Test Sets
- Cover common queries: Include the types of questions your users ask most
- Include edge cases: Queries with few results, ambiguous terms, or multiple intents
- Use real queries: Pull from search logs when possible
- Balance difficulty: Mix easy lookups with complex multi-document queries
- Update regularly: Add new entries as content and user needs evolve
Running Evaluations
On-Demand Evaluation
- Select a test set
- Choose the collections to evaluate against
- Click Run Evaluation
- View results when the evaluation completes
Evaluation Results
Results include:
- Aggregate metrics: Overall NDCG@K, MRR, MAP, precision, and recall
- Per-query breakdown: Individual scores for each test query
- Failure analysis: Queries that scored below a threshold
- Metric trends: How scores compare to previous evaluation runs
Interpreting Results
| Score Range | Quality |
|---|---|
| 0.8 - 1.0 | Excellent search quality |
| 0.6 - 0.8 | Good — most queries return relevant results |
| 0.4 - 0.6 | Fair — some queries need improvement |
| Below 0.4 | Poor — investigate pipeline configuration |
Use Cases
Validating Configuration Changes
When changing search settings (embedding models, reranking thresholds, query expansion), run an evaluation before and after:
- Run evaluation with current configuration
- Apply changes
- Run evaluation again
- Compare metrics to ensure improvement
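The comparison step can be sketched as a simple diff of the two metric snapshots. The metric values below are hypothetical, and the tolerance guards against reading noise as a real change:

```python
def compare_runs(before, after, tolerance=0.01):
    """Label each metric as improved, regressed, or unchanged between runs."""
    report = {}
    for name in before:
        delta = after[name] - before[name]
        if delta > tolerance:
            report[name] = "improved"
        elif delta < -tolerance:
            report[name] = "regressed"
        else:
            report[name] = "unchanged"
    return report

# Hypothetical aggregate metrics from two evaluation runs.
before = {"ndcg@10": 0.71, "mrr": 0.65, "map": 0.58}
after = {"ndcg@10": 0.74, "mrr": 0.63, "map": 0.61}
```

A mixed report (here, MRR regressed while NDCG and MAP improved) is a signal to check the per-query breakdown before shipping the change.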
Monitoring Over Time
Schedule periodic evaluations to catch regressions as content changes:
- New documents may shift relevance distributions
- Deleted content can invalidate test expectations
- Model updates may affect embedding quality
Benchmarking Collections
Compare search quality across different collections to identify areas that need more or better content.
Best Practices
- Start small: Begin with 20-50 query-document pairs covering core use cases
- Multiple reviewers: Have different team members grade relevance to reduce bias
- Separate test sets by domain: Use different test sets for engineering docs vs. sales content
- Track over time: Keep evaluation history to measure progress
- Automate: Run evaluations after significant content changes or configuration updates
Next Steps
- Advanced Search - Search pipeline configuration
- Agents - Agent-powered research
- API - Search API reference