Evaluation Framework
ZenSearch includes a built-in evaluation framework for measuring search quality. It uses standardized retrieval metrics so you can benchmark performance, track improvements over time, and validate changes to your search configuration.
Overview
The evaluation framework lets you:
- Define test sets with queries and expected results
- Run evaluations against your live search pipeline
- Measure performance using standard information retrieval metrics
- Compare results across configuration changes
Metrics
NDCG@K (Normalized Discounted Cumulative Gain)
Measures ranking quality by accounting for the position of relevant results. Higher scores mean relevant documents appear closer to the top.
- NDCG@5: How good is the top-5 ranking?
- NDCG@10: How good is the top-10 ranking?
- Range: 0.0 (worst) to 1.0 (perfect ranking)
NDCG is the primary metric for evaluating search quality because it rewards correct ordering, not just inclusion.
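ZenSearch computes NDCG for you, but a minimal sketch of the underlying math helps when interpreting scores. Here, `relevances` is the list of relevance grades (0-3) of the returned documents, in ranked order:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each grade is discounted by log2 of its rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that already lists grades in descending order (e.g. `[3, 2, 1]`) scores 1.0; pushing a grade-3 document down the list lowers the score, which is why NDCG rewards ordering and not just inclusion.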
MRR (Mean Reciprocal Rank)
Measures how quickly the first relevant result appears. An MRR of 1.0 means the correct answer is always the first result.
- Best for: Single-answer queries ("What is our refund policy?")
- Range: 0.0 to 1.0
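The computation is straightforward: for each query, take the reciprocal of the rank of the first relevant result, then average across queries. A sketch, where each query's ranking is a list of 0/1 relevance flags in ranked order:

```python
def mrr(rankings):
    """Mean reciprocal rank over a list of per-query rankings (0/1 flags)."""
    total = 0.0
    for ranking in rankings:
        for i, rel in enumerate(ranking):
            if rel:
                total += 1.0 / (i + 1)  # reciprocal of the first relevant rank
                break                   # only the first relevant result counts
    return total / len(rankings)
```

If one query's answer is ranked first (1.0) and another's is ranked second (0.5), the MRR is 0.75.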
MAP (Mean Average Precision)
Measures precision at each relevant result position, then averages. Rewards systems that return relevant documents early and consistently.
- Best for: Multi-answer queries where several documents are relevant
- Range: 0.0 to 1.0
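A sketch of the calculation: for each query, take the precision at every position where a relevant document appears and average those values, then average across queries. (This version divides by the number of relevant documents retrieved, a common simplification that assumes all relevant documents appear somewhere in the ranking.)

```python
def average_precision(ranking):
    """AP for one query: mean of precision@i at each relevant position i."""
    hits = 0
    total = 0.0
    for i, rel in enumerate(ranking):
        if rel:
            hits += 1
            total += hits / (i + 1)  # precision at this relevant position
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """MAP: average of per-query AP scores."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```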
Precision@K
The fraction of top-K results that are relevant.
- Precision@5: Of the top 5 results, how many are relevant?
- Range: 0.0 to 1.0
Recall@K
The fraction of all relevant documents that appear in the top-K results.
- Recall@10: Of all relevant documents, how many are in the top 10?
- Range: 0.0 to 1.0
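Precision@K and Recall@K differ only in the denominator: precision divides by K, recall divides by the total number of relevant documents. A sketch, with `retrieved` as a ranked list of document IDs and `relevant` as the set of all relevant IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k results."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
```

If 2 of the top 5 results are relevant out of 3 relevant documents total, Precision@5 is 0.4 and Recall@5 is about 0.67.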
Test Sets
What Is a Test Set?
A test set is a collection of queries paired with their expected results. Each entry specifies:
| Field | Description |
|---|---|
| Query | The search query to evaluate |
| Expected Documents | Documents that should appear in results |
| Relevance Grades | How relevant each expected document is (0-3 scale) |
Relevance Scale
| Grade | Meaning |
|---|---|
| 3 | Perfectly relevant — directly answers the query |
| 2 | Highly relevant — contains substantial useful information |
| 1 | Partially relevant — contains some related content |
| 0 | Not relevant — should not appear in results |
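Putting the fields and grades together, a single test-set entry might look like the following sketch (the field names are illustrative, not ZenSearch's actual schema):

```python
# Hypothetical test-set entry; field names and paths are examples only.
entry = {
    "query": "What is our refund policy?",
    "expected_documents": [
        {"document": "policies/refunds.md", "grade": 3},        # directly answers
        {"document": "faq/billing.md", "grade": 2},             # substantial useful info
        {"document": "handbook/customer-care.md", "grade": 1},  # some related content
    ],
}
```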
Creating a Test Set
- Navigate to Settings > Evaluation
- Click Create Test Set
- Add query-document pairs with relevance grades
- Save the test set
Building Good Test Sets
- Cover common queries: Include the types of questions your users ask most
- Include edge cases: Queries with few results, ambiguous terms, or multiple intents
- Use real queries: Pull from search logs when possible
- Balance difficulty: Mix easy lookups with complex multi-document queries
- Update regularly: Add new entries as content and user needs evolve
Running Evaluations
On-Demand Evaluation
- Select a test set
- Choose the collections to evaluate against
- Click Run Evaluation
- View results when the evaluation completes
Evaluation Results
Results include:
- Aggregate metrics: Overall NDCG@K, MRR, MAP, precision, and recall
- Per-query breakdown: Individual scores for each test query
- Failure analysis: Queries that scored below a threshold
- Metric trends: How scores compare to previous evaluation runs
Interpreting Results
| Score Range | Quality |
|---|---|
| 0.8 - 1.0 | Excellent search quality |
| 0.6 - 0.8 | Good — most queries return relevant results |
| 0.4 - 0.6 | Fair — some queries need improvement |
| Below 0.4 | Poor — investigate pipeline configuration |
Use Cases
Validating Configuration Changes
When changing search settings (embedding models, reranking thresholds, query expansion), run an evaluation before and after:
- Run evaluation with current configuration
- Apply changes
- Run evaluation again
- Compare metrics to ensure improvement
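The comparison step can be sketched as a simple diff of the two metric snapshots. The metric values below are hypothetical, and the tolerance guards against reading noise as a real change:

```python
def compare_runs(before, after, tolerance=0.01):
    """Label each metric as improved, regressed, or unchanged between runs."""
    report = {}
    for name in before:
        delta = after[name] - before[name]
        if delta > tolerance:
            report[name] = "improved"
        elif delta < -tolerance:
            report[name] = "regressed"
        else:
            report[name] = "unchanged"
    return report

# Hypothetical aggregate metrics from two evaluation runs.
before = {"ndcg@10": 0.71, "mrr": 0.65, "map": 0.58}
after = {"ndcg@10": 0.74, "mrr": 0.63, "map": 0.61}
```

A mixed report (here, MRR regressed while NDCG and MAP improved) is a signal to check the per-query breakdown before shipping the change.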
Monitoring Over Time
Schedule periodic evaluations to catch regressions as content changes:
- New documents may shift relevance distributions
- Deleted content can invalidate test expectations
- Model updates may affect embedding quality
Benchmarking Collections
Compare search quality across different collections to identify areas that need more or better content.
Best Practices
- Start small: Begin with 20-50 query-document pairs covering core use cases
- Multiple reviewers: Have different team members grade relevance to reduce bias
- Separate test sets by domain: Use different test sets for engineering docs vs. sales content
- Track over time: Keep evaluation history to measure progress
- Automate: Run evaluations after significant content changes or configuration updates
Next Steps
- Advanced Search - Search pipeline configuration
- Agents - Agent-powered research
- API - Search API reference