Evaluation Framework

ZenSearch includes a built-in evaluation framework for measuring search quality. Use standardized retrieval metrics to benchmark performance, track improvements over time, and validate changes to your search configuration.

Overview

The evaluation framework lets you:

  • Define test sets with queries and expected results
  • Run evaluations against your live search pipeline
  • Measure performance using standard information retrieval metrics
  • Compare results across configuration changes

Metrics

NDCG@K (Normalized Discounted Cumulative Gain)

Measures ranking quality by accounting for the position of relevant results. Higher scores mean relevant documents appear closer to the top.

  • NDCG@5: How good is the top-5 ranking?
  • NDCG@10: How good is the top-10 ranking?
  • Range: 0.0 (worst) to 1.0 (perfect ranking)

NDCG is the primary metric for evaluating search quality because it rewards correct ordering, not just inclusion.
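
To make the discounting concrete, here is a minimal sketch of how NDCG@K is computed from a list of graded relevances (0-3) in ranked order. ZenSearch computes this for you; the code is illustrative only.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each grade is discounted by log2 of its position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the returned documents, in ranked order:
ranking = [3, 2, 0, 1, 0]
print(round(ndcg_at_k(ranking, 5), 3))  # → 0.985
```

The score is just below 1.0 because the grade-1 document at position 4 would ideally sit at position 3.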

MRR (Mean Reciprocal Rank)

Measures how quickly the first relevant result appears. An MRR of 1.0 means the correct answer is always the first result.

  • Best for: Single-answer queries ("What is our refund policy?")
  • Range: 0.0 to 1.0
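
A minimal sketch of the computation: each query contributes the reciprocal of the rank of its first relevant result, and queries with no relevant result contribute 0.

```python
def mrr(first_relevant_ranks):
    """Mean reciprocal rank. Each entry is the 1-based rank of the first
    relevant result for a query, or None if no relevant result was returned."""
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)

# Three queries: hit at rank 1, hit at rank 2, no relevant result:
print(mrr([1, 2, None]))  # (1 + 1/2 + 0) / 3 = 0.5
```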

MAP (Mean Average Precision)

Computes precision at the position of each relevant result, averages those values within a query, then averages across queries. Rewards systems that return relevant documents early and consistently.

  • Best for: Multi-answer queries where several documents are relevant
  • Range: 0.0 to 1.0
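
A sketch of the computation. Note that average precision is normalized by the total number of relevant documents for the query, so relevant documents that are never retrieved pull the score down.

```python
def average_precision(relevant_flags, total_relevant):
    """AP for one query: mean of precision@i taken at each relevant position,
    divided by the total number of relevant documents (retrieved or not)."""
    hits = 0
    precision_sum = 0.0
    for i, is_rel in enumerate(relevant_flags, start=1):
        if is_rel:
            hits += 1
            precision_sum += hits / i
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(queries):
    """`queries` is a list of (relevant_flags, total_relevant) pairs."""
    return sum(average_precision(f, n) for f, n in queries) / len(queries)

queries = [
    ([True, False, True], 2),  # AP = (1/1 + 2/3) / 2 ≈ 0.833
    ([False, True], 2),        # one relevant doc never retrieved: AP = (1/2) / 2 = 0.25
]
print(round(mean_average_precision(queries), 3))  # → 0.542
```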

Precision@K

The fraction of top-K results that are relevant.

  • Precision@5: Of the top 5 results, how many are relevant?
  • Range: 0.0 to 1.0

Recall@K

The fraction of all relevant documents that appear in the top-K results.

  • Recall@10: Of all relevant documents, how many are in the top 10?
  • Range: 0.0 to 1.0
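
Both metrics follow directly from set membership in the top-K results. A minimal sketch with made-up document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    top_k = retrieved[:k]
    return sum(1 for doc in relevant if doc in top_k) / len(relevant)

retrieved = ["d1", "d7", "d3", "d9", "d2"]   # ranked result list
relevant = {"d1", "d2", "d5"}                # ground-truth relevant set

print(precision_at_k(retrieved, relevant, 5))       # 2 of 5 results are relevant: 0.4
print(round(recall_at_k(retrieved, relevant, 5), 3))  # 2 of 3 relevant docs found: 0.667
```

Precision and recall trade off against each other: widening K tends to raise recall and lower precision.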

Test Sets

What Is a Test Set?

A test set is a collection of queries paired with their expected results. Each entry specifies:

Field               Description
------------------  --------------------------------------------------
Query               The search query to evaluate
Expected Documents  Documents that should appear in results
Relevance Grades    How relevant each expected document is (0-3 scale)
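
In code, a test set entry pairs a query with graded expected documents. The structure below is illustrative only (field names are assumptions, not ZenSearch's storage format), with a small check that grades stay on the 0-3 scale:

```python
# Hypothetical in-code representation of a test set; ZenSearch's actual
# storage format may differ.
test_set = {
    "name": "support-core-queries",
    "entries": [
        {
            "query": "What is our refund policy?",
            # expected document ID -> relevance grade (0-3)
            "expected": {"doc-refund-policy": 3, "doc-billing-faq": 1},
        },
        {
            "query": "reset two-factor authentication",
            "expected": {"doc-2fa-setup": 2, "doc-account-security": 2},
        },
    ],
}

def validate(test_set):
    """Check that entries are well-formed and grades use the 0-3 scale."""
    for entry in test_set["entries"]:
        assert entry["query"].strip(), "query must be non-empty"
        for doc, grade in entry["expected"].items():
            assert grade in (0, 1, 2, 3), f"bad grade {grade} for {doc}"

validate(test_set)
```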

Relevance Scale

Grade  Meaning
-----  --------------------------------------------------------
3      Perfectly relevant: directly answers the query
2      Highly relevant: contains substantial useful information
1      Partially relevant: contains some related content
0      Not relevant: should not appear in results

Creating a Test Set

  1. Navigate to Settings > Evaluation
  2. Click Create Test Set
  3. Add query-document pairs with relevance grades
  4. Save the test set

Building Good Test Sets

  • Cover common queries: Include the types of questions your users ask most
  • Include edge cases: Queries with few results, ambiguous terms, or multiple intents
  • Use real queries: Pull from search logs when possible
  • Balance difficulty: Mix easy lookups with complex multi-document queries
  • Update regularly: Add new entries as content and user needs evolve

Running Evaluations

On-Demand Evaluation

  1. Select a test set
  2. Choose the collections to evaluate against
  3. Click Run Evaluation
  4. View results when the evaluation completes

Evaluation Results

Results include:

  • Aggregate metrics: Overall NDCG@K, MRR, MAP, precision, and recall
  • Per-query breakdown: Individual scores for each test query
  • Failure analysis: Queries that scored below a threshold
  • Metric trends: How scores compare to previous evaluation runs
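
The failure analysis is easy to reproduce offline if you export per-query scores. The dict shape here is an assumption for illustration, not ZenSearch's export format:

```python
def failing_queries(per_query_scores, threshold=0.5):
    """Return (query, score) pairs below the threshold, worst first."""
    failures = [(q, s) for q, s in per_query_scores.items() if s < threshold]
    return sorted(failures, key=lambda pair: pair[1])

scores = {
    "refund policy": 0.91,
    "sso login error": 0.34,
    "api rate limits": 0.48,
}
print(failing_queries(scores))  # [('sso login error', 0.34), ('api rate limits', 0.48)]
```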

Interpreting Results

Score Range  Quality
-----------  ------------------------------------------
0.8 - 1.0    Excellent search quality
0.6 - 0.8    Good: most queries return relevant results
0.4 - 0.6    Fair: some queries need improvement
Below 0.4    Poor: investigate pipeline configuration
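
The bands reduce to a simple threshold function. Placing the boundary values (exactly 0.8, 0.6, 0.4) in the higher band is an assumption, since the ranges in the table overlap:

```python
def quality_band(score):
    """Map an aggregate metric score (0.0-1.0) to a quality band."""
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Fair"
    return "Poor"

print(quality_band(0.72))  # Good
```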

Use Cases

Validating Configuration Changes

Before and after changing search settings (embedding models, reranking thresholds, query expansion):

  1. Run evaluation with current configuration
  2. Apply changes
  3. Run evaluation again
  4. Compare metrics to ensure improvement
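
The comparison in steps 1-4 amounts to diffing two metric snapshots. The metric names and dict shape below are illustrative, not ZenSearch's API:

```python
def compare_runs(before, after, metrics=("ndcg@10", "mrr", "map")):
    """Report the per-metric delta between two evaluation runs.
    `before` and `after` map metric names to aggregate scores."""
    return {m: round(after[m] - before[m], 3) for m in metrics}

before = {"ndcg@10": 0.62, "mrr": 0.70, "map": 0.58}
after  = {"ndcg@10": 0.68, "mrr": 0.71, "map": 0.61}
print(compare_runs(before, after))  # {'ndcg@10': 0.06, 'mrr': 0.01, 'map': 0.03}
```

Positive deltas across the board suggest the change helped; a mixed result (e.g. NDCG up but MRR down) is worth inspecting in the per-query breakdown before shipping.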

Monitoring Over Time

Schedule periodic evaluations to catch regressions as content changes:

  • New documents may shift relevance distributions
  • Deleted content can invalidate test expectations
  • Model updates may affect embedding quality

Benchmarking Collections

Compare search quality across different collections to identify areas that need more or better content.

Best Practices

  1. Start small: Begin with 20-50 query-document pairs covering core use cases
  2. Multiple reviewers: Have different team members grade relevance to reduce bias
  3. Separate test sets by domain: Use different test sets for engineering docs vs. sales content
  4. Track over time: Keep evaluation history to measure progress
  5. Automate: Run evaluations after significant content changes or configuration updates

Next Steps