Evaluator-Optimizer
The Evaluator-Optimizer pattern allows an LLM to improve the quality of its output by optimizing a previous generation based on feedback from an evaluator. This evaluate-optimize loop can run multiple times.
Implementation
This pattern consists of five subtasks. Each subtask is implemented using the Task Execution pattern. Only the task that generates the initial result is required; the other four subtasks are optional.
- A task to initialize the input for generation.
- A task to generate the initial result.
- A task to evaluate the result and provide feedback.
- A task to optimize a previous result based on the feedback.
- A task to finalize the response.
The flow chart below shows the basic steps.
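In code, the loop can be expressed as the minimal sketch below. The Generator, Evaluator, and Optimizer interfaces, the EvaluationOutcome record, and the maxEvaluations parameter are illustrative assumptions for this sketch, not the actual API of any library.

// A minimal sketch of the evaluator-optimizer loop.
// Generator, Evaluator, Optimizer, and EvaluationOutcome are
// hypothetical types used for illustration only.
interface Generator { String generate(String input); }
interface Evaluator { EvaluationOutcome evaluate(String result); }
interface Optimizer { String optimize(String result, String feedback); }

record EvaluationOutcome(boolean passed, String feedback) {}

class EvaluatorOptimizerLoop {

  static String run(String input, Generator generator, Evaluator evaluator,
      Optimizer optimizer, int maxEvaluations) {
    // Required subtask: generate the initial result.
    String result = generator.generate(input);
    for (int i = 0; i < maxEvaluations; i++) {
      EvaluationOutcome outcome = evaluator.evaluate(result);
      if (outcome.passed()) {
        return result; // The result passed the evaluation.
      }
      // Optimize the previous result based on the feedback.
      result = optimizer.optimize(result, outcome.feedback());
    }
    // Maximum number of evaluations reached: return the last generation,
    // even though it did not pass.
    return result;
  }
}

Note that Generator, Evaluator, and Optimizer can be backed by different models, as recommended below.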
Guides
Use Different Models
It's recommended to use different models for generation and evaluation to get better results. For example, we can use GPT-4o for generation and DeepSeek V3 for evaluation, or vice versa.
Max Number of Evaluations
It's recommended to set a limit on the maximum number of evaluations. After the maximum number of evaluations has been reached, the result of the last generation is returned, even if it doesn't pass the evaluation. This reduces the response time and cuts down costs.
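With the sketch above, the cap is simply the loop bound. A call might look like this, assuming the three components have been created as shown earlier; the value 3 is an arbitrary example:

// Stop after at most 3 evaluations; the last generation is returned
// even if it never passes.
String finalCode = EvaluatorOptimizerLoop.run(
    "quick sort", generator, evaluator, optimizer, 3);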
Evaluation Results
The evaluation result can be a boolean type with only the values yes and no, or a numeric type with a score from 0 to 100. Numeric scores are more flexible, because we can set different thresholds to mark a result as passed or not passed. When combined with the Parallelization Workflow pattern, we can run multiple evaluations in parallel and use the average score as the final score.
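A numeric variant might look like the sketch below. The NumericEvaluationResult record, the threshold parameter, and the averaging helper are assumptions for illustration, not part of the library.

import java.util.List;

/** A hypothetical numeric evaluation result with a score from 0 to 100. */
record NumericEvaluationResult(int score, String feedback) {

  /** A configurable threshold decides whether a score counts as passed. */
  boolean passed(int threshold) {
    return score >= threshold;
  }

  /** Average the scores of several parallel evaluations into a final score. */
  static double averageScore(List<NumericEvaluationResult> results) {
    return results.stream()
        .mapToInt(NumericEvaluationResult::score)
        .average()
        .orElse(0);
  }
}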
Example
Let's see an example: code generation with multiple rounds of review.
Generate Initial Result
For generating the initial result, the prompt template contains only the task input. The template variable input represents the task input. Java code will be generated.
Goal: Generate Java code to complete the task based on the input.
{input}
The output of this task is the generated code.
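Rendering this template can be as simple as substituting the variable into the prompt text. The helper below is an illustrative sketch, not the library's actual template engine:

/** Illustrative rendering of the generation prompt template. */
class GenerationPrompt {

  static String render(String input) {
    String template = """
        Goal: Generate Java code to complete the task based on the input.
        {input}""";
    // Replace the template variable with the actual task input.
    return template.replace("{input}", input);
  }
}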
Evaluate
After generating the initial result, we can evaluate the generated code. The template variable code is the code to evaluate. The template lists several rules for evaluating the code.
Goal: Evaluate this code implementation for correctness, time complexity, and best practices.
Requirements:
- Ensure the code has proper Javadoc documentation.
- Ensure the code follows good naming conventions.
- The code passes if all criteria are met with no improvements needed.
{code}
Below is the type of the evaluation result. Here a boolean field passed is used; feedback is the feedback provided by the evaluator when the evaluation does not pass.
package com.javaaidev.agenticpatterns.evaluatoroptimizer;

import com.fasterxml.jackson.annotation.JsonPropertyDescription;
import org.jspecify.annotations.Nullable;

/**
 * Boolean evaluation result
 *
 * @param passed Passed or not passed
 * @param feedback Feedback if not passed
 */
public record BooleanEvaluationResult(
    @JsonPropertyDescription("If evaluation passed") boolean passed,
    @JsonPropertyDescription("Feedback of evaluation") @Nullable String feedback)
    implements EvaluationResult {

  @Override
  @Nullable
  public String getFeedback() {
    return feedback();
  }
}
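BooleanEvaluationResult implements an EvaluationResult interface whose definition is not shown above. Inferred from the getFeedback() override, it might look like the following sketch; the actual definition may differ:

package com.javaaidev.agenticpatterns.evaluatoroptimizer;

import org.jspecify.annotations.Nullable;

/**
 * A sketch of the EvaluationResult interface, inferred from the
 * getFeedback() override above; the actual definition may differ.
 */
public interface EvaluationResult {

  @Nullable
  String getFeedback();
}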
Optimize
If the evaluation does not pass, the code can be optimized based on the evaluation feedback. The template variable code represents the code to optimize, while feedback represents the feedback from the evaluator.
Goal: Improve existing code based on feedback.
Requirements:
- Address all concerns in the feedback.
Code:
{code}
Feedback:
{feedback}
The output is the optimized code. The optimized code will be evaluated again until it passes the evaluation or the maximum number of evaluations is reached.
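As with the generation prompt, the optimization prompt can be rendered by substituting both variables; again an illustrative sketch rather than the library's actual API:

/** Illustrative rendering of the optimization prompt template. */
class OptimizationPrompt {

  static String render(String code, String feedback) {
    String template = """
        Goal: Improve existing code based on feedback.
        Requirements:
        - Address all concerns in the feedback.
        Code:
        {code}
        Feedback:
        {feedback}""";
    // Fill in both template variables.
    return template.replace("{code}", code).replace("{feedback}", feedback);
  }
}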
Example Result
The picture below shows the trace of one execution of this agent. The trace name agent.execute means an execution of an agent. The outer trace represents the execution of the code generation agent. There are four traces nested in the outer trace. In order of execution, these four traces are:
- Generation of the initial result.
- First evaluation, not passed.
- First optimization.
- Second evaluation, passed.
The result that passes the evaluation becomes the final result. The screenshot below shows the result of using Swagger UI to test the agent. The input is "quick sort"; the output is the generated Java code.
Reference Implementation
See this page for the reference implementation and examples.