Evaluator-Optimizer

The Evaluator-Optimizer pattern allows an LLM to improve the quality of its output by refining a previous generation based on feedback from an evaluator. This evaluate-optimize loop can run multiple times.

Implementation

This pattern consists of three subtasks. Each subtask is implemented using the Task Execution pattern.

  • A task to generate the initial result.
  • A task to evaluate the result and provide feedback.
  • A task to optimize a previous result based on the feedback.

The basic flow is: generate the initial result → evaluate → if not passed, optimize → evaluate again, until the result passes or the iteration limit is reached.
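
Before looking at each subtask in detail, here is a minimal sketch of the whole loop in Java. The method names and the MAX_ITERATIONS constant are illustrative placeholders, not the reference implementation's API; the Evaluation record mirrors the one defined in the example below.

Evaluator-optimizer loop (sketch)
public class EvaluatorOptimizerLoop {

  private static final int MAX_ITERATIONS = 3; // see "Max Iterations" below

  public String run(String input) {
    String result = generate(input);                    // subtask 1: initial result
    for (int i = 0; i < MAX_ITERATIONS; i++) {
      Evaluation evaluation = evaluate(result);         // subtask 2: evaluate
      if (evaluation.passed()) {
        return result;                                  // good enough, stop early
      }
      result = optimize(result, evaluation.feedback()); // subtask 3: optimize
    }
    return result; // best effort after hitting the iteration limit
  }

  // Each of these would be implemented with the Task Execution pattern,
  // typically as one LLM call using the prompt templates shown below.
  private String generate(String input) { return "..."; }
  private Evaluation evaluate(String result) { return new Evaluation(true, null); }
  private String optimize(String result, String feedback) { return result; }

  record Evaluation(boolean passed, String feedback) {}
}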

Guides

Use Different Models

It's recommended to use different models for generation and evaluation, since an independent evaluator is less likely to share the generator's blind spots. For example, we can use GPT-4o for generation and DeepSeek V3 for evaluation, or vice versa.
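
As a sketch, the two roles can simply be two different model clients injected into the workflow; the LlmClient interface and class name below are hypothetical abstractions, not a real library API.

Two models, two roles (sketch)
// Hypothetical abstraction over a chat model call; any concrete LLM client
// library could sit behind it.
interface LlmClient {
  String complete(String prompt);
}

class CodeReviewWorkflow {
  private final LlmClient generator; // e.g. backed by GPT-4o
  private final LlmClient evaluator; // e.g. backed by DeepSeek V3

  CodeReviewWorkflow(LlmClient generator, LlmClient evaluator) {
    this.generator = generator;
    this.evaluator = evaluator;
  }
}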

Max Iterations

It's recommended to set a limit on the maximum number of iterations, as in the loop sketch above. This reduces the response time and cuts down the costs.

Evaluation Results

The evaluation result can be a boolean with only yes/no values, or a numeric score from 0 to 100. Numeric scores are more flexible, since different thresholds can be set to mark the result as passed or not passed. When combined with the Parallelization Workflow pattern, multiple evaluations can run in parallel and the average score can be used as the final score.
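
As a sketch of the numeric variant, the result type and the score aggregation could look like the following; ScoredEvaluation, ScoreAggregation, and PASS_THRESHOLD are illustrative names, not part of the reference implementation.

Numeric evaluation result (sketch)
import java.util.List;

// Numeric counterpart of the boolean Evaluation record shown later.
record ScoredEvaluation(int score, String feedback) {}

class ScoreAggregation {

  static final int PASS_THRESHOLD = 80; // tune per task

  // With the Parallelization Workflow pattern, several evaluations run in
  // parallel; average their scores to decide pass/fail.
  static boolean passed(List<ScoredEvaluation> evaluations) {
    double average = evaluations.stream()
        .mapToInt(ScoredEvaluation::score)
        .average()
        .orElse(0);
    return average >= PASS_THRESHOLD;
  }
}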

Example

Let's walk through an example: code generation with multiple rounds of review.

Generate Initial Result

For generating the initial result, the prompt template contains only the task input. The template variable input represents the task input. Java code will be generated.

Prompt template for generating initial result
Goal: Generate Java code to complete the task based on the input.

{input}

The output of this task is the generated code.
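
As a sketch, filling this template is plain string substitution; the class name and the sample task text below are made up for illustration.

Building the generation prompt (sketch)
public class GenerationPrompt {

  private static final String TEMPLATE = """
      Goal: Generate Java code to complete the task based on the input.

      {input}
      """;

  public static void main(String[] args) {
    String prompt = TEMPLATE.replace("{input}",
        "Implement a method that reverses a string.");
    System.out.println(prompt);
    // The filled prompt is sent to the generator model; its reply is the code.
  }
}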

Evaluate

After generating the initial result, we can evaluate the generated code. The template variable code is the code to evaluate. The template lists several rules to evaluate the code.

Prompt template for evaluation
Goal: Evaluate this code implementation for correctness, time complexity, and best practices.

Requirements:
- Ensure the code has proper Javadoc documentation.
- Ensure the code follows good naming conventions.
- The code passes if all criteria are met with no improvements needed.

{code}

Below is the type of the evaluation result; here a boolean result is used.

Evaluation result
package com.javaaidev.agenticpatterns.evaluatoroptimizer;

import org.jspecify.annotations.Nullable;

/**
* Evaluation result
*
* @param passed Passed or not passed
* @param feedback Feedback if not passed
*/
public record Evaluation(boolean passed, @Nullable String feedback) {

}
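
In the loop, the evaluator's structured output is mapped to this record, and the passed flag drives the control flow. A minimal sketch follows, where evaluateCode is a hypothetical stand-in for the evaluator subtask.

Using the evaluation result (sketch)
import com.javaaidev.agenticpatterns.evaluatoroptimizer.Evaluation;

public class EvaluateStep {

  // Placeholder: a real implementation sends the evaluation prompt to the
  // evaluator model and maps its structured reply to the Evaluation record.
  static Evaluation evaluateCode(String code) {
    return new Evaluation(false, "Add Javadoc to the public method.");
  }

  static boolean needsOptimization(String code) {
    Evaluation evaluation = evaluateCode(code);
    // The feedback, if any, feeds the optimization prompt in the next step.
    return !evaluation.passed();
  }
}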

Optimize

If the evaluation result is not passed, the code can be optimized based on the evaluation feedback. The template variable code represents the code to optimize, while feedback represents the feedback from the evaluator.

Prompt template for optimization
Goal: Improve existing code based on feedback.

Requirements:
- Address all concerns in the feedback.

Code:
{code}

Feedback:
{feedback}

The output is the optimized code. The optimized code will be evaluated again until it passes the evaluation or the maximum number of iterations is reached.
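
As with the other prompts, this template can be filled by simple variable substitution. A minimal sketch, with an illustrative class name:

Building the optimization prompt (sketch)
public class OptimizationPrompt {

  private static final String TEMPLATE = """
      Goal: Improve existing code based on feedback.

      Requirements:
      - Address all concerns in the feedback.

      Code:
      {code}

      Feedback:
      {feedback}
      """;

  // The model's reply to this prompt becomes the next candidate to evaluate.
  static String build(String code, String feedback) {
    return TEMPLATE.replace("{code}", code).replace("{feedback}", feedback);
  }
}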

Reference Implementation

See this page for the reference implementation and examples.