Measure, score, and systematically improve your prompts — because you can't improve what you can't measure.
Many teams rely on "gut feel" to judge prompt quality: "That feels right" or "I tried a few versions and this one seemed best." This is anecdotal and unreliable. In production, you need systematic evaluation.
Testing on 5 examples doesn't tell you if your prompt works on 100,000. A prompt that "feels good" might actually be brittle. You need data and metrics, not intuition.
Measure improvements: know whether a change actually helped.
Catch regressions: detect when an update breaks previously working cases.
Compare versions: choose objectively between prompt variants.
Track trends: see whether quality improves or degrades over time.
Build confidence: deploy to production knowing your prompt has been validated.
Create a test dataset (20-50+ examples) with expected outputs. Run your prompt against all test cases. Score the results automatically or with human eval. Track scores over time as you iterate.
There are three main approaches to evaluating prompts — human evaluation, automated metrics, and LLM-as-judge — plus a hybrid of them. Each has trade-offs.
Human evaluation. Pros: most accurate, catches nuance. Cons: expensive, slow, doesn't scale to thousands of examples.
Automated metrics. Pros: fast, cheap, reproducible. Cons: can miss subtle issues, requires good metric selection.
LLM-as-judge. Pros: understands nuance, scales well. Cons: costs API calls, can be biased by the judge model's behavior.
Hybrid. Pros: combines the strengths of the others. Cons: most complex to set up.
An analogy, with a restaurant kitchen:
Human eval: a food critic tastes the dish.
Automated metrics: check whether it's hot, the pH is neutral, etc.
LLM-as-judge: another chef tastes it and scores it.
Hybrid: metrics, plus one critic for edge cases.
The foundation of good evaluation is a good test dataset. This takes time to create but pays off.
Structure: Input → Expected Output
For each test case, you need an input (prompt) and the expected or ideal output. Store as JSON for easy programmatic access.
[
  {
    "id": "1",
    "input": "Summarize: The Great Gatsby is a novel...",
    "expected_output": "A 1920s-set novel about wealth, love, and the American Dream",
    "category": "literature"
  },
  {
    "id": "2",
    "input": "Translate to Spanish: Hello, how are you?",
    "expected_output": "Hola, ¿cómo estás?",
    "category": "translation"
  },
  {
    "id": "3",
    "input": "Fix this code: def add(a, b return a + b",
    "expected_output": "def add(a, b):\n    return a + b",
    "category": "code"
  }
]
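Loading and sanity-checking a dataset in this shape takes only a few lines of standard-library Python. This is a sketch: the required keys match the example above, and in practice the JSON text would come from a file rather than an inline string.

```python
import json

REQUIRED_KEYS = {"id", "input", "expected_output", "category"}

def load_cases(json_text):
    """Parse test cases from JSON text and check each entry has the required keys."""
    cases = json.loads(json_text)
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"test case {case.get('id', '?')} missing keys: {missing}")
    return cases

# In practice: load_cases(open("test_cases.json", encoding="utf-8").read())
sample = '''[
  {"id": "1", "input": "Translate to Spanish: Hello",
   "expected_output": "Hola", "category": "translation"}
]'''
cases = load_cases(sample)
print(f"Loaded {len(cases)} test cases")
```

Validating keys up front means a malformed entry fails loudly at load time instead of silently skewing scores mid-run.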
Dataset Size Recommendations
Start with 20-50 examples — enough to begin iterating — and grow the set as you discover new failure modes. Don't just test happy paths. Include edge cases: empty inputs, very long inputs, ambiguous requests, requests in different languages, etc. This catches bugs before they hit production.
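For example, a few hypothetical edge-case entries in the same JSON shape as the dataset above (the expected outputs here are illustrative assumptions, not canonical answers):

```json
[
  {
    "id": "edge-1",
    "input": "Summarize: ",
    "expected_output": "Please provide the text you would like summarized.",
    "category": "edge-empty-input"
  },
  {
    "id": "edge-2",
    "input": "Translate to Spanish: Hello, how are you? Merci beaucoup!",
    "expected_output": "Hola, ¿cómo estás? ¡Muchas gracias!",
    "category": "edge-mixed-language"
  }
]
```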
Use Claude itself to judge the quality of outputs. It's surprisingly effective and understands nuance that simple metrics miss.
The Judge Prompt
Write a system prompt that describes what good output looks like. Give it the input and output to evaluate. Ask it to score and explain.
import anthropic
import json

judge_system = """
You are an expert evaluator of code quality. Score code on:
1. Correctness (does it work?)
2. Readability (is it clear?)
3. Efficiency (is it optimal?)

Return a JSON object with:
- score (1-10)
- reasoning (brief explanation)
- issues (list of problems, or empty)
"""

client = anthropic.Anthropic()

def evaluate_code(code_to_review):
    judge_input = f"""
Evaluate this code:
```python
{code_to_review}
```
Return JSON with score, reasoning, and issues."""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        system=judge_system,
        messages=[{"role": "user", "content": judge_input}],
    )

    # Parse the JSON response
    result_text = response.content[0].text
    try:
        result = json.loads(result_text)
        return result["score"]
    except (json.JSONDecodeError, KeyError):
        print("Failed to parse judge response")
        return None
The judge model can have bias. Different models might score differently. To mitigate: use the same judge model for all evaluations, be explicit in the judge prompt, double-check with human validation on a subset.
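Another way to dampen judge noise is to sample the judge more than once and aggregate. A minimal sketch — `judge_with_retries` is a hypothetical helper, not part of the Anthropic SDK; `judge_fn` stands in for any wrapper around a judge call like `evaluate_code` above:

```python
import statistics

def judge_with_retries(judge_fn, output, runs=3):
    """Call a scoring function several times and return the median score.

    `judge_fn` maps an output string to a numeric score. Taking the median
    of several runs makes a single outlier judgment less influential than
    a plain average would.
    """
    scores = [judge_fn(output) for _ in range(runs)]
    return statistics.median(scores)

# Illustration with a deterministic stand-in judge:
fake_scores = iter([7, 8, 3])
score = judge_with_retries(lambda _: next(fake_scores), "some output")
print(score)  # 7 — the median of [7, 8, 3]
```

The trade-off is cost: three judge calls per test case triples the evaluation bill, so this is best reserved for tasks where judge variance is actually observed.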
For certain tasks, automated metrics work well. They're fast and deterministic. Choose the right metric for your task.
Exact match: output == expected? Binary yes/no. Good for: translation, structured output. Bad for: open-ended tasks.
BLEU: n-gram overlap with a reference. Standard for translation. Range: 0-100; higher is better.
ROUGE: recall-oriented overlap with reference summaries. Good for: summarization.
Cosine similarity: embedding-based similarity. Captures semantic meaning. Range: 0-1. Good for: open-ended text.
from difflib import SequenceMatcher

# Exact match
def exact_match(output, expected):
    return 1.0 if output.strip() == expected.strip() else 0.0

# Sequence similarity (for closeness)
def similarity_score(output, expected):
    return SequenceMatcher(None, output, expected).ratio()

# Usage:
output = "The Great Gatsby is about wealth and love in the 1920s"
expected = "A novel exploring American Dream and 1920s wealth"
score1 = exact_match(output, expected)       # 0.0
score2 = similarity_score(output, expected)  # 0.42
Use exact match for deterministic tasks (code, translation, math). Use ROUGE/BLEU for summarization. Use cosine similarity or LLM-as-judge for open-ended tasks where many good answers exist.
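As a rough illustration of similarity-based scoring with no ML dependencies, here is a bag-of-words cosine similarity. This is only a sketch of the metric itself — real systems would compute cosine similarity over dense embeddings from a model, not word counts:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("the cat sat", "the cat ran"))  # shares 2 of 3 words
```

Unlike `SequenceMatcher`, word order doesn't matter here, which is closer in spirit to how embedding similarity treats paraphrases.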
Here's a complete evaluation pipeline that tests a prompt and generates a score report.
import anthropic
import json
from dataclasses import dataclass

client = anthropic.Anthropic()

@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str

SYSTEM_PROMPT = """You are a code reviewer..."""

def evaluate_single_case(test_case: TestCase) -> dict:
    # Get model output
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": test_case.input}],
    )
    actual_output = response.content[0].text

    # Score using LLM-as-judge
    judge_prompt = f"""
Input: {test_case.input}
Expected: {test_case.expected_output}
Actual: {actual_output}

Score accuracy 1-10."""
    judge_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=100,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return {
        "test_id": test_case.id,
        "actual_output": actual_output,
        "judge_score": judge_response.content[0].text,
    }

def run_eval_pipeline(test_cases: list[TestCase]) -> dict:
    results = []
    scores = []
    for test_case in test_cases:
        result = evaluate_single_case(test_case)
        results.append(result)
        # Extract the leading numeric score (simplified; a regex would be more robust)
        raw = result["judge_score"].strip()
        score = int(raw[0]) if raw and raw[0].isdigit() else 5
        scores.append(score)
    return {
        "total_tests": len(test_cases),
        "average_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "results": results,
    }

# Usage:
test_data = [
    TestCase("1", "Review: def x(): pass", "Good but lacks docstring"),
    # ...
]
report = run_eval_pipeline(test_data)
print(json.dumps(report, indent=2))
Evaluation is only useful if you actually use the results to improve. Here's the workflow.
The Improvement Loop
Run the eval, read the lowest-scoring outputs to find failure patterns, adjust the prompt, re-run the eval, and compare against the previous score. Repeat until the score stops improving.
Regression Testing
Keep your eval dataset and re-run it whenever you update the prompt. This catches unintended side effects where a change fixes some cases but breaks others.
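A minimal sketch of a per-test regression check. The assumed format — each run's results as a dict mapping test id to numeric score — is one the pipeline above could easily produce:

```python
def find_regressions(baseline, current, tolerance=0):
    """Return test ids whose score dropped from the baseline run to the current one.

    `baseline` and `current` map test id -> numeric score; `tolerance` allows
    small fluctuations before a drop is flagged as a regression.
    """
    return sorted(
        test_id
        for test_id, old_score in baseline.items()
        if test_id in current and current[test_id] < old_score - tolerance
    )

baseline = {"1": 8, "2": 7, "3": 9}
current = {"1": 8, "2": 5, "3": 9}
print(find_regressions(baseline, current))  # test "2" dropped
```

Comparing per-test rather than only the average matters: a change can raise the average while silently breaking a handful of previously passing cases.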
scores = {
    "v1_baseline": 65,
    "v2_better_system_prompt": 72,
    "v3_added_examples": 78,
    "v4_clarified_format": 81
}

baseline = scores["v1_baseline"]
for version, score in scores.items():
    improvement = score - baseline
    print(f"{version}: {score}/100 (+{improvement})")
Some teams integrate eval pipelines into CI/CD. Every commit to the prompt triggers an eval run. If the score drops below a threshold, the commit is blocked. This prevents regressions from reaching production.
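Such a gate can be a small script in the CI job. This is a sketch: the report format and threshold value are assumptions, and any CI system that treats a nonzero exit code as failure will work:

```python
SCORE_THRESHOLD = 7.0  # hypothetical minimum acceptable average score

def ci_gate(report, threshold=SCORE_THRESHOLD):
    """Return a process exit code for CI: 0 if the eval passed, 1 otherwise."""
    avg = report["average_score"]
    if avg < threshold:
        print(f"FAIL: average score {avg:.2f} below threshold {threshold}")
        return 1
    print(f"PASS: average score {avg:.2f}")
    return 0

# In CI, `report` would come from run_eval_pipeline over the eval dataset;
# calling sys.exit(ci_gate(report)) then blocks the commit on failure.
report = {"average_score": 7.4}
exit_code = ci_gate(report)
```

Pinning the judge model version in CI matters here: otherwise a judge upgrade can shift scores and trip (or mask) the gate without any prompt change.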
1. Why is systematic evaluation better than anecdotal testing?
2. What is the LLM-as-Judge pattern?
3. Which metric is best for evaluating code correctness?
4. What is regression testing in the context of prompts?
Systematic evaluation beats anecdotal testing.
Build an eval dataset with 50+ test cases (input + expected output).
Choose evaluation methods: human eval (most accurate), automated metrics (fast), LLM-as-judge (scalable).
Use the right metric for your task: exact match for code/translation, ROUGE for summaries, cosine similarity for open-ended tasks.
Implement a pipeline that tests your prompt and scores it automatically.
Use results to improve iteratively; track scores over versions to measure progress.
Run regression tests whenever you update the prompt, to catch unintended side effects.
Next up → Topic 11: Project - AI-Powered Code Review Tool
Apply everything from Phase 2 — build a complete prompt-powered application from scratch.