Measure, score, and systematically improve your prompts — because you can't improve what you can't measure.
Many teams rely on "gut feel" to judge prompt quality: "That feels right" or "I tried a few versions and this one seemed best." This is anecdotal and unreliable. In production, you need systematic evaluation.
Testing on 5 examples doesn't tell you if your prompt works on 100,000. A prompt that "feels good" might actually be brittle. You need data and metrics, not intuition.
Measure improvements: know whether a change actually helped.
Catch regressions: detect when an update breaks previously working cases.
Compare versions: choose objectively between prompt variants.
Track trends: see whether quality improves or degrades over time.
Build confidence: deploy to production knowing your prompt has been validated.
Create a test dataset (20-50+ examples) with expected outputs. Run your prompt against all test cases. Score the results automatically or with human eval. Track scores over time as you iterate.
There are three main approaches to evaluating prompts — human evaluation, automated metrics, and LLM-as-judge — plus a hybrid of them. Each has trade-offs.
Human evaluation. Pros: most accurate, catches nuance. Cons: expensive, slow, doesn't scale to thousands of examples.
Automated metrics. Pros: fast, cheap, reproducible. Cons: can miss subtle issues, requires good metric selection.
LLM-as-judge. Pros: understands nuance, scales well. Cons: costs API calls, can be biased by the judge model's behavior.
Hybrid. Pros: combines the strengths of the others. Cons: most complex to set up.
An analogy, with a restaurant kitchen:
Human eval: a food critic tastes the dish.
Automated metrics: check whether it's hot, the pH is neutral, etc.
LLM-as-judge: another chef tastes it and scores it.
Hybrid: metrics, plus one critic for edge cases.
The foundation of good evaluation is a good test dataset. This takes time to create but pays off.
Structure: Input → Expected Output
For each test case, you need an input (prompt) and the expected or ideal output. Store as JSON for easy programmatic access.
[
  {
    "id": "1",
    "input": "Summarize: The Great Gatsby is a novel...",
    "expected_output": "A 1920s-set novel about wealth, love, and the American Dream",
    "category": "literature"
  },
  {
    "id": "2",
    "input": "Translate to Spanish: Hello, how are you?",
    "expected_output": "Hola, ¿cómo estás?",
    "category": "translation"
  },
  {
    "id": "3",
    "input": "Fix this code: def add(a, b return a + b",
    "expected_output": "def add(a, b):\n    return a + b",
    "category": "code"
  }
]
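Loading and sanity-checking a dataset in this shape takes only a few lines of standard-library Python. This is a sketch: the required keys match the example above, and in practice the JSON text would come from a file rather than an inline string.

```python
import json

REQUIRED_KEYS = {"id", "input", "expected_output", "category"}

def load_cases(json_text):
    """Parse test cases from JSON text and check each entry has the required keys."""
    cases = json.loads(json_text)
    for case in cases:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"test case {case.get('id', '?')} missing keys: {missing}")
    return cases

# In practice: load_cases(open("test_cases.json", encoding="utf-8").read())
sample = '''[
  {"id": "1", "input": "Translate to Spanish: Hello",
   "expected_output": "Hola", "category": "translation"}
]'''
cases = load_cases(sample)
print(f"Loaded {len(cases)} test cases")
```

Validating keys up front means a malformed entry fails loudly at load time instead of silently skewing scores mid-run.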
Dataset Size Recommendations
Start with 20-50 examples — enough to begin iterating — and grow the set as you discover new failure modes. Don't just test happy paths. Include edge cases: empty inputs, very long inputs, ambiguous requests, requests in different languages, etc. This catches bugs before they hit production.
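For example, a few hypothetical edge-case entries in the same JSON shape as the dataset above (the expected outputs here are illustrative assumptions, not canonical answers):

```json
[
  {
    "id": "edge-1",
    "input": "Summarize: ",
    "expected_output": "Please provide the text you would like summarized.",
    "category": "edge-empty-input"
  },
  {
    "id": "edge-2",
    "input": "Translate to Spanish: Hello, how are you? Merci beaucoup!",
    "expected_output": "Hola, ¿cómo estás? ¡Muchas gracias!",
    "category": "edge-mixed-language"
  }
]
```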
Use Claude itself to judge the quality of outputs. It's surprisingly effective and understands nuance that simple metrics miss.
The Judge Prompt
Write a system prompt that describes what good output looks like. Give it the input and output to evaluate. Ask it to score and explain.
import anthropic
import json

judge_system = """
You are an expert evaluator of code quality. Score code on:
1. Correctness (does it work?)
2. Readability (is it clear?)
3. Efficiency (is it optimal?)

Return a JSON object with:
- score (1-10)
- reasoning (brief explanation)
- issues (list of problems, or empty)
"""

client = anthropic.Anthropic()

def evaluate_code(code_to_review):
    judge_input = f"""
Evaluate this code:
```python
{code_to_review}
```
Return JSON with score, reasoning, and issues."""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        system=judge_system,
        messages=[{"role": "user", "content": judge_input}],
    )

    # Parse the JSON response
    result_text = response.content[0].text
    try:
        result = json.loads(result_text)
        return result["score"]
    except (json.JSONDecodeError, KeyError):
        print("Failed to parse judge response")
        return None
The judge model can have bias. Different models might score differently. To mitigate: use the same judge model for all evaluations, be explicit in the judge prompt, double-check with human validation on a subset.
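Another way to dampen judge noise is to sample the judge more than once and aggregate. A minimal sketch — `judge_with_retries` is a hypothetical helper, not part of the Anthropic SDK; `judge_fn` stands in for any wrapper around a judge call like `evaluate_code` above:

```python
import statistics

def judge_with_retries(judge_fn, output, runs=3):
    """Call a scoring function several times and return the median score.

    `judge_fn` maps an output string to a numeric score. Taking the median
    of several runs makes a single outlier judgment less influential than
    a plain average would.
    """
    scores = [judge_fn(output) for _ in range(runs)]
    return statistics.median(scores)

# Illustration with a deterministic stand-in judge:
fake_scores = iter([7, 8, 3])
score = judge_with_retries(lambda _: next(fake_scores), "some output")
print(score)  # 7 — the median of [7, 8, 3]
```

The trade-off is cost: three judge calls per test case triples the evaluation bill, so this is best reserved for tasks where judge variance is actually observed.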
For certain tasks, automated metrics work well. They're fast and deterministic. Choose the right metric for your task.
Exact match: output == expected? Binary yes/no. Good for: translation, structured output. Bad for: open-ended tasks.
BLEU: n-gram overlap with a reference. Standard for translation. Range: 0-100; higher is better.
ROUGE: recall-oriented overlap with reference summaries. Good for: summarization.
Cosine similarity: embedding-based similarity. Captures semantic meaning. Range: 0-1. Good for: open-ended text.
from difflib import SequenceMatcher

# Exact match
def exact_match(output, expected):
    return 1.0 if output.strip() == expected.strip() else 0.0

# Sequence similarity (for closeness)
def similarity_score(output, expected):
    return SequenceMatcher(None, output, expected).ratio()

# Usage:
output = "The Great Gatsby is about wealth and love in the 1920s"
expected = "A novel exploring American Dream and 1920s wealth"
score1 = exact_match(output, expected)       # 0.0
score2 = similarity_score(output, expected)  # 0.42
Use exact match for deterministic tasks (code, translation, math). Use ROUGE/BLEU for summarization. Use cosine similarity or LLM-as-judge for open-ended tasks where many good answers exist.
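As a rough illustration of similarity-based scoring with no ML dependencies, here is a bag-of-words cosine similarity. This is only a sketch of the metric itself — real systems would compute cosine similarity over dense embeddings from a model, not word counts:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts (0.0 to 1.0)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("the cat sat", "the cat ran"))  # shares 2 of 3 words
```

Unlike `SequenceMatcher`, word order doesn't matter here, which is closer in spirit to how embedding similarity treats paraphrases.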
Here's a complete evaluation pipeline that tests a prompt and generates a score report.
import anthropic
import json
from dataclasses import dataclass

client = anthropic.Anthropic()

@dataclass
class TestCase:
    id: str
    input: str
    expected_output: str

SYSTEM_PROMPT = """You are a code reviewer..."""

def evaluate_single_case(test_case: TestCase) -> dict:
    # Get model output
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=500,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": test_case.input}],
    )
    actual_output = response.content[0].text

    # Score using LLM-as-judge
    judge_prompt = f"""
Input: {test_case.input}
Expected: {test_case.expected_output}
Actual: {actual_output}

Score accuracy 1-10."""
    judge_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=100,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return {
        "test_id": test_case.id,
        "actual_output": actual_output,
        "judge_score": judge_response.content[0].text,
    }

def run_eval_pipeline(test_cases: list[TestCase]) -> dict:
    results = []
    scores = []
    for test_case in test_cases:
        result = evaluate_single_case(test_case)
        results.append(result)
        # Extract the leading numeric score (simplified; a regex would be more robust)
        raw = result["judge_score"].strip()
        score = int(raw[0]) if raw and raw[0].isdigit() else 5
        scores.append(score)
    return {
        "total_tests": len(test_cases),
        "average_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "results": results,
    }

# Usage:
test_data = [
    TestCase("1", "Review: def x(): pass", "Good but lacks docstring"),
    # ...
]
report = run_eval_pipeline(test_data)
print(json.dumps(report, indent=2))
Evaluation is only useful if you actually use the results to improve. Here's the workflow.
The Improvement Loop
Run the eval, read the lowest-scoring outputs to find failure patterns, adjust the prompt, re-run the eval, and compare against the previous score. Repeat until the score stops improving.
Regression Testing
Keep your eval dataset and re-run it whenever you update the prompt. This catches unintended side effects where a change fixes some cases but breaks others.
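A minimal sketch of a per-test regression check. The assumed format — each run's results as a dict mapping test id to numeric score — is one the pipeline above could easily produce:

```python
def find_regressions(baseline, current, tolerance=0):
    """Return test ids whose score dropped from the baseline run to the current one.

    `baseline` and `current` map test id -> numeric score; `tolerance` allows
    small fluctuations before a drop is flagged as a regression.
    """
    return sorted(
        test_id
        for test_id, old_score in baseline.items()
        if test_id in current and current[test_id] < old_score - tolerance
    )

baseline = {"1": 8, "2": 7, "3": 9}
current = {"1": 8, "2": 5, "3": 9}
print(find_regressions(baseline, current))  # test "2" dropped
```

Comparing per-test rather than only the average matters: a change can raise the average while silently breaking a handful of previously passing cases.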
scores = {
    "v1_baseline": 65,
    "v2_better_system_prompt": 72,
    "v3_added_examples": 78,
    "v4_clarified_format": 81
}

baseline = scores["v1_baseline"]
for version, score in scores.items():
    improvement = score - baseline
    print(f"{version}: {score}/100 (+{improvement})")
Some teams integrate eval pipelines into CI/CD. Every commit to the prompt triggers an eval run. If the score drops below a threshold, the commit is blocked. This prevents regressions from reaching production.
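Such a gate can be a small script in the CI job. This is a sketch: the report format and threshold value are assumptions, and any CI system that treats a nonzero exit code as failure will work:

```python
SCORE_THRESHOLD = 7.0  # hypothetical minimum acceptable average score

def ci_gate(report, threshold=SCORE_THRESHOLD):
    """Return a process exit code for CI: 0 if the eval passed, 1 otherwise."""
    avg = report["average_score"]
    if avg < threshold:
        print(f"FAIL: average score {avg:.2f} below threshold {threshold}")
        return 1
    print(f"PASS: average score {avg:.2f}")
    return 0

# In CI, `report` would come from run_eval_pipeline over the eval dataset;
# calling sys.exit(ci_gate(report)) then blocks the commit on failure.
report = {"average_score": 7.4}
exit_code = ci_gate(report)
```

Pinning the judge model version in CI matters here: otherwise a judge upgrade can shift scores and trip (or mask) the gate without any prompt change.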
1. Why is systematic evaluation better than anecdotal testing?
2. What is the LLM-as-Judge pattern?
3. Which metric is best for evaluating code correctness?
4. What is regression testing in the context of prompts?
Systematic evaluation beats anecdotal testing.
Build an eval dataset with 50+ test cases (input + expected output).
Choose evaluation methods: human eval (most accurate), automated metrics (fast), LLM-as-judge (scalable).
Use the right metric for your task: exact match for code/translation, ROUGE for summaries, cosine similarity for open-ended tasks.
Implement a pipeline that tests your prompt and scores it automatically.
Use results to improve iteratively; track scores over versions to measure progress.
Run regression tests whenever you update the prompt, to catch unintended side effects.
Next up → Topic 11: Project - AI-Powered Code Review Tool
Apply everything from Phase 2 — build a complete prompt-powered application from scratch.