Learn the systematic process of improving prompts — from first draft to production-ready.
Prompt engineering is not a one-shot process. Great prompts are built through iteration: Write → Test → Analyze → Refine → Repeat. Each pass through this cycle compounds into noticeably better output quality.
The Iteration Loop (Detail):
Prompt engineering is like writing code. Your first draft won't be perfect. You test, debug, refactor, and iterate. The difference: with prompts, your "tests" are small API calls, not full test suites.
The Loop in Practice:
```python
import anthropic

client = anthropic.Anthropic()

# Test cases
test_cases = [
    ("I love this!", "positive"),
    ("Terrible product.", "negative"),
    ("It's okay.", "neutral"),
]

def test_prompt(prompt, test_data):
    """Run prompt on test cases and check accuracy."""
    correct = 0
    for text, expected in test_data:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"{prompt}\n\nText: {text}"
            }]
        )
        result = response.content[0].text.strip().lower()
        if expected in result:
            correct += 1
    accuracy = (correct / len(test_data)) * 100
    return accuracy

# Iteration cycle
prompt_v1 = "Classify sentiment: positive, negative, or neutral."
acc_v1 = test_prompt(prompt_v1, test_cases)
print(f"V1: {acc_v1}%")

# Analyze and improve
prompt_v2 = """Classify sentiment. Examples: "love" → positive, "bad" → negative.
Return ONE word: positive, negative, or neutral."""
acc_v2 = test_prompt(prompt_v2, test_cases)
print(f"V2: {acc_v2}%")
```
Define success metrics upfront (accuracy, latency, cost). Without metrics, you're iterating blindly. Measure before and after each change.
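One lightweight way to make "measure before and after" concrete is to record all three metrics per run. A minimal sketch; the type name is my own and the per-million-token prices are illustrative placeholders, not real pricing:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Metrics captured for one prompt version on one test run."""
    accuracy_pct: float   # % of test cases passed
    avg_latency_s: float  # mean seconds per API call
    cost_usd: float       # estimated from token counts

def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_in: float = 3.0,
                  usd_per_m_out: float = 15.0) -> float:
    """Rough cost estimate; the default prices are placeholders."""
    return (input_tokens / 1e6 * usd_per_m_in
            + output_tokens / 1e6 * usd_per_m_out)
```

Logging one `RunMetrics` per prompt version makes regressions in latency or cost visible, not just changes in accuracy.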
Most prompt failures fall into predictable patterns. Learn to recognize them, and you'll debug faster.
Failure Mode 1: Overly Verbose Prompts
Bad:

```
Your job is to carefully read the text and think deeply about whether the sentiment is positive or negative. Please consider all nuances...
```

Better:

```
Classify sentiment: positive or negative. Be concise. One word answer only.
```
Why it fails: Filler instructions bury the actual task and waste tokens. How to fix: Remove filler words. Be specific, not verbose.
Failure Mode 2: Hallucinations
Bad:

```
What is the annual revenue of company X?
```

Better:

```
Based ONLY on this text, extract the annual revenue. If not mentioned, say "Not found".
```
Why it fails: Model makes up information when uncertain. How to fix: Tell it what to do when info is unavailable.
Failure Mode 3: Wrong Format Output
Bad:

```
Return the data as JSON.
```

Better:

```
Return valid JSON (no markdown).
{
  "name": "...",
  "score": 0-100
}
```
Why it fails: "JSON" is ambiguous. Model might return markdown-wrapped JSON. How to fix: Show exact expected format with an example.
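Even with a tight format spec, it is worth parsing defensively on the client side. A sketch that tolerates a markdown-wrapped response; the function name and regex are my own, not from any library:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, stripping a markdown
    code fence if the model added one despite instructions."""
    text = raw.strip()
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```

This keeps the pipeline working while you iterate on the prompt itself; the prompt fix above is still the real solution.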
Failure Mode 4: Ignoring Instructions
Bad:

```
Don't be too creative. Just summarize the text factually.
```

Better:

```
Summarize ONLY facts from the text. Do NOT add interpretations or creative additions.
```
Why it fails: Weak negatives ("don't") are less effective. How to fix: Use system prompt for hard rules. Be explicit.
Failure Mode 5: Lazy Responses
Bad:

```
Summarize the meeting.
```

Better:

```
Summarize the meeting covering:
1. Key decisions (at least 3)
2. Action items with owners
3. Next meeting date
Format: markdown with headers
```
Why it fails: Vague instructions allow low-effort responses. How to fix: Specify exactly what you want. Set expectations.
When you notice a failure, check which mode it is. Then apply the targeted fix. Don't just say "the prompt is bad" — diagnose the specific problem.
When a prompt fails, you need to isolate the cause. Is it the system prompt? The input data? The output format spec? Debugging is detective work.
Systematic Debugging Process:
Debugging Example: "Model is returning markdown when I want JSON"
```python
import anthropic

client = anthropic.Anthropic()

# Debug: Is it the format spec or the system prompt?

# Test 1: Just the format spec (no system prompt)
response_1 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=100,
    messages=[{
        "role": "user",
        "content": 'Return valid JSON: {"name": "..."}'
    }]
)
print("Test 1 (no system):", response_1.content[0].text)

# Test 2: With a vague system prompt
response_2 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=100,
    system="Be helpful",
    messages=[{
        "role": "user",
        "content": 'Return valid JSON: {"name": "..."}'
    }]
)
print("Test 2 (vague system):", response_2.content[0].text)

# Test 3: Strong system prompt about JSON only
response_3 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=100,
    system="Return ONLY valid JSON. No markdown. No explanation.",
    messages=[{
        "role": "user",
        "content": 'Return valid JSON: {"name": "..."}'
    }]
)
print("Test 3 (strong system):", response_3.content[0].text)

# Conclusion: hypothesis confirmed. The system prompt was too weak.
```
Never change multiple things at once. Change one variable, measure, then move to the next. Otherwise, you won't know which change caused the improvement.
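The one-variable-at-a-time discipline can be enforced mechanically. A hypothetical helper that generates run configs differing from a baseline in exactly one field (the config keys here are just an example):

```python
# Hold everything fixed; vary a single factor per experiment.
baseline = {
    "system": "Return ONLY valid JSON.",
    "temperature": 0.0,
    "prompt": "Classify sentiment: positive, negative, neutral.",
}

def variants_of(base: dict, key: str, values: list) -> list[dict]:
    """Build configs that differ from the baseline in exactly one key."""
    return [{**base, key: v} for v in values]

# Only the system prompt changes between runs;
# temperature and user prompt stay fixed.
system_variants = variants_of(baseline, "system", [
    "Be helpful",
    "Return ONLY valid JSON. No markdown.",
])
```

Running each variant against the same test set tells you exactly which factor moved the metric.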
When you have two competing prompt versions, A/B test them on your dataset. Don't rely on gut feeling. Let data decide.
A/B Testing Process:
Python: A/B Testing Code
```python
from anthropic import Anthropic

client = Anthropic()

# Test dataset
test_data = [
    {"text": "I love this!", "expected": "positive"},
    {"text": "Awful experience.", "expected": "negative"},
    {"text": "It's okay.", "expected": "neutral"},
]

def test_prompt_version(prompt, dataset):
    """Test a prompt and return accuracy plus per-case results."""
    results = []
    for item in dataset:
        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": f"{prompt}\n\nText: {item['text']}"
            }]
        )
        result = response.content[0].text.strip().lower()
        correct = item["expected"] in result
        results.append({
            "input": item["text"],
            "expected": item["expected"],
            "output": result,
            "correct": correct
        })
    accuracy = sum(1 for r in results if r["correct"]) / len(results) * 100
    return accuracy, results

# Prompt A (basic)
prompt_a = "Classify sentiment: positive, negative, or neutral."

# Prompt B (improved with examples)
prompt_b = """Classify sentiment: positive, negative, neutral.
Examples:
- "love" → positive
- "hate" → negative
- "ok" → neutral
Return ONE word only."""

# Run A/B test
acc_a, results_a = test_prompt_version(prompt_a, test_data)
acc_b, results_b = test_prompt_version(prompt_b, test_data)
print(f"Prompt A accuracy: {acc_a}%")
print(f"Prompt B accuracy: {acc_b}%")
print(f"Winner: {'B' if acc_b > acc_a else 'A'}")
```
Small improvements (1-2%) might be noise. Look for bigger gaps (5%+) before declaring a winner. On larger datasets (1000+ examples), smaller improvements become meaningful.
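To put a number on "noise vs. real gap", a rough two-proportion z-test is enough. This sketch assumes both prompts ran on the same number of test cases; the function name is my own:

```python
import math

def accuracy_gap_is_significant(acc_a: float, acc_b: float, n: int) -> bool:
    """Rough two-proportion z-test at ~95% confidence.
    acc_a / acc_b are accuracies as fractions in [0, 1];
    n is the number of test cases per variant."""
    p_pool = (acc_a + acc_b) / 2          # pooled accuracy (equal n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return acc_a != acc_b
    z = abs(acc_a - acc_b) / se
    return z > 1.96
```

With 100 cases, a 72% vs. 74% gap is well within noise; with 1000 cases, 72% vs. 85% is clearly real. That matches the intuition above: small gaps need big datasets.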
Track your prompt versions like you track code. Versioning lets you compare, rollback, and understand what changed and why. This is essential in production.
Prompt Changelog Example:
## Sentiment Classifier Prompt Changelog
### v1.0 (2025-01-15)
- Initial: "Classify sentiment: positive, negative, neutral"
- Accuracy: 72%
- Issue: Too vague, model guesses
### v1.1 (2025-01-16)
- Added few-shot examples
- Added "Return ONE word only" constraint
- Accuracy: 85%
- Improvement: +13%
### v1.2 (2025-01-17)
- Moved examples to system prompt
- Added "If unsure, say 'unclear'"
- Accuracy: 87%
- Improvement: +2%
### v2.0 (2025-01-18)
- Full XML structure with tags
- Explicit JSON output format
- Accuracy: 89%
- Improvement: +2%; more robust, production-grade structure
Best Practices for Versioning:
Keep prompts in Git (or equivalent) alongside your code. Tag releases. This lets you roll back to a previous version if a new prompt performs worse.
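In code, this can be as simple as a version-keyed registry checked into the repo. The keys and layout below are one possible convention, not a standard:

```python
# Prompts live in version control next to the code that uses them.
PROMPTS = {
    "sentiment/v1.0": "Classify sentiment: positive, negative, neutral.",
    "sentiment/v1.1": (
        "Classify sentiment: positive, negative, neutral.\n"
        'Examples: "love" -> positive, "hate" -> negative.\n'
        "Return ONE word only."
    ),
}

ACTIVE_VERSION = "sentiment/v1.1"

def get_prompt(version: str = ACTIVE_VERSION) -> str:
    """Look up a prompt by version key; fail loudly on unknown versions."""
    if version not in PROMPTS:
        raise KeyError(f"Unknown prompt version: {version}")
    return PROMPTS[version]
```

Rolling back is then a one-line change to `ACTIVE_VERSION`, and every logged run can record exactly which prompt text it used.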
Let's walk through a real example of iterating on a code review prompt from terrible (v1) to excellent (v5). Notice the pattern of progressive improvements.
V1 (Terrible) — Initial Attempt
```
Review this code.
```
Problem: No specificity. Model gives generic feedback. Accuracy: 30% (misses bugs, gives irrelevant comments)
V2 (Okay) — Add Some Context
```
You are a code reviewer.
Review this Python code for bugs and improvements.
List issues and suggestions.

Code:
[code here]
```
Improvement: More specific, but still loose. Accuracy: 60% (catches some issues, but output is rambling)
V3 (Better) — Add Format & Examples
```
You are a senior Python engineer.
Review code focusing on:
1. Security issues (highest priority)
2. Performance problems
3. Code quality improvements

Output format:
## Security Issues
[list with severity]

## Performance
[specific suggestions]

## Code Quality
[improvements with examples]

Code:
[code here]
```
Improvement: Clear format, prioritization. Accuracy: 75% (catches most issues, better organized)
V4 (Excellent) — Add XML Tags & Guardrails
```
<system_role>
You are a senior Python engineer at a Fortune 500 company with 15 years of experience. Your code reviews are known for catching subtle bugs and suggesting high-impact improvements.
</system_role>

<instructions>
Review this Python code. Flag issues in priority order.
</instructions>

<priorities>
1. Security vulnerabilities (SQL injection, etc.)
2. Race conditions or concurrency issues
3. Memory leaks or performance problems
4. Code quality and style issues
</priorities>

<output_format>
## 🔴 Critical Issues
- Issue 1 (line X): description
- Issue 2 (line Y): description

## 🟡 Warnings
- Warning 1: description

## 💡 Improvements
- Suggestion 1: description

## ✅ Positive Notes
- What's good about the code
</output_format>

<rules>
- Cite specific line numbers
- Provide code examples for fixes
- Be constructive, not dismissive
- Don't comment on style preferences
- If you're unsure, say "needs clarification"
</rules>

Code:
[code here]
```
Improvement: Strong role, clear rules, structured format. Accuracy: 88% (catches nearly all issues, highly useful)
V5 (Optimal) — Add Few-Shot & Error Handling
````
<system_role>
You are a senior Python engineer at a Fortune 500 company with 15 years of experience. Your code reviews are known for catching subtle bugs and suggesting high-impact improvements.
</system_role>

<examples>
Example 1: SQL injection vulnerability
```python
result = db.execute(f"SELECT * FROM users WHERE id = {user_id}")
```
Review: "🔴 Critical security issue on line 2. This is vulnerable to SQL injection. Fix: Use parameterized queries."

Example 2: Good code
```python
def calculate_total(items: list[float]) -> float:
    """Calculate sum of items with validation."""
    return sum(item for item in items if isinstance(item, (int, float)))
```
Review: "✅ Good: Type hints, docstring, and defensive code."
</examples>

<instructions>
Review this Python code. Flag issues in priority order.
</instructions>

... [rest of v4 structure] ...

Code:
[code here]
````
Improvement: Few-shot examples teach desired behavior. Accuracy: 92% (production-ready, consistent, actionable)
Notice each version builds on the previous: v1 → v2 (add context) → v3 (structure) → v4 (XML tags, guardrails) → v5 (examples). This is the path most prompts follow to production.
1. What are the steps of the prompt iteration loop?
2. Which failure mode is fixed by adding "Return ONLY JSON. No markdown"?
3. When debugging a prompt, what should you do?
4. Why should you version control your prompts?
Here's what you've learned:
- **Iteration is everything.** Start simple, test, and refine.
- **Failure modes are predictable.** Learn to recognize and fix them.
- **Debugging is systematic.** Isolate, hypothesize, test one change at a time.
- **A/B testing eliminates guessing.** Let data decide between versions.
- **Versioning prevents chaos.** Track changes like code.
- **Case studies show the path.** Most prompts follow a v1 → v5 progression to production.
Next up → Topic 6: Hands-On — Building with LLM APIs
Write real Python code to call Claude and OpenAI APIs with streaming, error handling, and practical patterns.