Learn to evaluate and choose the right LLM for your task — from GPT to Claude to open-source models.
The LLM ecosystem has evolved into a competitive landscape with several major players. Each provider brings different strengths, pricing models, and design philosophies. Understanding the landscape is the first step to making informed choices.
- **OpenAI:** GPT-4o — fast, multimodal flagship. o1 / o3 — reasoning models with chain-of-thought. Strong at coding, general tasks, and a broad tool ecosystem.
- **Anthropic:** Claude Sonnet 4.5 — balanced speed and intelligence. Claude Opus 4 — top-tier reasoning and analysis. Known for safety, long context (200K tokens), and instruction following.
- **Google:** Gemini 2.5 Pro — massive 1M-token context window. Gemini 2.5 Flash — fast and cost-effective. Native multimodal support for text, images, audio, and video.
- **Meta:** Llama 4 — leading open-source family. Free to use; can run locally or on your own infrastructure. Strong community, fine-tuning ecosystem, and full data control.
Picking an LLM is like choosing a vehicle. A sports car (GPT-4o) is fast and flashy. A luxury sedan (Claude Opus 4) is refined and comfortable for long trips. An SUV (Gemini 2.5 Pro) handles massive cargo. And building your own kit car (Llama 4) gives you total control — but you need to do the assembly.
Choosing the right model isn't about finding the "best" one — it's about finding the best fit for your specific requirements. Here are the key factors to evaluate:
| Factor | Description | Example |
|---|---|---|
| Quality / Intelligence | How well the model handles complex reasoning, nuance, and accuracy | Claude Opus 4 and GPT-4o lead on hard reasoning tasks |
| Speed / Latency | Time to first token and overall generation speed | Gemini Flash and GPT-4o-mini respond in under 500ms |
| Cost (per token) | Input and output token pricing — adds up fast at scale | GPT-4o-mini is more than an order of magnitude cheaper than GPT-4o per token |
| Context Window | Maximum tokens the model can process in a single request | Gemini 2.5 Pro: 1M tokens, Claude: 200K, GPT-4o: 128K |
| Multimodal Support | Ability to process images, audio, video, and files | Gemini handles video natively; GPT-4o and Claude handle images |
| Safety / Alignment | How well the model avoids harmful outputs and follows guidelines | Claude is known for strong alignment; important for regulated industries |
| Open Source vs Proprietary | Whether you can inspect, modify, and self-host the model | Llama 4 is open-source; GPT-4o and Claude are proprietary APIs |
Most teams only have 2-3 factors that truly matter. A startup building a chatbot cares most about cost and speed. A legal firm processing contracts cares about accuracy and context window. Identify your non-negotiables first.
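Token pricing is easy to underestimate until you multiply it out. The sketch below estimates monthly spend for a chat workload; the per-million-token prices are illustrative placeholders, not current list prices for any provider:

```python
# Back-of-envelope monthly cost estimate for an LLM workload.
# All prices here are illustrative assumptions, not quoted rates.
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimated monthly spend in dollars for one model."""
    per_request = (input_tokens * price_in_per_m +
                   output_tokens * price_out_per_m) / 1_000_000
    return per_request * requests_per_day * 30

# Hypothetical pricing: a flagship model vs. a small model,
# at 10,000 requests/day with 1,000 input and 500 output tokens each
flagship = monthly_cost(10_000, 1_000, 500, price_in_per_m=2.50, price_out_per_m=10.00)
small = monthly_cost(10_000, 1_000, 500, price_in_per_m=0.15, price_out_per_m=0.60)
print(f"Flagship: ${flagship:,.0f}/mo, small model: ${small:,.0f}/mo")
```

Even with made-up numbers, the shape of the result holds: at scale, the gap between a flagship and a small model is thousands of dollars per month, which is why "cheapest model that meets the quality bar" is such a common strategy.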
The AI community uses standardized benchmarks to compare models. While imperfect, they provide a useful starting point for understanding model capabilities.
| Benchmark | What It Measures | Why It Matters |
|---|---|---|
| MMLU | Massive Multitask Language Understanding — 57 academic subjects | Broad knowledge across domains (history, science, law, etc.) |
| HumanEval | Code generation — writing Python functions from docstrings | Critical for coding assistants and developer tools |
| GPQA | Graduate-level science questions (physics, chemistry, biology) | Tests deep expert-level reasoning |
| MATH | Competition-level mathematics problems | Tests formal reasoning and step-by-step problem solving |
| Arena ELO | Human preference rankings from blind A/B comparisons (Chatbot Arena) | The closest proxy to "what do real users prefer?" |
Benchmarks measure general tasks, but real-world performance often differs from benchmark scores. A model that scores 92% on MMLU might still struggle with your specific domain. Models can also be optimized for benchmarks in ways that don't generalize. Always treat benchmarks as a starting point, not a final answer.
The most reliable evaluation is testing 2-3 candidate models on your actual data and tasks. Create a set of 20-50 representative examples from your real workload, run them through each model, and compare the outputs. This "vibe check" is worth more than any benchmark leaderboard.
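A minimal harness for that side-by-side test might look like the sketch below. The `models` dict maps a name to any callable that takes a prompt and returns a string; the lambdas here are offline stubs so the example runs without API keys — in practice you would plug in real client calls (OpenAI, Anthropic, Ollama, etc.):

```python
# Minimal side-by-side evaluation harness for comparing candidate models
# on your own examples. Model callables are stubbed so this runs offline.
def compare_models(models: dict, examples: list[dict]) -> dict:
    """Run every example through every model; collect outputs for review."""
    results = {name: [] for name in models}
    for ex in examples:
        for name, generate in models.items():
            results[name].append({
                "input": ex["input"],
                "expected": ex.get("expected"),
                "output": generate(ex["input"]),
            })
    return results

# Stub "models" stand in for real API clients here
models = {
    "model_a": lambda prompt: prompt.upper(),
    "model_b": lambda prompt: prompt.lower(),
}
examples = [{"input": "Extract the date", "expected": "..."}]
results = compare_models(models, examples)
print(results["model_a"][0]["output"])  # EXTRACT THE DATE
```

The key design choice is keeping the examples and expected outputs in one place: once they exist, re-running the comparison when a new model ships takes minutes instead of days.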
One of the biggest decisions in model selection is whether to use a proprietary API (OpenAI, Anthropic) or an open-source model (Llama, Mistral) that you can host yourself. Each path has distinct trade-offs.
| Dimension | Proprietary | Open-Source |
|---|---|---|
| Cost | Pay per token, predictable pricing, no infra overhead | Free model weights, but you pay for GPU compute and ops |
| Data privacy | Data goes to a third-party API | Full control — data never leaves your infrastructure. Critical for healthcare, finance, and government |
| Customization | Limited to fine-tuning APIs and prompt engineering | Full fine-tuning, LoRA adapters, quantization, and architecture modifications |
| Quality | Generally highest raw quality on hard tasks | Catching up fast — often sufficient for focused, narrow use cases |
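The cost trade-off can be framed as a break-even calculation: self-hosting has a roughly fixed monthly GPU cost, while API spend scales with tokens. The numbers below are illustrative assumptions, not quoted prices:

```python
# Rough break-even between per-token API pricing and a fixed GPU rental.
# Both prices are illustrative assumptions, not real quotes.
def breakeven_tokens_per_month(gpu_monthly_cost: float,
                               api_price_per_m: float) -> float:
    """Tokens/month above which self-hosting becomes cheaper than the API."""
    return gpu_monthly_cost / api_price_per_m * 1_000_000

# e.g. a $1,500/month GPU server vs. an API at $2 per million tokens
tokens = breakeven_tokens_per_month(gpu_monthly_cost=1500.0, api_price_per_m=2.0)
print(f"Break-even at ~{tokens / 1e6:.0f}M tokens/month")  # ~750M tokens/month
```

This ignores ops time, which is usually the real cost of self-hosting — but it shows why low-volume teams tend toward APIs and high-volume teams consider open-source.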
Here is how to get started with a local open-source model using Ollama:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 4
ollama pull llama4
ollama run llama4 "Explain quantum computing in 2 sentences"
```
And you can integrate it into Python applications just as easily:
```python
# Install: pip install ollama
import ollama

# Chat with a local model — no API key needed
response = ollama.chat(
    model="llama4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in 2 sentences"}
    ],
)
print(response["message"]["content"])

# List available local models
models = ollama.list()
for m in models["models"]:
    print(f"{m['name']} — {m['size'] // 1_000_000_000}GB")
```
Using a proprietary API is like renting an apartment — easy to start, someone else handles maintenance, but you follow their rules. Running an open-source model is like owning a house — more upfront work, but you can renovate however you want, and nobody can raise your rent.
Use this step-by-step framework to systematically choose the right model for any project. Don't skip straight to "which model is best" — start with your requirements.
Here are common use cases mapped to recommended models:
| Use Case | Recommended Models | Why |
|---|---|---|
| Code generation | Claude Sonnet 4.5 / GPT-4o | Strong coding benchmarks, good at following complex specs |
| Creative writing | Claude Opus 4 / GPT-4o | Nuanced language, strong stylistic control, deep reasoning |
| Data extraction | GPT-4o-mini / Claude Haiku | Fast, cheap, and reliable for structured output at high volume |
| Long document analysis | Gemini 2.5 Pro (1M) / Claude (200K) | Massive context windows handle entire books or codebases |
| On-premise / privacy | Llama 4 / Mistral | Open-source weights, full data control, no API dependency |
| Cost-sensitive | GPT-4o-mini / Claude Haiku / open-source | Smallest models that meet quality bar — orders of magnitude cheaper |
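A routing table like the one above can live directly in code as data, which makes the "swap models easily" principle concrete. The model identifiers below follow this guide's naming and are placeholders — map them to whatever identifiers your providers actually expose:

```python
# The use-case table expressed as data: each category maps to candidate
# models, cheapest-adequate first. Names are illustrative placeholders.
ROUTING = {
    "code_generation": ["claude-sonnet-4.5", "gpt-4o"],
    "creative_writing": ["claude-opus-4", "gpt-4o"],
    "data_extraction": ["gpt-4o-mini", "claude-haiku"],
    "long_documents": ["gemini-2.5-pro", "claude-sonnet-4.5"],
    "on_premise": ["llama4", "mistral"],
}

def pick_model(use_case: str, exclude: frozenset = frozenset()) -> str:
    """First candidate for the use case not in `exclude`
    (e.g. models that are down or over budget)."""
    for model in ROUTING.get(use_case, []):
        if model not in exclude:
            return model
    raise ValueError(f"No available model for {use_case!r}")

print(pick_model("data_extraction"))                       # gpt-4o-mini
print(pick_model("data_extraction", frozenset({"gpt-4o-mini"})))  # claude-haiku
```

Because the mapping is plain data, updating it when a better model ships is a one-line change rather than a refactor.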
Keep these principles in mind whenever you're evaluating or choosing a model. They'll save you from common pitfalls.
- **There is no universal "best".** The right model depends on your specific task, budget, and constraints. A $20/month model can outperform a $200/month model for the right use case.
- **Test on your own data.** Always benchmark 2-3 models on YOUR actual data before making a decision. What works for others may not work for you.
- **Smaller is often enough.** Smaller, cheaper models often outperform larger ones for narrow, well-defined tasks. Don't default to the most expensive option.
- **Design for change.** Today's best model may be surpassed next month. Build your system so you can swap models easily — avoid tight coupling to any single provider.
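One common way to avoid tight coupling is a thin adapter interface: each provider implements the same `generate` method, and the rest of the application only sees the interface. The `EchoClient` below is a stand-in so the sketch runs offline; a real adapter would wrap the `openai`, `anthropic`, or `ollama` SDK instead:

```python
# Provider-agnostic client interface: application code depends only on
# the Protocol, so swapping providers is a one-line config change.
from typing import Protocol

class LLMClient(Protocol):
    def generate(self, prompt: str) -> str: ...

class EchoClient:
    """Offline stand-in client; a real adapter would call a provider SDK."""
    def __init__(self, model: str):
        self.model = model

    def generate(self, prompt: str) -> str:
        return f"[{self.model}] {prompt}"

def summarize(client: LLMClient, text: str) -> str:
    # Application logic never mentions a specific provider
    return client.generate(f"Summarize: {text}")

client = EchoClient("llama4")  # swap adapters here without touching summarize()
print(summarize(client, "model selection"))  # [llama4] Summarize: model selection
```

Structural typing via `Protocol` means adapters don't even need to inherit from a base class — anything with a matching `generate` method works.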
1. Which factor matters MOST when choosing between GPT-4o and Claude Sonnet for a code review tool?
2. When would you choose an open-source model over a proprietary one?
3. Why shouldn't you rely solely on benchmarks to choose a model?
Here's what you've learned:
The LLM landscape spans four major families — OpenAI, Anthropic, Google, and Meta's open-source models. Choosing between them depends on your specific constraints: quality, speed, cost, context window, privacy, and multimodal needs. Benchmarks are a useful starting point but no substitute for testing on your own data. Open-source models give you full control at the cost of operational complexity. The most important principle: there is no universally "best" model — only the best model for your task.
Next up → Topic 3: Prompt Structure Basics
You'll learn about roles, system prompts, instruction design, and how to structure prompts that consistently get great results.