What fine-tuning is
A pre-trained LLM has already learned language, facts and reasoning from its enormous training run. Fine-tuning continues that training — briefly, on your much smaller dataset — so the model's weights shift toward your task.
Analogy: a pre-trained model is a doctor fresh out of medical school —
broad knowledge, general skills. Fine-tuning is residency: a few months
of focused practice that turns them into a cardiologist. You didn't
re-teach biology; you specialized behavior that was already there.
The key mental model: fine-tuning changes how the model behaves (style, format, skill at a narrow task), not primarily what it knows. That single sentence decides most fine-tune-or-not debates — see the comparison below.
The decision that matters: prompt vs RAG vs fine-tune
This is the question interviews and real projects actually ask, so take it first. The three tools solve different problems:
| You need… | Reach for | Why |
|---|---|---|
| Better instructions, examples of the task | Prompting (few-shot examples, clearer system prompt) | Free, instant, no training. Always exhaust this first. |
| The model to know your data (docs, tickets, code) | RAG | Knowledge changes daily; retrieval stays current. Fine-tuning is a terrible database. |
| A behavior prompting can't reach: strict output format, brand voice, a narrow skill, a smaller/cheaper model matching a bigger one on your task | Fine-tuning | Behavior lives in weights; examples in the prompt only rent it, training buys it. |
The litmus test:
- "The model doesn't KNOW something" → RAG (or wait for a newer model)
- "The model doesn't DO something the way I want, even with examples in the prompt" → prompt harder; if 50+ good examples still fail → fine-tune

And they compose: production systems often use ALL THREE: a fine-tuned model, fed retrieved context, behind a good prompt.
Two honest costs before you start: fine-tuning freezes knowledge at training time (your data updates ≠ model updates — that's RAG's job), and a fine-tuned model can forget some general ability while specializing (called catastrophic forgetting — mild at low learning rates, real at aggressive ones).
How it works mechanically
Recall from neural networks: training = show examples, measure error, nudge weights downhill (gradient descent). Fine-tuning is literally the same loop with three changes:
- Start from the pre-trained weights instead of random ones.
- Tiny learning rate — you're refining a sculpture, not carving a new one; big steps destroy what pre-training built.
- Your dataset — hundreds to tens of thousands of examples instead of trillions of tokens.
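To make the "same loop, three changes" point concrete, here is a minimal sketch on a toy one-variable linear model rather than an LLM. Everything in it (the `train` and `mse` helpers, the datasets, the learning rates) is an illustrative assumption, not a recipe from any library.

```python
import random

def train(w, b, data, lr, steps):
    """Plain gradient descent on mean squared error for y = w*x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

random.seed(0)
# Stand-in for pre-training: a large, general dataset drawn from y = 2x + 1.
general = [(x, 2 * x + 1) for x in (random.uniform(-1, 1) for _ in range(1000))]
# Stand-in for your dataset: a few examples of slightly different behavior.
task = [(x, 2 * x + 1.5) for x in (random.uniform(-1, 1) for _ in range(20))]

# "Pre-training": start from scratch, large learning rate, lots of data.
w0, b0 = train(0.0, 0.0, general, lr=0.1, steps=500)

# "Fine-tuning": the same loop, but (1) start from the pre-trained weights,
# (2) use a much smaller learning rate, (3) use the small task dataset.
w1, b1 = train(w0, b0, task, lr=0.01, steps=200)

print("task error before:", mse(w0, b0, task))
print("task error after: ", mse(w1, b1, task))
```

The fine-tuned weights end up close to the pre-trained ones while the task error drops, which is the behavior the three changes are meant to produce.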
For chat models, each training example is a conversation showing the exact behavior you want:
{"messages": [
{"role": "system", "content": "You are LandAI's report writer."},
{"role": "user", "content": "Summarize parcel #4412 for a buyer."},
{"role": "assistant", "content": "**Parcel 4412 — Buyer Summary**\n- Zoning: residential R2...\n- Flood risk: low (zone X)...\n- Verdict: suitable for single-family development."}
]}
The assistant turns are the lesson: the model learns to produce that shape of answer in that voice — exactly the formatting consistency that's hard to guarantee through prompting alone.
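Before uploading or training, it helps to check every line of the file mechanically. The sketch below assumes the chat format shown above (one JSON object per line with a `messages` list). The `validate_example` helper and its specific rules are hypothetical, and a real provider may enforce additional constraints.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # The assistant turn is what the model learns from, so it must come last.
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message is not an assistant turn")
    return problems

good = json.dumps({"messages": [
    {"role": "system", "content": "You are LandAI's report writer."},
    {"role": "user", "content": "Summarize parcel #4412 for a buyer."},
    {"role": "assistant", "content": "**Parcel 4412: Buyer Summary**\n- Zoning: residential R2"},
]})
bad = json.dumps({"messages": [{"role": "user", "content": "Summarize parcel #4412."}]})

print(validate_example(good))
print(validate_example(bad))
```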
Full fine-tuning vs LoRA (the interview distinction)
Full fine-tuning updates every weight in the model. For a 70-billion-parameter model that means holding the model, its gradients, and optimizer state in GPU memory: hundreds of gigabytes at minimum, and typically over a terabyte with a standard Adam-style optimizer in mixed precision. That takes multi-GPU rigs and real money. And every specialized variant you train is a complete copy of the model, roughly 140 GB at 16-bit precision.
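A rough back-of-envelope for that memory claim. The per-parameter byte counts are assumptions (16-bit weights and gradients, plus an fp32 master copy and two fp32 Adam moments), activations are ignored, and other precision or optimizer choices would change the totals. The function names are illustrative.

```python
def full_finetune_memory_gb(n_params: float,
                            weight_bytes: int = 2,      # assumed 16-bit weights
                            grad_bytes: int = 2,        # assumed 16-bit gradients
                            optimizer_bytes: int = 12,  # assumed fp32 master copy + two Adam moments
                            ) -> float:
    """Approximate training-state memory in GB, excluding activations."""
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes) / 1e9

def checkpoint_size_gb(n_params: float, weight_bytes: int = 2) -> float:
    """Size of one saved copy of the weights in GB."""
    return n_params * weight_bytes / 1e9

n = 70e9  # the 70-billion-parameter model from the text
print(f"training state: ~{full_finetune_memory_gb(n):.0f} GB")
print(f"one saved copy: ~{checkpoint_size_gb(n):.0f} GB")
```

Under these assumptions the training state alone is far beyond a single GPU, which is the cost LoRA is designed to avoid.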
LoRA (Low-Rank Adaptation) is the technique that made fine-tuning affordable, and the one to know cold:
| | Full fine-tuning | LoRA |
|---|---|---|
| Original weights W | Updated by training, producing W' (a whole new model) | Stay FROZEN |
| What training updates | W itself | Only two small matrices A and B, added alongside: output = W·x + (A·B)·x |
| Parameters trained | 100% | Typically <1% |
| Artifact per task | One new full model | One ~50 MB "adapter" file; swap adapters on the same base model |
Why a low-rank (small) pair of matrices is enough: the change a fine-tune needs to make is far simpler than the model itself. You're adjusting a style, not relearning English. Empirically, LoRA matches or comes close to full fine-tuning on many specialization tasks at a fraction of the cost. QLoRA pushes further: the frozen base model is quantized (weights compressed to 4-bit numbers, lossy but tolerable), letting a 13B-parameter model fine-tune on one consumer GPU.
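The formula output = W·x + (A·B)·x and the "<1% of parameters" figure can be checked with a few lines of plain Python. This is a toy sketch using nested lists instead of a tensor library. The helper names, the 4096×4096 layer size, and rank 8 are illustrative assumptions, and the A/B naming follows the text above.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x):
    """output = W·x + A·(B·x).

    Computing B·x first means the full-size product A·B is never built.
    """
    base = matvec(W, x)               # frozen path
    update = matvec(A, matvec(B, x))  # trainable low-rank path
    return [b + u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in W, parameters in A and B together)."""
    return d_out * d_in, rank * (d_out + d_in)

# Toy 3x3 layer with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen
A = [[0.0], [0.0], [0.0]]   # 3x1, starts at zero so the adapter is a no-op at first
B = [[0.5, 0.5, 0.5]]       # 1x3
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, B, x))   # identical to W·x while A is all zeros

A_trained = [[1.0], [0.0], [0.0]]  # pretend training moved A
print(lora_forward(W, A_trained, B, x))

full, lora = lora_param_counts(4096, 4096, 8)
print(f"trainable fraction for one layer: {lora / full:.2%}")
```

For this single layer the adapter trains well under 1% of the parameters that full fine-tuning would touch, and the frozen `W` is shared by every adapter built on the same base model.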
In practice you'll meet fine-tuning at two levels:
- API fine-tuning (OpenAI/Anthropic-style): upload JSONL, the provider trains and hosts; you never see a GPU. Right answer for most product teams.
- Open-weights fine-tuning (Llama/Mistral + LoRA via Hugging Face PEFT): full control, your hardware or rented GPUs, you own the weights. Right answer for privacy, cost-at-scale, or research.
The workflow (where projects actually succeed or fail)
1. BASELINE: Prompt-engineer the best you can; measure it. (If you can't measure it, stop and see step 2.)
2. EVALS: Build the test FIRST: 50–200 held-out examples plus a scoring method (exact match, rubric, or LLM-as-judge). This is the step everyone skips and everyone regrets.
3. DATA: Collect/curate training examples of the IDEAL behavior. Quality crushes quantity: 500 excellent, consistent, deduplicated examples beat 50,000 scraped ones. Garbage in, garbage amplified.
4. TRAIN: Start small (1–3 epochs, low learning rate, LoRA). Watch validation loss: falling train loss with rising validation loss = memorizing, not learning (overfitting).
5. EVALUATE: Run the step-2 evals on base vs fine-tuned. Also spot-check for regressions on general ability and safety behavior.
6. ITERATE: Most gains come from FIXING DATA, not tuning hyperparameters. Look at the failures; they're usually inconsistencies in your own examples.
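Steps 2 and 5 come down to running the same scoring function over the same held-out set for both models. Below is a minimal exact-match sketch. `base_model` and `tuned_model` are hypothetical stand-ins for real model calls, and the four-item eval set is made up for illustration (a real one would hold 50–200 examples).

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Strict scoring: the answer must equal the reference after trimming whitespace."""
    return prediction.strip() == reference.strip()

def score(model_fn, eval_set) -> float:
    """Fraction of held-out examples the model gets exactly right."""
    hits = sum(exact_match(model_fn(ex["input"]), ex["expected"]) for ex in eval_set)
    return hits / len(eval_set)

# Hypothetical stand-ins for calls to a base model and a fine-tuned model.
def base_model(text: str) -> str:
    return "fix"  # pretend the base model over-predicts one label

def tuned_model(text: str) -> str:
    if text.startswith("add"):
        return "feat"
    if text.startswith("fix"):
        return "fix"
    return "chore"

# Held-out examples: never used for training or for prompt examples.
eval_set = [
    {"input": "add dark mode toggle", "expected": "feat"},
    {"input": "fix crash on empty input", "expected": "fix"},
    {"input": "bump dependencies", "expected": "chore"},
    {"input": "add CSV export", "expected": "feat"},
]

for name, fn in [("base", base_model), ("fine-tuned", tuned_model)]:
    print(f"{name:>10}: {score(fn, eval_set):.0%} exact match")
```

Swapping `exact_match` for a rubric or an LLM-as-judge call changes the scoring method without changing the comparison itself.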
The dirty secret of the field, worth saying in interviews: fine-tuning is a data-curation discipline wearing an ML costume. The model faithfully amplifies whatever your examples teach — including their inconsistencies, biases and mistakes.
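Part of that curation work can be automated. The sketch below counts exact duplicates and groups assistant answers by a crude "shape" so that format inconsistencies stand out. The `audit` helper and its two heuristics (bold title line, bullet lines) are assumptions chosen to fit the report style shown earlier, not a general-purpose checker.

```python
import json
from collections import Counter

def audit(examples):
    """Return (duplicate_count, Counter of answer shapes) for chat-format examples."""
    seen = set()
    duplicates = 0
    shapes = Counter()
    for ex in examples:
        # Canonical JSON string so that key order does not hide duplicates.
        key = json.dumps(ex, sort_keys=True)
        if key in seen:
            duplicates += 1
        seen.add(key)

        answer = ex["messages"][-1]["content"]
        lines = answer.splitlines()
        has_bold_title = bool(lines) and lines[0].startswith("**")
        has_bullets = any(line.startswith("- ") for line in lines)
        shapes[(has_bold_title, has_bullets)] += 1
    return duplicates, shapes

def make(user, assistant):
    """Build one chat-format training example."""
    return {"messages": [{"role": "user", "content": user},
                         {"role": "assistant", "content": assistant}]}

examples = [
    make("Summarize parcel #1.", "**Parcel 1**\n- Zoning: R2\n- Flood risk: low"),
    make("Summarize parcel #1.", "**Parcel 1**\n- Zoning: R2\n- Flood risk: low"),
    make("Summarize parcel #2.", "Parcel 2 is zoned R1 and has low flood risk."),
]

duplicates, shapes = audit(examples)
print("exact duplicates:", duplicates)
for shape, count in shapes.items():
    print("bold title / bullets =", shape, "->", count, "examples")
```

More than one shape in the output is a prompt to decide on a single format and rewrite the outliers before training.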
Production perspective
- The classic economic win: distill a frontier model into a small one — use the big model to generate/grade training data for your narrow task, fine-tune a small model on it, serve at a tenth of the cost and latency. ("Teacher → student" distillation; this pattern powers many production classifiers, routers and extractors.)
- Where it shines in real products: rigid structured output (extraction to your exact JSON schema), tone/brand voice at scale, domain dialects (legal, medical, your codebase's conventions), classification/routing at high volume.
- Where teams waste months: fine-tuning to inject knowledge that belongs in RAG, fine-tuning before exhausting prompting, or fine-tuning without evals and "feeling" the result.
- Serving note: LoRA adapters are swappable per-request on one hosted base model — multi-tenant fine-tunes without multi-model costs.
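The teacher-to-student pattern in the first bullet is, in practice, mostly a data pipeline: the big model labels raw inputs, a grader filters the labels, and the survivors become chat-format training rows for the small model. In this sketch `teacher_label` and `teacher_confident` are hypothetical stubs standing in for real model calls, and the filtering rule is invented for illustration.

```python
import json

def teacher_label(text: str) -> str:
    """Hypothetical stand-in for asking a large 'teacher' model for a label."""
    lowered = text.lower()
    if lowered.startswith("fix"):
        return "fix"
    if lowered.startswith("add"):
        return "feat"
    return "chore"

def teacher_confident(text: str, label: str) -> bool:
    """Hypothetical grader. A real one might be a second model call or a rule check.

    Here: reject inputs too short to label reliably.
    """
    return len(text.split()) >= 2

def build_student_dataset(raw_inputs, system_prompt):
    """Turn raw inputs into chat-format training rows labeled by the teacher."""
    rows = []
    for text in raw_inputs:
        label = teacher_label(text)
        if not teacher_confident(text, label):
            continue  # drop low-quality rows rather than train on them
        rows.append({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]})
    return rows

rows = build_student_dataset(
    ["fix null pointer in parser", "add retry logic", "wip"],
    "Classify the commit message as feat, fix, or chore.",
)
jsonl = "\n".join(json.dumps(row) for row in rows)
print(jsonl)
```

The resulting JSONL is what the small "student" model is fine-tuned on, and the held-out evals decide whether it is good enough to replace the teacher for this task.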
Common mistakes
- No eval set — you cannot improve what you don't measure; vibes are not a metric.
- Fine-tuning for knowledge — it'll hallucinate confidently in your brand voice. Knowledge → RAG; behavior → fine-tuning.
- Inconsistent training data — 500 examples written by five people with three formats teaches the model to be confused with confidence.
- Too many epochs — training loss → 0 while the model memorizes your examples verbatim and loses generality (overfitting, the same enemy as every ML system).
- Skipping the regression check — your model now formats reports perfectly and somehow forgot how to refuse harmful requests; test general + safety behavior, not just the target task.
- Treating it as one-and-done — behavior drifts as your product changes; budget for re-tuning like you budget for migrations.
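The "too many epochs" mistake can be caught mechanically from the loss curves. Below is a minimal sketch of an overfitting check and a patience-based early-stopping rule. The function names, the `patience` default, and the loss values are made up for illustration.

```python
def is_overfitting(train_losses, val_losses) -> bool:
    """True when, over the last epoch, train loss fell while validation loss rose."""
    if len(train_losses) < 2 or len(val_losses) < 2:
        return False
    return train_losses[-1] < train_losses[-2] and val_losses[-1] > val_losses[-2]

def early_stop_epoch(val_losses, patience: int = 2) -> int:
    """Index of the best epoch, stopping once validation loss has failed to
    improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_idx = -1
    bad_epochs = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_idx, bad_epochs = loss, i, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_idx

# Illustrative curves: training loss keeps falling, validation loss turns around.
train_losses = [1.2, 0.8, 0.5, 0.3, 0.1]
val_losses = [1.3, 0.9, 0.85, 0.95, 1.1]

print("overfitting at the end:", is_overfitting(train_losses, val_losses))
print("keep the checkpoint from epoch index:", early_stop_epoch(val_losses))
```

Keeping the checkpoint with the lowest validation loss, rather than the last one, is the usual way to act on this signal.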
Interview perspective
Practice
- Decision reps: for each, pick prompt/RAG/fine-tune and defend it in one sentence: (a) chatbot must answer from your company wiki; (b) extractor must emit your exact JSON schema 100% of the time; (c) model doesn't know a library released last month; (d) support bot should sound like your brand across 50K tickets/day.
- Data drill: write 5 training conversations for the LandAI report style above, then critique your own consistency (format, length, tone) the way step 6 demands.
- Hands-on (API path): take 50 examples of a toy task (e.g., classify commit messages as feat/fix/chore), fine-tune a small hosted model, and compare against few-shot prompting on a held-out 20. The eval table you produce is the deliverable.
- Hands-on (open path): run a LoRA fine-tune of a 7B model with Hugging Face PEFT on the same data; note GPU memory with and without 4-bit quantization.
That completes Level 11. The full roadmap now runs unbroken from what a computer is to salary negotiation — see the roadmap.