Fine-Tuning

Teaching a pre-trained model new behavior — full fine-tuning vs LoRA, when to fine-tune vs RAG vs prompting, data preparation, and evaluation.

Starting from Zero — A Physical Intuition

Before looking at training equations, let's understand fine-tuning and LoRA through two physical analogies:

The Doctor Residency Analogy (Fine-Tuning): A pre-trained base model is like a doctor who has just graduated from medical school. They possess broad knowledge of biology, chemistry, and general medicine. Fine-tuning is their residency: a few months of highly focused training in cardiology. You don't re-teach them basic human anatomy (pre-training); you specialize their behavioral habits and vocabulary for a specific clinic.
The Piano Adapter Analogy (LoRA): Full fine-tuning is like restringing and retuning every single string on a grand piano to play jazz instead of classical—expensive and time-consuming. LoRA (Low-Rank Adaptation) is like attaching a temporary adapter clamp to a few keys. The base piano remains untouched (frozen), but the clamp slightly redirects a few notes to create the jazz sound. When you want classical again, you simply unclip the adapter.

What fine-tuning is

A pre-trained LLM has already learned language, facts and reasoning from its enormous training run. Fine-tuning continues that training — briefly, on your much smaller dataset — so the model's weights shift toward your task.

Analogy: a pre-trained model is a doctor fresh out of medical school —
broad knowledge, general skills. Fine-tuning is residency: a few months
of focused practice that turns them into a cardiologist. You didn't
re-teach biology; you specialized behavior that was already there.

The key mental model: fine-tuning changes how the model behaves (style, format, skill at a narrow task), not primarily what it knows. That single sentence decides most fine-tune-or-not debates — see the comparison below.

The decision that matters: prompt vs RAG vs fine-tune

This is the question interviews and real projects actually ask, so take it first. The three tools solve different problems:

You need…	Reach for	Why
Better instructions, examples of the task	Prompting (few-shot examples, clearer system prompt)	Free, instant, no training. Always exhaust this first.
The model to know your data (docs, tickets, code)	RAG	Knowledge changes daily; retrieval stays current. Fine-tuning is a terrible database.
A behavior prompting can't reach: strict output format, brand voice, a narrow skill, a smaller/cheaper model matching a bigger one on your task	Fine-tuning	Behavior lives in weights; examples in the prompt only rent it, training buys it.

The litmus test:

  "The model doesn't KNOW something"        → RAG (or wait for a newer model)
  "The model doesn't DO something the way   → prompt harder; if 50+ good
   I want, even with examples in the prompt"   examples still fail → fine-tune

  And they compose: production systems often use ALL THREE —
  a fine-tuned model, fed retrieved context, behind a good prompt.

Two honest costs before you start: fine-tuning freezes knowledge at training time (your data updates ≠ model updates — that's RAG's job), and a fine-tuned model can forget some general ability while specializing (called catastrophic forgetting — mild at low learning rates, real at aggressive ones).

How it works mechanically

Recall from neural networks: training = show examples, measure error, nudge weights downhill (gradient descent). Fine-tuning is literally the same loop with three changes:

Start from the pre-trained weights instead of random ones.
Tiny learning rate — you're refining a sculpture, not carving a new one; big steps destroy what pre-training built.
Your dataset — hundreds to tens of thousands of examples instead of trillions of tokens.
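A toy sketch of that loop, assuming a three-weight linear model stands in for the network. The names `pretrained_w`, `target_w`, and `fine_tune` are hypothetical, and the data is synthetic; the point is only to show the three changes in one place.

```python
import numpy as np

def fine_tune(pretrained_w, X, y, lr=0.01, epochs=50):
    """Gradient descent on mean squared error, starting from existing weights."""
    w = pretrained_w.copy()              # change 1: start from pre-trained weights
    for _ in range(epochs):
        error = X @ w - y                # measure error on the small dataset
        grad = 2 * X.T @ error / len(y)  # gradient of the MSE loss
        w -= lr * grad                   # change 2: a tiny step downhill
    return w

rng = np.random.default_rng(0)
pretrained_w = np.array([1.0, -2.0, 0.5])              # hypothetical "pre-trained" weights
target_w = pretrained_w + np.array([0.1, 0.0, -0.1])   # the task needs only a small shift

X = rng.normal(size=(200, 3))                          # change 3: a small task dataset
y = X @ target_w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

tuned_w = fine_tune(pretrained_w, X, y)
print("loss before:", mse(pretrained_w))
print("loss after: ", mse(tuned_w))
print("weight shift:", np.round(tuned_w - pretrained_w, 3))
```

The weights end up close to where they started, which is the behavior the small learning rate is meant to protect.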

For chat models, each training example is a conversation showing the exact behavior you want:

{"messages": [
  {"role": "system", "content": "You are LandAI's report writer."},
  {"role": "user", "content": "Summarize parcel #4412 for a buyer."},
  {"role": "assistant", "content": "**Parcel 4412 — Buyer Summary**\n- Zoning: residential R2...\n- Flood risk: low (zone X)...\n- Verdict: suitable for single-family development."}
]}

The assistant turns are the lesson: the model learns to produce that shape of answer in that voice — exactly the formatting consistency that's hard to guarantee through prompting alone.
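A small sketch of how one such conversation could be serialized for a training file, assuming the common one-JSON-object-per-line layout. The message contents are shortened, hypothetical versions of the example above.

```python
import json

example = {
    "messages": [
        {"role": "system", "content": "You are LandAI's report writer."},
        {"role": "user", "content": "Summarize parcel #4412 for a buyer."},
        {"role": "assistant", "content": "- Zoning: residential R2\n- Flood risk: low (zone X)"},
    ]
}

# One training example must occupy exactly one physical line, so newlines
# inside the content have to stay escaped. json.dumps handles that.
line = json.dumps(example, ensure_ascii=False)
assert "\n" not in line

# Reading the line back recovers the original conversation.
restored = json.loads(line)
print(restored["messages"][-1]["role"])
```

Pretty-printed JSON like the example above is easier to read, but it would need to be collapsed to a single line like this before training.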

Full fine-tuning vs LoRA (the interview distinction)

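Full fine-tuning updates every weight in the model. For a 70-billion-parameter model that means holding the weights, their gradients, and the optimizer state in GPU memory: with a typical mixed-precision Adam setup that is roughly 16 bytes per parameter, or about a terabyte before activations, which means multi-GPU rigs and real money. And every specialized variant you train is a complete copy of the weights (about 140 GB at 16-bit precision).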
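A back-of-the-envelope sketch of that memory bill. It assumes a common mixed-precision Adam setup (2 bytes per weight, 2 bytes per gradient, and 12 bytes of optimizer state per parameter for an fp32 master copy plus two Adam moments) and ignores activations; the function name and defaults are illustrative, not taken from any library.

```python
def full_finetune_memory_gb(n_params: int,
                            weight_bytes: int = 2,
                            grad_bytes: int = 2,
                            optimizer_bytes: int = 12) -> dict:
    """Rough training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    gb = 1e9
    parts = {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer": n_params * optimizer_bytes / gb,
    }
    parts["total"] = sum(parts.values())
    return parts

estimate = full_finetune_memory_gb(70_000_000_000)
for name, size in estimate.items():
    print(f"{name:>10}: {size:,.0f} GB")
```

Under these assumptions the weights alone take about 140 GB and the full training state about 1,120 GB; other optimizers or precisions would change the numbers.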

LoRA (Low-Rank Adaptation) is the technique that made fine-tuning affordable, and the one to know cold:

Full fine-tuning:                LoRA:

  W  (the original weights)       W stays FROZEN
  ↓ training updates W            two small matrices B·A are added
  W' (a whole new model)          alongside: output = W·x + (B·A)·x
                                  ↓ training updates only A and B

  100% of parameters touched      typically <1% of parameters trained
  one new full model per task     one ~50 MB "adapter" file per task —
                                  swap adapters on the same base model

Why a low-rank (small) pair of matrices is enough: the change a fine-tune needs to make is far simpler than the model itself. You're adjusting a style, not relearning English. Empirically, LoRA often matches or comes close to full fine-tuning on specialization tasks at a fraction of the cost. QLoRA pushes further: the frozen base model is quantized (weights compressed to 4-bit numbers, lossy but tolerable), letting a 13B-parameter model fine-tune on one consumer GPU.
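To make the "fraction of the cost" concrete, here is a sketch of the parameter arithmetic for a single weight matrix. The 4096×4096 shape and rank 8 are illustrative assumptions, not figures from any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Trainable parameters for full fine-tuning vs a rank-`rank` LoRA adapter."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_in + d_out)   # A is (rank x d_in), B is (d_out x rank)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# prints: full: 16,777,216  lora: 65,536  ratio: 0.39%
```

The adapter grows linearly with the rank, while the full matrix grows with the product of its dimensions, which is why the ratio stays small for large layers.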

PEFT (Parameter-Efficient Fine-Tuning) Methods Comparison

Method	How it Works	Trained Params	Hardware Req.	Primary Use Case
LoRA	Injects low-rank adapters ($A$ & $B$) next to frozen weight matrices.	< 1%	Medium (Base model size)	Standard task specialization, style alignment
QLoRA	Quantizes the base model to 4-bit NormalFloat, then adds 16-bit LoRA adapters.	< 1%	Low (Fits 13B model on a single consumer GPU)	Cost-sensitive training of open-weights models
Prefix Tuning	Prepends trainable continuous vectors (virtual tokens) directly to key-value (KV) states in attention layers.	1% - 2%	Medium	Multi-task learning (keep base model frozen, swap prefix vectors)
Prompt Tuning	Prepends trainable prompt embeddings (virtual tokens) to the input sequence only.	< 0.1%	Low to Medium	High-throughput classification where system prompts can be learned as vectors

LoRA Forward Pass in Code

Here is a minimal NumPy sketch of how a LoRA layer computes its forward pass, using low-rank matrices A (rank × in_features) and B (out_features × rank) alongside a frozen weight matrix W:

Python

import numpy as np

class LinearLayerWithLoRA:
    def __init__(self, in_features: int, out_features: int, rank: int = 4):
        # 1. Base weights: simulated frozen pre-trained weight matrix W
        self.W = np.random.randn(out_features, in_features) * 0.1
        
        # 2. LoRA matrices: A and B
        # B is initialized to zeros so the adapter contributes nothing at start
        self.A = np.random.randn(rank, in_features) * 0.1
        self.B = np.zeros((out_features, rank))
        
        # Scaling factor: alpha / rank
        self.alpha = 8.0
        self.scaling = self.alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Base forward pass (completely frozen)
        base_output = np.dot(self.W, x)
        
        # LoRA forward pass: (B @ A) @ x * scaling
        # Evaluated as B @ (A @ x) to maintain O(d * r) compute complexity
        lora_output = np.dot(self.B, np.dot(self.A, x)) * self.scaling
        
        # Combined output
        return base_output + lora_output

# Instantiate a layer: 8 input dims, 8 output dims, rank-2 adapter
layer = LinearLayerWithLoRA(in_features=8, out_features=8, rank=2)
x_input = np.random.randn(8)

# Output before training (adapter contributes zero)
out_initial = layer.forward(x_input)

# Simulate training: update B to non-zero values
layer.B = np.random.randn(8, 2) * 0.1
out_adapted = layer.forward(x_input)

print("Initial output (W only):\n", out_initial)
print("Adapted output (W + scaled B @ A):\n", out_adapted)
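Because the adapter is just an additive term, it can also be folded into the base weights for inference and subtracted back out later. The sketch below checks that numerically, using the same shape conventions as the class above; all values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 8, 2, 8.0
scaling = alpha / rank

W = rng.normal(size=(d_out, d_in)) * 0.1
A = rng.normal(size=(rank, d_in)) * 0.1
B = rng.normal(size=(d_out, rank)) * 0.1   # pretend training already moved B off zero
x = rng.normal(size=d_in)

# Adapter kept separate: two extra small matrix-vector products per call.
adapter_out = W @ x + (B @ (A @ x)) * scaling

# Adapter merged: one ordinary matrix, no extra inference cost.
W_merged = W + scaling * (B @ A)
merged_out = W_merged @ x
print(np.allclose(adapter_out, merged_out))   # True

# "Unclipping" the adapter restores the original weights.
W_restored = W_merged - scaling * (B @ A)
print(np.allclose(W_restored, W))             # True
```

Merging trades the ability to swap adapters cheaply for a forward pass that costs the same as the base model.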

Dataset Format (JSONL) & Validation

For instruction tuning, fine-tuning datasets are written in JSON Lines (JSONL) format, where each line is a self-contained JSON object representing a complete conversation history.

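Before submitting a dataset to a training cluster, it is standard practice to validate the file structure (roles, keys, and ideally token lengths) to prevent run-time crashes. The script below covers the structural checks; token-length checks depend on the target model's tokenizer and are not included.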

Python

# validate_dataset.py
import json

def validate_jsonl(file_path: str) -> int:
    """Check each JSONL row for the chat structure; return the number of valid rows."""
    valid_count = 0
    errors = []

    with open(file_path, "r", encoding="utf-8") as f:
        # Number lines from 1 so error messages match what an editor shows
        for idx, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines (e.g. a trailing newline)
            try:
                data = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"Line {idx}: Invalid JSON syntax")
                continue

            if not isinstance(data, dict) or "messages" not in data:
                errors.append(f"Line {idx}: Missing 'messages' key")
                continue

            messages = data["messages"]
            if not isinstance(messages, list):
                errors.append(f"Line {idx}: 'messages' must be a list")
                continue

            # Check every message; one malformed message invalidates the whole row
            row_ok = True
            roles = []
            for msg in messages:
                if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
                    errors.append(f"Line {idx}: Message missing 'role' or 'content'")
                    row_ok = False
                    break
                roles.append(msg["role"])

            if row_ok and ("user" not in roles or "assistant" not in roles):
                errors.append(f"Line {idx}: Must contain at least one user and one assistant message")
                row_ok = False

            if row_ok:
                valid_count += 1

    print(f"Validation summary: {valid_count} valid rows, {len(errors)} errors.")
    for err in errors[:5]:  # show only the first few to keep the output readable
        print(f"  - ERROR: {err}")
    return valid_count

# Example invocation:
# validate_jsonl("my_training_data.jsonl")
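A length check can be layered on top of the structural validation. The sketch below is only an approximation: it takes the token counter as a parameter and uses a whitespace split as a stand-in, whereas a real run would count with the target model's tokenizer. `MAX_TOKENS`, `conversation_length`, and `whitespace_tokens` are hypothetical names.

```python
from typing import Callable, Dict, List

def conversation_length(messages: List[Dict[str, str]],
                        count_tokens: Callable[[str], int]) -> int:
    """Total token count over the content of every message."""
    return sum(count_tokens(m["content"]) for m in messages)

def whitespace_tokens(text: str) -> int:
    # Crude stand-in: real tokenizers usually produce more tokens than this.
    return len(text.split())

rows = [
    {"messages": [{"role": "user", "content": "Summarize parcel 4412."},
                  {"role": "assistant", "content": "Zoning: residential R2."}]},
    {"messages": [{"role": "user", "content": "word " * 600},
                  {"role": "assistant", "content": "ok"}]},
]

MAX_TOKENS = 512  # hypothetical budget; match it to the training sequence length
too_long = [i for i, row in enumerate(rows, start=1)
            if conversation_length(row["messages"], whitespace_tokens) > MAX_TOKENS]
print("Rows over budget:", too_long)   # Rows over budget: [2]
```

Passing the counter in as a function keeps the check independent of any one tokenizer library.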

Fine-Tuning Pipeline with Hugging Face (`SFTTrainer`)

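Below is a Python script sketching how to load a model with 4-bit quantization (QLoRA), apply a LoRA configuration, and run supervised fine-tuning (SFT) using Hugging Face's trl library. Treat it as a starting point rather than production-ready code: SFTTrainer's argument names have changed across trl releases, so check them against the version you have installed.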

Python

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer

# 1. Load dataset from JSONL file
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# 2. QLoRA Quantization Config (loads weights in 4-bit, performs compute in bf16)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4
    bnb_4bit_use_double_quant=True,         # Quantizes quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16   # Training compute datatype
)

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# 3. Prepare quantized model for training
model = prepare_model_for_kbit_training(model)

# 4. LoRA Adapter Config
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # Target self-attention layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# 5. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./qlora_output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
    max_steps=100,
    bf16=True,                               # Use bfloat16 for training stability
    optim="paged_adamw_8bit"                 # 8-bit optimizer states, paged to CPU RAM on GPU memory spikes
)

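# 6. SFTTrainer
# Note: these keyword names follow older trl releases; newer releases move
# the sequence-length setting into SFTConfig and rename `tokenizer`, so
# check the signature of the installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    # No dataset_text_field here: that option expects a column of plain
    # strings. With a "messages" column, SFTTrainer applies the tokenizer's
    # chat template itself.
)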

# Start training
# trainer.train()

Think it through like the interview

Don't just launch a training run — derive whether your task needs domain knowledge injection or instruction adaptation.

Think it through: Instruction Tuning vs Continual Pre-training (Training Regimes & Objectives)

Problem: Design the training strategy to specialize an LLM for two tasks: (1) answering questions about proprietary internal medical trials, and (2) outputting text in a strict patient-chart JSON schema.

1. Evaluate Task 1: Medical Trials. "Can we use LoRA instruction tuning to teach the model all the facts, values, and drug names from our medical trials?"
2. Evaluate Task 2: Strict JSON schema. "Why is LoRA instruction tuning the perfect match to teach the model to output a strict patient-chart JSON schema?"
3. Synthesize the production architecture. "How do we combine both strategies to build a complete medical trials assistant?"

In practice you'll meet fine-tuning at two levels:

API fine-tuning (OpenAI/Anthropic-style): upload JSONL, the provider trains and hosts; you never see a GPU. Right answer for most product teams.
Open-weights fine-tuning (Llama/Mistral + LoRA via Hugging Face PEFT): full control, your hardware or rented GPUs, you own the weights. Right answer for privacy, cost-at-scale, or research.

The workflow (where projects actually succeed or fail)

1. BASELINE   Prompt-engineer the best you can; measure it.
              (If you can't measure it, stop — see step 2.)

2. EVALS      Build the test FIRST: 50–200 held-out examples + a scoring
              method (exact match, rubric, or LLM-as-judge). This is the
              step everyone skips and everyone regrets.

3. DATA       Collect/curate training examples of the IDEAL behavior.
              Quality crushes quantity: 500 excellent, consistent,
              deduplicated examples beat 50,000 scraped ones. Garbage
              in, garbage amplified.

4. TRAIN      Start small (1–3 epochs, low learning rate, LoRA).
              Watch validation loss — falling train loss with rising
              validation loss = memorizing, not learning (overfitting).

5. EVALUATE   Run the step-2 evals on base vs fine-tuned. Also spot-check
              for regressions on general ability and safety behavior.

6. ITERATE    Most gains come from FIXING DATA, not tuning
              hyperparameters. Look at the failures; they're usually
              inconsistencies in your own examples.
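Steps 2 and 5 can be sketched as a tiny exact-match harness. Everything here is a stand-in: `base_model` and `tuned_model` are hypothetical stubs in place of real model calls, and the four examples are invented for a commit-message classifier.

```python
from typing import Callable, Dict, List

def exact_match_score(predict: Callable[[str], str],
                      eval_set: List[Dict[str, str]]) -> float:
    """Fraction of held-out examples where the prediction equals the label."""
    hits = sum(predict(ex["input"]).strip() == ex["expected"] for ex in eval_set)
    return hits / len(eval_set)

# Hypothetical held-out examples; these must never appear in the training data.
eval_set = [
    {"input": "add login page", "expected": "feat"},
    {"input": "fix null pointer in parser", "expected": "fix"},
    {"input": "bump dependency versions", "expected": "chore"},
    {"input": "fix typo in README", "expected": "fix"},
]

def base_model(text: str) -> str:
    # Stand-in for the prompted baseline: always guesses "feat".
    return "feat"

def tuned_model(text: str) -> str:
    # Stand-in for the fine-tuned model: a keyword rule, for illustration only.
    if text.startswith("fix"):
        return "fix"
    if text.startswith("add"):
        return "feat"
    return "chore"

for name, model in [("base", base_model), ("fine-tuned", tuned_model)]:
    print(f"{name:>10}: {exact_match_score(model, eval_set):.0%}")
```

Running the same function over both models on the same held-out set is what makes the comparison in step 5 meaningful; a rubric or LLM-as-judge scorer would slot in where the equality test is.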

The dirty secret of the field, worth saying in interviews: fine-tuning is a data-curation discipline wearing an ML costume. The model faithfully amplifies whatever your examples teach — including their inconsistencies, biases and mistakes.
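Two cheap curation checks, sketched under simple assumptions: near-duplicates are rows that match after lowercasing and whitespace normalization, and "format" is judged only by whether the assistant answer opens with a bold header. Both heuristics and all function names are illustrative.

```python
import hashlib
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def row_key(row) -> str:
    text = "\n".join(f"{m['role']}:{normalize(m['content'])}" for m in row["messages"])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def dedupe(rows):
    """Keep the first occurrence of each normalized conversation."""
    seen, unique = set(), []
    for row in rows:
        key = row_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def opening_styles(rows) -> Counter:
    """Count how assistant answers begin, a rough signal of format drift."""
    return Counter(
        "bold header" if row["messages"][-1]["content"].startswith("**") else "plain"
        for row in rows
    )

rows = [
    {"messages": [{"role": "user", "content": "Summarize parcel 1."},
                  {"role": "assistant", "content": "**Parcel 1**\n- Zoning: R2"}]},
    {"messages": [{"role": "user", "content": "summarize  parcel 1."},
                  {"role": "assistant", "content": "**Parcel 1**\n- Zoning: R2"}]},
    {"messages": [{"role": "user", "content": "Summarize parcel 2."},
                  {"role": "assistant", "content": "Parcel 2 is zoned R2."}]},
]

unique = dedupe(rows)
print(len(rows), "->", len(unique), "rows after dedup")
print(opening_styles(unique))
```

A split count like one bold header and one plain answer is exactly the kind of inconsistency worth fixing before training rather than after.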

Production perspective

The classic economic win: distill a frontier model into a small one — use the big model to generate/grade training data for your narrow task, fine-tune a small model on it, serve at a tenth of the cost and latency. ("Teacher → student" distillation; this pattern powers many production classifiers, routers and extractors.)
Where it shines in real products: rigid structured output (extraction to your exact JSON schema), tone/brand voice at scale, domain dialects (legal, medical, your codebase's conventions), classification/routing at high volume.
Where teams waste months: fine-tuning to inject knowledge that belongs in RAG, fine-tuning before exhausting prompting, or fine-tuning without evals and "feeling" the result.
Serving note: LoRA adapters are swappable per-request on one hosted base model — multi-tenant fine-tunes without multi-model costs.
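The serving note can be pictured with a toy NumPy sketch: one frozen weight matrix shared by everyone, plus a small (A, B) pair looked up per request. The tenant names, shapes, and scaling value are hypothetical, and a real serving stack would batch and cache this very differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, scaling = 8, 2, 4.0

W = rng.normal(size=(d, d)) * 0.1   # one shared, frozen base weight matrix

def make_adapter():
    """Random (A, B) pair standing in for a trained adapter."""
    return rng.normal(size=(rank, d)) * 0.1, rng.normal(size=(d, rank)) * 0.1

adapters = {"tenant_a": make_adapter(), "tenant_b": make_adapter()}

def forward(x, tenant=None):
    """Base forward pass, plus the requesting tenant's adapter if one is given."""
    out = W @ x
    if tenant is not None:
        A, B = adapters[tenant]
        out = out + scaling * (B @ (A @ x))
    return out

x = rng.normal(size=d)
W_before = W.copy()
out_a = forward(x, "tenant_a")
out_b = forward(x, "tenant_b")

print(np.allclose(out_a, out_b))     # different adapters give different outputs
print(np.array_equal(W, W_before))   # the base weights were never modified
```

Only the small adapter matrices differ between tenants, which is where the cost saving over hosting one full model per customer comes from.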

Common mistakes

No eval set — you cannot improve what you don't measure; vibes are not a metric.
Fine-tuning for knowledge — it'll hallucinate confidently in your brand voice. Knowledge → RAG; behavior → fine-tuning.
Inconsistent training data — 500 examples written by five people with three formats teaches the model to be confused with confidence.
Too many epochs — training loss → 0 while the model memorizes your examples verbatim and loses generality (overfitting, the same enemy as every ML system).
Skipping the regression check — your model now formats reports perfectly and somehow forgot how to refuse harmful requests; test general + safety behavior, not just the target task.
Treating it as one-and-done — behavior drifts as your product changes; budget for re-tuning like you budget for migrations.
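The overfitting pattern behind the "too many epochs" mistake can be detected mechanically from the loss curves. This is a sketch with made-up loss values; `patience` and both function names are illustrative choices, not a library API.

```python
def best_epoch(val_losses):
    """Index (0-based) of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

def is_overfitting(train_losses, val_losses, patience=2):
    """True if validation loss rose for `patience` consecutive epochs
    while training loss kept falling over the same epochs."""
    if len(val_losses) <= patience:
        return False
    recent = range(len(val_losses) - patience, len(val_losses))
    return all(
        val_losses[i] > val_losses[i - 1] and train_losses[i] < train_losses[i - 1]
        for i in recent
    )

# Hypothetical per-epoch losses: training keeps improving, validation turns upward.
train = [1.20, 0.80, 0.50, 0.30, 0.15]
val = [1.25, 0.90, 0.70, 0.75, 0.85]

print("overfitting:", is_overfitting(train, val))   # overfitting: True
print("best epoch index:", best_epoch(val))         # best epoch index: 2
```

In practice the checkpoint from the best validation epoch is the one to keep, regardless of how low the training loss goes afterwards.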

Practice

Decision reps: for each, pick prompt/RAG/fine-tune and defend it in one sentence: (a) chatbot must answer from your company wiki; (b) extractor must emit your exact JSON schema 100% of the time; (c) model doesn't know a library released last month; (d) support bot should sound like your brand across 50 K tickets/day.
LoRA Forward Lab: Run the custom Python LoRA simulation code provided above. Change the rank from 2 to 4, print the shapes of A and B, and observe how the matrix dimensions and the trainable parameter count change.
Data drill: write 5 training conversations for the LandAI report style above — then critique your own consistency (format, length, tone) the way step-6 demands.
Hands-on (API path): take 50 examples of a toy task (e.g., classify commit messages as feat/fix/chore), fine-tune a small hosted model, and compare against few-shot prompting on a held-out 20. The eval table you produce is the deliverable.
Hands-on (open path): run a LoRA fine-tune of a 7B model with Hugging Face PEFT on the same data; note GPU memory with and without 4-bit quantization.

Next: LLM Evaluation & Safety — testing and defending production models.
