Issue 02 · Fine-tuning

Fine-tuning,
step by step.

Take a pre-trained transformer. Change its weights for a specific task. This essay walks every method (SFT, LoRA, QLoRA, DPO, RLHF) with the math, the code, the parameters, and a decision guide for picking the right one.

Prerequisite (recommended): this essay updates the same weights you saw in Issue 01: Transformer, step by step. If those Q·K·V matrices and feed-forward weights don't ring a bell yet, read the transformer essay first. It shows where the weights live and what they do during a forward pass. Here, we change them.
Ground truth: every formula on this page matches its source paper. Every code snippet is real, runnable code against the HuggingFace APIs. Every default hyperparameter is a documented production value. I have hand-checked each. If you spot something wrong, the colophon has my contact. Please tell me.
Step 1

1. What is fine-tuning?

Starting from pre-trained weights, doing more training on smaller task-specific data.

What is this? A pre-trained language model (Llama, Mistral, Qwen) has been trained on trillions of tokens of internet text. It can already complete sentences and reason. Fine-tuning takes this base model and updates its weights on a much smaller dataset (hundreds to millions of examples) so it behaves the way you want: following instructions, mimicking a tone, refusing certain content, solving a specific task.

Why do we need it? The base model is a general predictor of next tokens. To get an instruction-following chatbot, a domain expert (medical, legal), or a model that prefers concise answers, you need to change the weights. Prompting can shift behavior at inference time but doesn't persist; fine-tuning bakes the change into the weights themselves.

Three ways to teach a model

Reading this: a side-by-side of the three options, sorted by how invasive they are.

Method | Changes weights? | Examples needed | Cost | Persists?
Prompting | No | 0 (zero-shot) or 1 to 10 (few-shot) | $0 | Only within one chat
Fine-tuning | Yes (some or all) | Hundreds to millions | $10s to $1000s | Yes, saved in the weights
Training from scratch | All, from random init | Trillions of tokens | $1M+ | Yes, with a 7-figure GPU bill

Fine-tuning is the middle path: cheap enough for individuals, durable enough to ship.

Step 2

2. The post-training pipeline

How a base model becomes a chatbot, in four stages.

What is this? Modern instruction-following models (ChatGPT, Claude, Gemini) are not trained in one step. After pre-training (next-token prediction on the internet), they go through a sequence of fine-tuning stages, each adding a different capability.

Why this order? Each stage assumes the previous one is done. SFT teaches format; preference learning teaches taste; reasoning RL adds chain-of-thought ability. Skip a stage and the next one struggles.

The pipeline

STAGE 1. Pre-training (trillions of tokens): "finish this sentence", predict next token.
STAGE 2. SFT (10K to 1M demos): "answer this question", follow instructions.
STAGE 3. DPO / RLHF (10K to 500K prefs): "prefer A over B", match human taste.
STAGE 4. Reasoning RL (verifiable rewards): "think before answering", long chain of thought.

Stages 1 and 4 are the bookends. Most "fine-tuning" work happens in stages 2 and 3. That is the focus of this essay.

SFT teaches what to say. Preference learning teaches taste. Together, they turn a base model into an assistant.

Step 3

3. SFT: Supervised Fine-Tuning

Train on (instruction, response) pairs. Same loss as pre-training, smaller curated dataset.

What is this? Take pairs of (prompt, ideal response). Tokenize both, concatenate them into one sequence, and train the model to predict the next token at every position, exactly like pre-training, but only on these curated examples. Critically, you only compute loss on the response tokens, not the prompt: the model already saw the prompt; we want to teach it the answer.

Why do we need it? A pre-trained model "knows" Wikipedia and Reddit but doesn't know what kind of text it should produce when asked a question. SFT shows examples: when the input looks like an instruction, the output should look like a helpful answer.

The math

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t \,\in\, \text{response}} \log P_\theta(y_t \mid y_{<t}, x)$$
$x$: the prompt tokens (loss is masked here, not computed).   $y_t$: the t-th response token.   $P_\theta$: the model's predicted probability for that token.   $\theta$: all model weights (or just the LoRA adapters, if using LoRA).   Loss is summed over response tokens only; prompt tokens use ignore_index = -100.
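The masking in the formula above can be sketched in a few lines of plain Python. This is a toy illustration, not TRL's implementation: the helper name `sft_loss` and the 3-token vocabulary are made up for the example, and the logits are assumed to be already aligned with their target tokens (real trainers shift labels by one position and usually average rather than sum).

```python
import math

IGNORE_INDEX = -100  # HuggingFace's "skip this position" label value

def sft_loss(logits, labels):
    """Sum of -log P(label) over positions whose label is not IGNORE_INDEX.

    logits: one list of raw scores per position (one score per vocab entry).
    labels: target token id per position, IGNORE_INDEX on prompt positions.
    """
    total = 0.0
    for scores, target in zip(logits, labels):
        if target == IGNORE_INDEX:
            continue  # prompt token: contributes nothing to the loss
        log_z = math.log(sum(math.exp(s) for s in scores))  # log of softmax denominator
        total += -(scores[target] - log_z)                  # -log softmax(scores)[target]
    return total

# Toy sequence: two prompt positions (masked), then two response positions.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0]
loss = sft_loss(logits, labels)

# Changing the logits at masked prompt positions leaves the loss untouched.
perturbed = [[9.0, -3.0, 1.0], [5.0, 5.0, 5.0]] + logits[2:]
loss_perturbed = sft_loss(perturbed, labels)
```

Only the last two positions contribute, which is the whole point of the -100 sentinel.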

The optimizer is almost always AdamW (Adam with decoupled weight decay). One step:

Plain English first. AdamW keeps two running averages per weight: $m_t$ (average gradient direction, the first moment) and $v_t$ (average squared gradient, the second moment). The first smooths noisy gradients; the second shrinks the step for weights with large or erratic gradients. Both averages start at zero, so the bias-corrected versions $\hat{m}_t$ and $\hat{v}_t$ divide out that initial bias. wd (weight decay) gently pulls weights toward zero each step.

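$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\theta_t = \theta_{t-1} - \mathrm{lr} \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \mathrm{lr} \cdot \mathrm{wd} \cdot \theta_{t-1}$$
$g_t$: gradient at step t.   $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.   $\hat{m}_t$, $\hat{v}_t$ are bias-corrected: $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$.   wd: weight decay (typically 0 or 0.01).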
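To make the update concrete, here is a minimal sketch of one AdamW step for a single scalar weight, written directly from the three equations above. The function name `adamw_step` and the toy numbers are illustrative assumptions; real optimizers apply the same arithmetic elementwise to whole tensors.

```python
import math

def adamw_step(theta, g, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a single scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction (both averages start at 0)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay, both using the previous theta.
    theta_new = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * theta
    return theta_new, m, v

theta0 = 1.0
theta1, m1, v1 = adamw_step(theta0, g=0.5, m=0.0, v=0.0, t=1)
```

On the very first step the bias correction gives m̂ = g and v̂ = g², so the adaptive part of the update is almost exactly lr · sign(g), whatever the gradient's magnitude. That is why the learning rate, not the gradient scale, sets the early step size.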

The code

HuggingFace TRL's SFTTrainer. Real, runnable, current API.

from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_id  = "meta-llama/Llama-3.2-3B"
model     = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset   = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

cfg = SFTConfig(
    output_dir              = "./sft-out",
    num_train_epochs        = 3,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 4,    # effective batch = 16
    learning_rate           = 2e-5,
    lr_scheduler_type       = "cosine",
    warmup_ratio            = 0.03,
    max_length              = 2048,     # TRL >=0.12: use max_length (older versions: max_seq_length)
    packing                 = False,    # incompatible with completion_only_loss
    bf16                    = True,
    logging_steps           = 10,
    save_strategy           = "epoch",
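    completion_only_loss    = True,     # mask prompt tokens; applies to prompt/completion datasets only (for messages-format data such as ultrachat_200k, see assistant_only_loss in the table below)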
)

trainer = SFTTrainer(
    model           = model,
    args            = cfg,
    train_dataset   = dataset,
    processing_class = tokenizer,
)
trainer.train()
trainer.save_model("./sft-final")

The parameters, what each one changes

In order of "tweak this first if results are bad". Production-cited ranges.

Parameter | Typical range | Effect of changing it
learning_rate | 2e-5 to 5e-5 (full FT); 1e-4 to 2e-4 (LoRA) | The single most important knob. Too high: model "forgets" pre-training (catastrophic forgetting); loss spikes. Too low: training never converges. LoRA tolerates a higher LR because the frozen base anchors the model.
num_train_epochs | 1 to 3 | 1 epoch usually suffices for >10K examples. More than 3 typically overfits. With small data (<1K), 5 to 10 can help.
per_device_train_batch_size + gradient_accumulation_steps | effective batch 16 to 128 | Larger effective batch = smoother gradients = more stable training. Use accumulation to simulate large batches on small GPUs (4 × 4 = effective 16).
lr_scheduler_type | "cosine" or "linear" | Cosine is the common choice for SFT. Smooth decay to ~0 prevents end-of-training instability.
warmup_ratio | 0.03 to 0.1 | Linearly ramp LR from 0 to peak over the first 3 to 10% of steps. Without warmup, the v_t term in the math above is unstable in the first ~100 steps and loss can spike.
max_length (max_seq_length in older TRL) | 512 to 4096 | Longer = more context, but attention compute grows quadratically with sequence length (and so does memory without memory-efficient attention kernels). If most examples are short, set this small and use packing.
packing | True | Concatenate short examples into max-length sequences with separator tokens. Removes padding waste; can give 2 to 5x throughput. (The code above sets packing=False because it uses completion_only_loss.)
completion_only_loss | True | Mask prompt tokens with -100 in labels so loss is only on the response (-100 is HuggingFace's sentinel for "skip in cross-entropy loss"). Requires the dataset to have separate prompt/completion columns; for chat-formatted (messages) datasets use assistant_only_loss=True instead.
bf16 / fp16 | bf16 if available | bfloat16 has the same exponent range as float32, so it doesn't need loss scaling; fp16 is faster on older GPUs but needs care. CPU/MPS: keep fp32.
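Two rows of the table are pure arithmetic, so a short sketch may help: the effective batch size, and the learning rate produced by linear warmup followed by cosine decay. The helpers `effective_batch` and `lr_at` are hypothetical names, and the schedule is a simplified version of what the trainer does (library schedulers may round the warmup length differently).

```python
import math

def effective_batch(per_device, grad_accum, num_devices=1):
    """Examples that contribute to each optimizer step."""
    return per_device * grad_accum * num_devices

def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup from 0 to peak_lr, then cosine decay down to 0."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

batch = effective_batch(per_device=4, grad_accum=4)           # the 4 x 4 case from the table
schedule = [lr_at(s, total_steps=1000) for s in range(1001)]  # LR at every step of a 1000-step run
```

With these settings the LR climbs for the first 30 steps, peaks at 2e-5, and then follows half a cosine wave down to zero at the final step.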

Live: SFT in your browser

A tiny 1→16→1 MLP regresses y = sin(2π x) on 32 noisy points, the simplest non-linear regression that needs a hidden layer. Click "Train" to run 80 deterministic SGD steps. Move the sliders to see how each parameter shapes the loss curve.

SFT is just supervised next-token learning, masked to the response. Every other method below is a variation on this theme.

Step 4

4. LoRA: Low-Rank Adaptation

Don't train the whole weight matrix. Train two skinny matrices that, multiplied, equal the change you want.

What is this? A pre-trained weight matrix W0 has shape d×k: for a 7B model's attention projection, that's 4096×4096 = 16M parameters. LoRA freezes W0 and learns a small delta in two skinny matrices: B (shape d×r) and A (shape r×k), with r << min(d,k): typically r = 8 to 64. The effective layer becomes W0 + (α/r) · BA.

Why does it work? Aghajanyan et al. (2020) showed that fine-tuning has a low intrinsic dimension: most of the change can be expressed in a small subspace. Hu et al. (2021), the LoRA paper, hypothesized that the weight updates likewise have low intrinsic rank and exploited this. Empirically, r = 8 often matches full fine-tuning quality.

Why does anyone care? For a 7B-parameter model, full fine-tuning trains 7B params (AdamW's two fp32 moment buffers alone take about 56 GB). LoRA with r = 16 trains roughly 0.1 to 1% of params. Often comparable quality, at a fraction of the cost.

The math

$$h = W_0 x + \Delta W \, x \quad \text{where} \quad \Delta W = \frac{\alpha}{r} \, B A$$
$$B \in \mathbb{R}^{d \times r} \qquad A \in \mathbb{R}^{r \times k} \qquad r \ll \min(d, k)$$
$W_0$: pretrained weights, FROZEN.   $B$: initialized to all zeros so the initial output equals base.   $A$: initialized Kaiming uniform / Gaussian.   $r$: rank, the hyperparameter.   $\alpha$: scaling factor; effective scaling is $\alpha/r$, so doubling r without doubling alpha halves the scale applied to the update.   Forward cost: one extra small matmul. Inference cost (after merging $\Delta W$ into $W_0$): zero.
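The two claims in the definitions above (zero-initialized B leaves the base output unchanged, and merging gives the same result as the adapter branch) can be checked with a tiny pure-Python sketch. The dimensions, the random values, and the helper names are all illustrative; `B_trained` is a random stand-in for whatever training would produce.

```python
import random

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def matmul(X, Y):
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x). W0 is never modified."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))   # two skinny matmuls, never forms the d x k product
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def merge(W0, A, B, alpha, r):
    """Fold the adapter into a single matrix: W0 + (alpha / r) * B A."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * ba for w, ba in zip(w_row, ba_row)]
            for w_row, ba_row in zip(W0, BA)]

rng = random.Random(0)
d, k, r, alpha = 6, 5, 2, 4
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]   # frozen base weight
A = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]    # random init
B = [[0.0] * r for _ in range(d)]                              # zero init
x = [rng.gauss(0, 1) for _ in range(k)]

h_base = matvec(W0, x)
h_init = lora_forward(W0, A, B, x, alpha, r)                   # adapter is a no-op at init

B_trained = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]  # stand-in for a trained B
h_adapter = lora_forward(W0, A, B_trained, x, alpha, r)
h_merged = matvec(merge(W0, A, B_trained, alpha, r), x)        # same output, one matmul
```

Keeping the branch separate during training is what lets the base stay frozen; merging afterwards is what makes inference cost identical to the original model.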

Parameter count math

Concrete: a single 4096×4096 projection layer.

Setup | Params trained | Reduction vs full
Full FT | 4096 × 4096 = 16,777,216 | 1× (baseline)
LoRA, r = 8 | 4096×8 + 8×4096 = 65,536 | 256× smaller (0.39%)
LoRA, r = 16 | 4096×16 + 16×4096 = 131,072 | 128× smaller (0.78%)
LoRA, r = 64 | 4096×64 + 64×4096 = 524,288 | 32× smaller (3.1%)
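The table's arithmetic generalizes to any layer shape. A small sketch (the helper name `lora_param_count` is made up for illustration) that reproduces the three LoRA rows:

```python
def lora_param_count(d, k, r):
    """Trainable parameters for one LoRA-adapted d x k layer: B is d x r, A is r x k."""
    return d * r + r * k

d = k = 4096
full = d * k
rows = []
for r in (8, 16, 64):
    n = lora_param_count(d, k, r)
    rows.append((r, n, full // n, 100 * n / full))
    print(f"r={r:>2}: {n:>7,} params, {full // n}x smaller, {100 * n / full:.2f}% of full")
```

Note the count grows linearly in r, so going from r = 8 to r = 64 multiplies adapter size by 8 while still staying a small fraction of the full matrix.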

The code

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

lora = LoraConfig(
    task_type     = TaskType.CAUSAL_LM,
    r             = 16,                       # rank
    lora_alpha    = 32,                       # often 2 * r
    lora_dropout  = 0.05,
    bias          = "none",
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    # alternative: target_modules = "all-linear"  for max coverage
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()
# expected for this config, roughly: trainable params: 9,175,040 || all params: 3,221,924,864 || trainable%: 0.2848
# now plug `model` into SFTTrainer exactly as before

Parameters, what each does

Parameter | Range | Effect
r (rank) | 4 to 64 | Capacity of the adapter. r = 8 is a common starting point. Higher r = more expressive update + more params + more memory. Empirically: r > 16 rarely helps once data is moderate.
lora_alpha | 8 to 128 (commonly 2·r) | Scaling. Effective update magnitude is alpha / r. Convention: alpha = 2 * r so update magnitude stays consistent across r changes. Some practitioners (rsLoRA) use alpha / sqrt(r) for stability at very high r.
target_modules | see notes | Which linear layers get LoRA. Minimum: ["q_proj", "v_proj"] (cheapest, often enough). Common: all four attention projections. Max: "all-linear" (also includes MLP). More targets = more params, generally better quality.
lora_dropout | 0 to 0.1 | Dropout on the LoRA branch. 0.05 to 0.1 helps prevent overfitting on small datasets. 0 if you have lots of data.
bias | "none" / "all" / "lora_only" | Whether to also train bias terms. Usually "none" (no extra params, no quality loss).
modules_to_save | often ["embed_tokens", "lm_head"] | Modules to fully unfreeze (not adapt with LoRA). Used when extending the vocabulary: new embedding rows must train fully.

Live: watch a LoRA adapter form

A 16×16 frozen weight W0 is shown as a heatmap. Move the rank slider to see B and A grow / shrink. The product BA is the proposed delta. W0 + (α/r) · BA is what the model uses.

LoRA assumes the change you want is low-rank. For most fine-tuning tasks, that assumption holds, and you cut trainable parameters (and their gradients and optimizer state) by roughly 100×.

Step 5

5. QLoRA: LoRA on a quantized base

4-bit the frozen base model, train LoRA adapters in 16-bit. Fit a 70B model on one GPU.

What is this? A 70B model in BF16 weighs ~140 GB (FP32 would be 280 GB), too big for any single consumer GPU. QLoRA (Dettmers et al. 2023) compresses each weight to 4 bits using a custom 16-bucket lookup table called NF4, holds the base in 4-bit memory, dequantizes blocks on the fly during the forward pass, and keeps LoRA adapters in 16-bit so the gradient is precise. Result: a 70B model fits on one 48 GB GPU. A 7B model fits on a 16 GB consumer GPU.

Quantization in plain English. Take the original 32-bit weight, find the closest of 16 fixed values (the "buckets"), and store only the 4-bit index of that bucket. NF4 is information-theoretically optimal under the assumption that, after absmax normalization (dividing every weight in a small group by the largest |w| in that group), the weights are zero-mean Gaussian on [-1, 1]. Block-wise scaling: chop the matrix into chunks of 64 weights and rescale each chunk independently so a single huge outlier doesn't ruin the bucket spacing for everyone. Double quantization: the per-block scale factors are themselves quantized, in second-level blocks of 256, saving an extra ~0.37 bit per parameter.

Why does it not destroy quality? The three tricks above stack: NF4 fits the actual weight distribution; block-wise scaling localizes outliers; double quantization claws back overhead. Empirically QLoRA matches LoRA quality on a wide range of benchmarks.

The math

For each block of 64 weights: scale = max(|w|), ŵ = w / scale
Q(ŵ) = bucket_value, closest of 16 NF4 buckets to ŵ
w̃ = scale · Q(ŵ)     (the recovered, dequantized weight)
The 16 NF4 buckets are FIXED constants from the QLoRA paper, designed so that for ŵ ~ N(0, 1) the buckets are equally probable. They are not uniform, not symmetric in spacing, but optimal for this distribution.

The 16 NF4 bucket values (verbatim from the QLoRA paper / bitsandbytes)

Notice the spacing: tighter near 0 (where most weight values live), looser at the tails.
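As a concrete illustration of the block-wise recipe in the math above, here is a pure-Python sketch that quantizes one small block to NF4 indices and recovers it. The 16 constants are reproduced from memory of the bitsandbytes source and should be checked against the library before being relied on; the helper names and the 8-weight toy block (real blocks hold 64 weights) are assumptions for the example.

```python
# NF4 lookup table (assumed to match bitsandbytes; verify before reuse).
NF4 = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def nearest_bucket(value):
    """Index (0..15) of the NF4 constant closest to a value in [-1, 1]."""
    return min(range(len(NF4)), key=lambda i: abs(NF4[i] - value))

def quantize_block(block):
    """Return (scale, indices): one absmax scale plus a 4-bit index per weight."""
    scale = max(abs(w) for w in block)
    if scale == 0.0:
        return 0.0, [NF4.index(0.0)] * len(block)
    return scale, [nearest_bucket(w / scale) for w in block]

def dequantize_block(scale, indices):
    return [scale * NF4[i] for i in indices]

block = [0.02, -0.5, 0.13, 0.0, 0.5, -0.31, 0.27, -0.04]
scale, indices = quantize_block(block)
recovered = dequantize_block(scale, indices)
gaps = [b - a for a, b in zip(NF4, NF4[1:])]   # spacing between neighboring buckets
```

The `gaps` list makes the spacing claim visible: the smallest gaps sit next to zero and the largest at the two tails. Exact zero and the block's largest-magnitude weights are recovered without error because 0 and ±1 are themselves buckets.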

Memory: how much you save

Sizes for loading a Llama-7B base model. Quantization shrinks the resident weight footprint; LoRA shrinks the trainable param + optimizer state.
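The weight-footprint numbers follow from bits-per-parameter arithmetic, sketched below under stated assumptions: a round 7e9 parameters, fp32 per-block scales without double quantization, and 8-bit scales plus one fp32 second-level constant per 256 blocks with it. It counts resident base weights only, not activations, LoRA adapters, or optimizer state.

```python
def weight_gb(n_params, bits_per_param):
    """Resident size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9                 # round figure for a "7B" model (assumption)
BLOCK, SECOND_LEVEL = 64, 256  # block sizes described in the text above

nf4_bits = 4 + 32 / BLOCK                                  # 4-bit index + one fp32 scale per block
nf4_dq_bits = 4 + 8 / BLOCK + 32 / (BLOCK * SECOND_LEVEL)  # 8-bit scales + fp32 second-level scales
saved_bits = nf4_bits - nf4_dq_bits                        # what double quantization buys

sizes = {
    "fp32": weight_gb(N_PARAMS, 32),
    "bf16": weight_gb(N_PARAMS, 16),
    "nf4": weight_gb(N_PARAMS, nf4_bits),
    "nf4 + double quant": weight_gb(N_PARAMS, nf4_dq_bits),
}
for name, gb in sizes.items():
    print(f"{name:>20}: {gb:6.2f} GB")
```

The `saved_bits` value works out to about 0.37 bits per parameter, which is where the double-quantization figure quoted earlier comes from.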

The code

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit              = True,
    bnb_4bit_quant_type       = "nf4",        # vs "fp4"; nf4 is better for normal weights
    bnb_4bit_use_double_quant = True,         # extra ~0.37 bit/param savings
    bnb_4bit_compute_dtype    = torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config = bnb,
    device_map          = "auto",
)
model = prepare_model_for_kbit_training(model)   # freezes base, casts norms to fp32

lora = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear",
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# now use SFTTrainer / DPOTrainer as before, only the LoRA adapters train

Parameters

Parameter | Default | Effect
load_in_4bit | True | Enable 4-bit quantization. The headline feature.
bnb_4bit_quant_type | "nf4" | NF4 (NormalFloat4) is optimized for normally-distributed weights. "fp4" is uniform float-4, usually slightly worse quality.
bnb_4bit_use_double_quant | True | Quantize the per-block scaling constants too. Saves ~0.37 bits/param at essentially no quality cost.
bnb_4bit_compute_dtype | torch.bfloat16 | The dtype dequantized weights are cast to during the forward pass. bf16 is the default; fp16 works but watch for overflow.
paged optimizers | paged_adamw_8bit | Use bitsandbytes' paged optimizer (optim="paged_adamw_8bit" in TrainingArguments) to page optimizer state out to CPU memory when the GPU is under memory pressure. This prevents OOM during training spikes.

QLoRA is just LoRA with a quantized base. The quality loss is small; the memory savings are huge. For most users with one GPU, this is the default.

Step 6

6. DPO: Direct Preference Optimization

Given pairs of (chosen, rejected), nudge the model toward chosen and away from rejected. No reward model needed.

What is this? SFT teaches the model to imitate good responses. DPO (Rafailov et al. 2023) teaches it to prefer one response over another. You give it triples (prompt, chosen, rejected); it adjusts weights so the chosen response becomes more likely than the rejected one. It optimizes the same KL-regularized objective as RLHF (under the Bradley-Terry preference model), but in closed form: no reward model and no PPO.

Why use it after SFT? SFT lacks a way to express "this answer is OK but this other answer is better". Preference data fills that gap. DPO is now the standard alignment recipe: SFT first, then DPO on a few thousand preference pairs, done.

The reference model. DPO needs a frozen "reference" copy of the model, usually the SFT checkpoint. The loss is computed in terms of log-probability differences from this reference. Without a reference, the model can drift arbitrarily.

The math

Notation primer. log π(y|x) is the model's log-probability of producing response y given prompt x: at each token the model predicts a probability; we take logs (multiplications become additions) and sum across the response. Log-probs are always ≤ 0; closer to 0 = more likely. σ is the sigmoid $\sigma(z) = 1/(1+e^{-z})$, squashing any real number into (0,1). The Bradley-Terry model derives a preference probability from a score difference (think Elo). KL (Kullback-Leibler divergence) measures distance between two probability distributions; β controls how much the trained model can drift from the reference.

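$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\log \sigma(\beta \cdot \Delta)$$
$$\Delta = \big(\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x)\big) - \big(\log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\big)$$
$x$: prompt; $y_w$: chosen ("winner") response; $y_l$: rejected ("loser") response.   $\pi_\theta$: the model being trained; $\pi_{\mathrm{ref}}$: the frozen reference (= SFT checkpoint).   $\log \pi(y \mid x)$: sum of log-probs over response tokens.   $\beta$: KL strength; a larger β keeps the trained model closer to the reference, a smaller β lets it drift further.   $\sigma$: sigmoid; the loss is the negative log-likelihood of the Bradley-Terry preference model.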

Intuitively: Δ is "how much more does the current model prefer chosen-over-rejected, compared to the reference". If Δ > 0, we're already there; if Δ < 0, we're worse than the reference. The loss penalizes Δ << 0 sharply (sigmoid).
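The loss for a single preference pair fits in a few lines. This is a toy sketch with made-up sequence log-probs, not TRL's implementation; `dpo_loss` is a hypothetical helper that takes the four summed log-probabilities from the formula above.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Return (loss, delta) for one pair, given summed response log-probs.

    pi_w / pi_l: trained model's log-prob of the chosen / rejected response.
    ref_w / ref_l: the frozen reference model's log-probs of the same responses.
    """
    delta = (pi_w - ref_w) - (pi_l - ref_l)
    z = beta * delta
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable form.
    loss = math.log1p(math.exp(-abs(z))) + max(-z, 0.0)
    return loss, delta

# Start of training: the model equals the reference, so delta = 0.
at_start = dpo_loss(pi_w=-12.0, pi_l=-15.0, ref_w=-12.0, ref_l=-15.0)
# Model has raised chosen and lowered rejected relative to the reference.
improved = dpo_loss(pi_w=-10.0, pi_l=-18.0, ref_w=-12.0, ref_l=-15.0)
# Model has moved the wrong way.
worse = dpo_loss(pi_w=-14.0, pi_l=-13.0, ref_w=-12.0, ref_l=-15.0)
```

At the start the loss is exactly log 2 ≈ 0.693 regardless of β, because σ(0) = 0.5. It falls below that as Δ grows positive and rises above it as Δ goes negative.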

The code

from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# both: the SFT checkpoint
model     = AutoModelForCausalLM.from_pretrained("./sft-final", torch_dtype="bfloat16")
ref_model = AutoModelForCausalLM.from_pretrained("./sft-final", torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("./sft-final")

# preference dataset format: {prompt, chosen, rejected}
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

cfg = DPOConfig(
    output_dir          = "./dpo-out",
    num_train_epochs    = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 8,
    learning_rate       = 5e-7,           # MUCH smaller than SFT
    lr_scheduler_type   = "cosine",
    beta                = 0.1,            # KL strength; the key DPO hyperparameter
    loss_type           = "sigmoid",      # standard DPO; "ipo", "hinge", "robust" are variants
    max_length          = 1024,
    max_prompt_length   = 512,
    bf16                = True,
)

trainer = DPOTrainer(
    model           = model,
    ref_model       = ref_model,
    args            = cfg,
    train_dataset   = dataset,
    processing_class = tokenizer,
)
trainer.train()

Parameters

Parameter | Range | Effect
beta | 0.01 to 0.5 (default 0.1) | Most important DPO hyperparameter. It sets the strength of the implicit KL penalty: higher beta keeps the model closer to the reference (more conservative, slower to adopt the preferences); lower beta lets the model move further from the reference to satisfy the preferences (stronger effect, more risk of drift and forgetting). Try 0.1 first.
learning_rate | 5e-7 to 5e-6 | Much lower than SFT. DPO is sensitive: a 1e-5 LR often wrecks the model. Start at 5e-7.
loss_type | sigmoid / ipo / hinge / robust | "sigmoid" is the original DPO. "ipo" is more conservative (no over-fit on easy preferences). "robust" is noise-tolerant for label-flipped data. (KTO has its own KTOTrainer for unpaired single-sample preferences.)
num_train_epochs | 1 to 3 | Usually 1. DPO over-trains quickly: if the chosen response is already 10x more likely than the rejected one, more training risks regression.
max_prompt_length | ~half of max_length | Bound on prompt portion. Avoids long prompts swallowing the response budget.

Live: a single DPO step on a tiny preference pair

Watch the loss change as you move the β slider. Toy log-probs are deterministic so you can compute the result yourself.

DPO converts pairwise preferences directly into weight updates: the simpler successor to RLHF, and usually a good enough one.

Step 7

7. PPO & RLHF, briefly

The classic alignment recipe. Largely replaced by DPO; included for context.

What is this? RLHF (Reinforcement Learning from Human Feedback) was the original recipe behind InstructGPT and the early ChatGPT. Three stages: (1) SFT; (2) train a separate reward model on preference pairs; (3) use PPO to optimize the policy against the reward model, with a KL penalty pulling toward the SFT reference.

Why bother knowing it? Some teams still use it, especially when reward signals are sparse or heterogeneous. And reasoning RL methods (GRPO, RFT) inherit its structure.

Why DPO replaced most uses. No reward model to train and maintain. No PPO instability. Far simpler infrastructure. Empirical performance is comparable on most benchmarks.

The PPO math (clipped surrogate objective)

RL primer first. The policy π is the model: at each step it samples an action $a_t$ (the next token) from a state $s_t$ (prompt + tokens so far). After the response, a reward R is scored by the reward model. The value $V(s_t)$ comes from a separate value network (in RLHF, typically a copy of the language model with a scalar head) estimating expected future reward; subtracting it gives the advantage, how much better the action taken was than the policy's average. The importance ratio $r_t(\theta)$ = (new prob)/(old prob) measures how far the policy moved; clipping prevents drift in any one step. γ (discount) and λ (GAE smoothing) propagate later rewards back to earlier tokens.

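$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta), 1-\epsilon, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$
$$\delta_t = r_t + \gamma \cdot V(s_{t+1}) - V(s_t) \qquad \text{(TD residual)}$$
$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \, \delta_{t+l} \qquad \text{(GAE)}$$
$r_t(\theta)$: importance ratio (new policy / old policy); the bare $r_t$ in the TD residual is the per-token reward, a different quantity.   $\hat{A}_t$: advantage estimate via Generalized Advantage Estimation.   $\epsilon$: clip threshold (typically 0.2); prevents the policy from changing too much per step.   $\gamma$: discount (typically 1.0 in RLHF); $\lambda$: GAE smoothing (typically 0.95).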
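A minimal sketch of the two pieces above, GAE and the clipped surrogate term, in plain Python. The function names and the three-token toy episode are assumptions for illustration; `values` carries one extra entry for the state after the final token, taken to be 0 here.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Advantage per token via GAE, computed with a backward recursion.

    rewards: per-token rewards r_t.
    values: V(s_t) for each token, plus one trailing entry for the terminal state.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running                  # sum of (gamma*lam)^l * delta_{t+l}
        advantages[t] = running
    return advantages

def clipped_term(ratio, advantage, eps=0.2):
    """One sample's contribution to the clipped surrogate objective (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Reward arrives only at the last token; the value head predicts 0.5 everywhere.
advantages = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5, 0.0])

gain_capped = clipped_term(ratio=1.5, advantage=1.0)    # upside capped at (1 + eps) * A
loss_capped = clipped_term(ratio=0.5, advantage=-1.0)   # capped at (1 - eps) * A
not_capped = clipped_term(ratio=1.5, advantage=-1.0)    # min keeps the pessimistic, unclipped value
```

The last case shows why the objective takes a min: clipping removes the incentive to push the ratio further in a helpful direction, but never hides a move that made things worse.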

Where the KL penalty actually goes. In RLHF the per-token reward is $r_t = R_{\mathrm{RM}} \cdot \mathbb{1}[t = \text{last}] - \beta_{\mathrm{KL}} \cdot \log\big( \pi_\theta(a_t \mid s_t) / \pi_{\mathrm{ref}}(a_t \mid s_t) \big)$: the reward-model score is paid at the final token, and a per-token KL penalty pulls every step back toward the reference. The KL enters via the reward, not the advantage.
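The reward shaping described above is simple enough to sketch directly. The helper `shaped_rewards` and the toy log-probs are illustrative assumptions; the default of 0.05 mirrors the kl_coef value used in the config sketch in this section.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta_kl=0.05):
    """Per-token rewards: a KL penalty at every token, plus the RM score on the last one.

    logp_policy / logp_ref: log-prob of each sampled token under the policy / reference.
    """
    rewards = [-beta_kl * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score   # the reward model scores the whole response once, at the end
    return rewards

rewards = shaped_rewards(
    rm_score=1.0,
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.5, -1.5],
)
```

Tokens the policy makes more likely than the reference does are taxed, tokens it makes less likely earn a small bonus, and only the final token carries the reward-model score. These per-token rewards are what feed the TD residual in the GAE formula.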

The code (sketch; PPO is significantly more complex to set up)

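from trl import PPOTrainer, PPOConfig
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("./sft-final")
# PPO samples its own responses, so it only needs prompts, tokenized to "input_ids".
# Reusing the prompts of the preference set here is an illustrative choice.
raw     = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
dataset = raw.map(lambda ex: {"input_ids": tokenizer(ex["prompt"])["input_ids"]},
                  remove_columns=raw.column_names)
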
policy        = AutoModelForCausalLM.from_pretrained("./sft-final")
ref_policy    = AutoModelForCausalLM.from_pretrained("./sft-final")        # frozen
# reward_model and value_model both have a scalar regression head (num_labels=1),
# not a causal LM head. TRL's PPOTrainer expects sequence-classification models here.
reward_model  = AutoModelForSequenceClassification.from_pretrained("./reward-model", num_labels=1)
value_model   = AutoModelForSequenceClassification.from_pretrained("./sft-final",    num_labels=1)

cfg = PPOConfig(
    learning_rate     = 1e-5,
    cliprange         = 0.2,
    cliprange_value   = 0.2,
    kl_coef           = 0.05,
    gamma             = 1.0,
    lam               = 0.95,
    num_ppo_epochs    = 4,
    per_device_train_batch_size = 8,
    mini_batch_size   = 2,
    num_mini_batches  = 4,
)
# Note: PPOTrainer's API has changed across TRL versions; the snippet above is
# illustrative, consult the TRL docs for the exact arg names in your version.

trainer = PPOTrainer(cfg, processing_class=tokenizer,
                     model=policy, ref_model=ref_policy,
                     reward_model=reward_model, value_model=value_model,
                     train_dataset=dataset)
trainer.train()

PPO/RLHF works and is real, but the operational overhead is high. Reach for it only if DPO underperforms on your task.

Step 8

8. Catastrophic forgetting

The model gets better at your task, worse at everything else.

What is this? When you fine-tune on a narrow distribution, the model "forgets" capabilities it acquired during pre-training. Train Llama on medical Q&A and it gets better at medicine, worse at writing poetry, math, foreign languages. The phenomenon is called catastrophic forgetting; the more the weights move, the worse it gets.

Why does it happen? Gradient descent optimizes the training loss, period. It has no incentive to preserve capabilities that aren't in the training distribution. Pre-training accumulated those capabilities by sheer data diversity; SFT on a narrow set actively erodes them.

Mitigations, in increasing order of effectiveness

Mitigation | How it helps
Lower learning rate | Smaller weight updates = less drift = less forgetting. Cheapest fix; first thing to try.
Fewer epochs | Stop before overfitting. 1 epoch on a moderate dataset is often enough.
Replay buffer | Mix in 5 to 20% pre-training-style data with your fine-tuning data. Keeps the model exposed to the original distribution.
KL regularization | Add a term penalizing divergence from the base model. DPO and PPO have this built in via the reference model.
LoRA | Best mitigation by far. Base weights are frozen; only adapters change. Worst case, you can disable the adapter and recover the base model exactly.
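Of the mitigations in the table, the replay buffer is the one that is purely a data-preparation step, so a small sketch may help. The helper `mix_with_replay` and the placeholder strings are hypothetical; in practice the two inputs would be your task dataset and a sample of general, pre-training-style text.

```python
import random

def mix_with_replay(task_examples, general_examples, replay_fraction=0.1, seed=0):
    """Return a shuffled list in which about replay_fraction of the items are general data."""
    rng = random.Random(seed)
    # Solve n_replay / (n_task + n_replay) = replay_fraction for n_replay.
    n_replay = round(len(task_examples) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_replay)
    rng.shuffle(mixed)   # interleave so every batch sees some general data
    return mixed

task = [f"medical-qa-{i}" for i in range(90)]          # placeholder task examples
general = [f"general-text-{i}" for i in range(1000)]   # placeholder replay pool
mixed = mix_with_replay(task, general, replay_fraction=0.1)
n_general = sum(1 for item in mixed if item.startswith("general-text-"))
```

Here 90 task examples plus 10 sampled general examples give a 10% replay share, inside the 5 to 20% range the table suggests.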

Live: forgetting curve, full FT vs LoRA

Synthetic illustration: as training proceeds, both approaches improve task accuracy. Full FT loses general accuracy; LoRA preserves it.

If you don't have a reason to do full fine-tuning, do LoRA. The base model stays intact.

Step 9

9. Which method should I use?

An interactive decision tree. Answer 3-5 questions, get a concrete recipe.

What is this? A flowchart for picking a fine-tuning method. Click your answer to each question; it advances to the next. The leaves are concrete recipes (method + hyperparameters + code template).

The combinations matter. SFT alone gives you instruction-following. SFT then DPO gives you alignment. SFT then DPO with QLoRA fits on consumer hardware.

Step 10

10. Method comparison matrix

All seven methods, side by side. Concrete numbers, not vibes.

Most production teams in 2026 ship with: QLoRA + SFT + DPO. That's the canonical recipe.

Step 11

11. Common mistakes

Things that look fine until they don't.

Mistake | What goes wrong | Fix
Wrong learning rate | Too high: loss spikes, weights diverge, model forgets pre-training. Too low: loss never moves. | Start at 2e-5 (full FT) or 2e-4 (LoRA). Bisect from there.
No validation set | You'll overfit and not notice until the model is shipped and bad. | Hold out 5 to 10% of data. Eval every N steps. Stop on val plateau.
Train/test contamination | Inflated benchmarks; user-facing failures. | Verify dataset splits are disjoint by hash. Especially with public datasets.
Wrong chat template | Model produces garbage at inference because special tokens / role markers don't match training. | Always use tokenizer.apply_chat_template(...) at both train and inference time.
Prompt loss not masked | Model wastes capacity learning to predict the prompt instead of the response. | Set completion_only_loss=True in SFTConfig (older TRL versions: DataCollatorForCompletionOnlyLM).
DPO without an SFT base | Reference model = base, far from useful behavior. DPO drifts unpredictably. | Always SFT first. DPO is a polish, not a starter.
No LoRA on the embedding when extending vocab | New tokens have random embeddings the LoRA can't fix. | Add modules_to_save=["embed_tokens", "lm_head"] to LoraConfig.
Eval too rarely | You don't notice catastrophic forgetting until it's deep. | Eval every 100 or 500 steps on a held-out general benchmark (MMLU, GSM8K).
Saving full optimizer state with QLoRA | Defeats the memory savings. | Use optim="paged_adamw_8bit"; save only the adapter, not the full state dict.
Forgetting bf16 / fp16 | Fine-tuning a 7B model in fp32 OOMs even on big GPUs. | torch_dtype=torch.bfloat16 at load; bf16=True in TrainingArguments.
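The "disjoint by hash" check from the contamination row can be sketched with the standard library. The normalization (lowercasing and collapsing whitespace) and the helper names are assumptions; this catches exact and near-exact duplicates only, not paraphrases.

```python
import hashlib

def fingerprint(text):
    """Hash of a lightly normalized string: lowercase, whitespace collapsed."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_overlap(train_texts, test_texts):
    """Return the test examples whose fingerprint also appears in the training set."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in test_texts if fingerprint(t) in train_hashes]

train = ["What is LoRA?", "Explain DPO in one sentence."]
test = ["what is  lora?", "Explain RLHF in one sentence."]
leaked = find_overlap(train, test)
```

Running this before training and failing loudly on a non-empty `leaked` list is cheap insurance against inflated eval numbers.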

Half of fine-tuning bugs are formatting bugs. The rest are LR bugs. Eval often.

Step 12

12. End-to-end recipe: QLoRA SFT, then DPO

A real, copy-pasteable script for the canonical 2026 recipe.

This is what most production teams ship. Llama-3.1-8B base, QLoRA 4-bit, SFT on instructions, DPO on preferences. Runs on a single 24 GB GPU.

# requirements: transformers, trl, peft, bitsandbytes, datasets, accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig
from datasets import load_dataset

BASE = "meta-llama/Llama-3.1-8B"

# === Step 1: shared 4-bit base loader ===
def load_4bit_base():
    bnb = BitsAndBytesConfig(
        load_in_4bit              = True,
        bnb_4bit_quant_type       = "nf4",
        bnb_4bit_use_double_quant = True,
        bnb_4bit_compute_dtype    = torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(BASE,
        quantization_config = bnb, device_map = "auto")
    model = prepare_model_for_kbit_training(model)
    return model

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

# === Step 2: SFT on instructions ===
sft_model = load_4bit_base()
sft_model = get_peft_model(sft_model, LoraConfig(
    r              = 16,
    lora_alpha     = 32,
    lora_dropout   = 0.05,
    target_modules = "all-linear",
    task_type      = "CAUSAL_LM",
))

sft_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

sft_trainer = SFTTrainer(
    model = sft_model,
    args  = SFTConfig(
        output_dir                  = "./sft",
        num_train_epochs            = 1,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        learning_rate               = 2e-4,
        lr_scheduler_type           = "cosine",
        warmup_ratio                = 0.03,
        max_length                  = 2048,    # TRL >=0.12: max_length (older: max_seq_length)
        packing                     = False,   # incompatible with completion_only_loss
        bf16                        = True,
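        completion_only_loss        = True,    # applies to prompt/completion datasets only; for messages-format data such as ultrachat_200k, see assistant_only_loss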
        optim                       = "paged_adamw_8bit",
        logging_steps               = 10,
        save_strategy               = "epoch",
    ),
    train_dataset    = sft_data,
    processing_class = tokenizer,
)
sft_trainer.train()
sft_trainer.save_model("./sft-adapter")     # adapter only; ~42M LoRA params here, on the order of 100 MB

# === Step 3: DPO on preferences ===
# Reload base + SFT adapter as the trainable model. DPO continues updating the
# same LoRA matrices SFT trained, not a fresh adapter.
# Alt for clean separation:
#   dpo_model = dpo_model.merge_and_unload()
#   dpo_model = get_peft_model(dpo_model, fresh_lora_cfg)
dpo_model = load_4bit_base()
dpo_model = PeftModel.from_pretrained(dpo_model, "./sft-adapter", is_trainable=True)
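# ref_model=None: with PEFT, TRL disables the adapter to compute reference
# log-probs, which avoids holding a second full copy of the model in memory.
# Caveat: with the adapter disabled, the reference is the plain base model,
# not the SFT model. Use the merge alternative above if SFT must be the reference.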

dpo_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

dpo_trainer = DPOTrainer(
    model     = dpo_model,
    ref_model = None,
    args      = DPOConfig(
        output_dir                  = "./dpo",
        num_train_epochs            = 1,
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        learning_rate               = 5e-7,
        lr_scheduler_type           = "cosine",
        beta                        = 0.1,
        loss_type                   = "sigmoid",
        max_length                  = 1024,
        max_prompt_length           = 512,
        bf16                        = True,
        optim                       = "paged_adamw_8bit",
    ),
    train_dataset    = dpo_data,
    processing_class = tokenizer,
)
dpo_trainer.train()
dpo_trainer.save_model("./dpo-adapter")

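About 80 lines. Two adapter checkpoints saved: ./sft-adapter after SFT and ./dpo-adapter after DPO (the same LoRA matrices, trained further). Total cost: from a few hours upward on a single GPU, depending on the GPU and how much of each dataset you use. Output: a model that follows instructions and reflects the preference data.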
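To sanity-check the numbers in the recipe above, here is a back-of-envelope sketch in plain Python. It assumes the layer shapes from the public Llama-3.1-8B config (hidden size 4096, MLP size 14336, 1024-dim key/value projections, 32 blocks) and assumes that "all-linear" targets the seven projection matrices in each block but not the output head. The helper names are illustrative, not part of any library.

```python
# Assumed Llama-3.1-8B shapes (taken from the public model config, not from this essay).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 1024, 32
R = 16  # LoRA rank used in the recipe

# (in_features, out_features) for each linear layer inside one transformer block.
LINEAR_SHAPES = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj":   (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) next to each frozen weight."""
    return r * (d_in + d_out)

def effective_batch(per_device, grad_accum, num_gpus=1):
    """Number of sequences that contribute to one optimizer step."""
    return per_device * grad_accum * num_gpus

per_block = sum(lora_params(i, o, R) for i, o in LINEAR_SHAPES.values())
total = per_block * LAYERS

print(total)                                  # 41943040 trainable LoRA parameters
print(f"{total * 2 / 1e6:.0f} MB in bf16")    # 84 MB in bf16
print(f"{total * 4 / 1e6:.0f} MB in fp32")    # 168 MB in fp32
print(effective_batch(4, 4))                  # SFT: 16 sequences per step
print(effective_batch(2, 8))                  # DPO: 16 preference pairs per step
```

Under these assumptions the adapter is about 0.5% of the 8B base parameters, and both stages step the optimizer on 16 examples at a time even though their per-device batch sizes differ.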

Step 13

13. What we left out

Honest list of techniques worth knowing about.

  • GRPO / RFT. Reinforcement learning for reasoning, with group-relative advantages and verifiable rewards. The publicly documented recipe behind DeepSeek-R1, and the likely family of methods behind other modern reasoning models.
  • DoRA. Decomposes LoRA's update into magnitude + direction; small quality bump over plain LoRA.
  • IA3. Even smaller than LoRA: it learns only vectors that rescale keys, values, and feed-forward activations in each layer. Lighter, lower ceiling.
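  • Prompt tuning / prefix tuning. Train soft prompt embeddings (prompt tuning) or per-layer prefix vectors (prefix tuning) while the model weights stay frozen. Cheaper than LoRA, weaker quality. Useful when one frozen base model has to serve many tasks. It still needs gradients through the model, so it does not work through closed, text-only APIs.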
  • Adapter layers. Insert small bottleneck modules inside each transformer layer. Pre-LoRA approach; LoRA largely replaced it, partly because a LoRA update can be merged into the weights and adds no inference latency.
  • Distillation. Train a small "student" model to mimic a larger "teacher". Different goal: shrinking, not specializing.
  • Continued pre-training. Run the pre-training objective on more domain data before any SFT. Helps when the domain is far from the base model's training distribution.
  • Preference-optimization variants. KTO (unpaired good/bad labels instead of preference pairs), IPO (a regularized, more conservative DPO), SimPO (no reference model needed), ORPO (combines SFT and preference loss in one step).
  • Spectrum. Selectively unfreeze the layers with the highest signal-to-noise ratio, measured on their weight matrices, and keep the rest frozen. Middle ground between LoRA and full FT.
  • Curriculum learning. Order training data by difficulty. Helps in some specialized settings (math, code); in general fine-tuning it often makes no difference and can hurt.
  • GaLore. Project gradients onto a low-rank subspace before the optimizer step. The paper reports optimizer-state memory cut by up to 65.5% (82.5% with 8-bit optimizer states) while still updating the full weights (unlike LoRA).
  • LISA. Layer-wise Importance Sampled AdamW: every few steps, randomly pick a few layers to unfreeze and keep the rest frozen. LoRA-class memory; the paper reports quality on par with or better than LoRA and close to full FT.
  • PiSSA / OLoRA. Initialize LoRA adapters from the principal singular vectors of W0 (PiSSA) or via QR decomposition (OLoRA) instead of LoRA's default (one matrix random, the other zero). Both report faster convergence and stronger final quality.
  • LongLoRA / MoRA. LongLoRA combines LoRA with shifted sparse attention to extend context length cheaply; MoRA uses a square high-rank update matrix that beats LoRA on memorization-heavy tasks (not to be confused with GPU memory).
  • Unsloth. Drop-in optimized kernels and a friendlier API on top of HuggingFace + bitsandbytes. The project claims roughly 2x faster QLoRA training with lower VRAM and no quality change.
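To make one item from the list concrete, here is a minimal NumPy sketch of the initialization difference behind PiSSA. It follows the idea as described in the PiSSA paper (split W0 by SVD, train the top-r part, freeze the residual) and is not the PEFT implementation. The function names, matrix sizes, and the 0.02 scale for the random matrix are illustrative assumptions.

```python
import numpy as np

def lora_init(d_out, d_in, r, rng):
    """Standard LoRA-style init: A random, B zero, so the update B @ A starts at 0."""
    A = rng.normal(0.0, 0.02, size=(r, d_in))   # scale is an illustrative choice
    B = np.zeros((d_out, r))
    return B, A

def pissa_init(W0, r):
    """PiSSA-style init: adapters hold the top-r singular directions of W0.

    Returns the frozen residual W_res and trainable B, A with W_res + B @ A == W0.
    """
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_s            # (d_out, r), columns scaled by sqrt(singular value)
    A = sqrt_s[:, None] * Vt[:r]     # (r, d_in), rows scaled by sqrt(singular value)
    W_res = W0 - B @ A               # what is left after removing the principal part
    return W_res, B, A

rng = np.random.default_rng(0)
W0 = rng.normal(size=(8, 6))         # toy stand-in for a pre-trained weight matrix
r = 2

B_lora, A_lora = lora_init(8, 6, r, rng)
W_res, B_pissa, A_pissa = pissa_init(W0, r)

# Both setups reproduce W0 exactly at step 0; they differ in what is trainable.
print(np.allclose(W0 + B_lora @ A_lora, W0))       # True: LoRA update starts at zero
print(np.allclose(W_res + B_pissa @ A_pissa, W0))  # True: residual + principal part
```

In both cases the model's output is unchanged at step 0. The difference is that the PiSSA-style adapters start out holding the largest singular directions of W0, so the first gradient steps act on those directions directly.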

The architecture you saw in Issue 01 is what gets fine-tuned. The methods on this page are how. The list above covers the variations and extensions worth knowing about.

Five canonical methods cover 95% of real fine-tuning. The list above is what the other 5% looks like.