Chapter 03

Fine-tuning

When the base model is half-blind to your network

My network's vocab isn't in any model.

LoRA-fine-tune DistilBERT on a synthetic 5-class log-intent dataset; watch accuracy climb from random to 90%+.

LoRA, DistilBERT, HuggingFace

The pain you have already met

You finished chapter 02 with a working embedding model that found synonyms across log lines. You also met the failure mode that this chapter exists to solve: the model is half-blind to your network's vocabulary. Phrases your team says every day — "blackholing the VIP," "draining the edge," "WRED on the egress queue" — sit nowhere meaningful in the model's geometry. The general embeddings treat them as exotic noise. Cosine similarity between "drain the edge" and "empty the bucket" might come back at 0.4 when in your team's vocabulary they refer to the same operational move.

You finished chapter 02 with one option: live with it. This chapter is the second option: bend the model.

The question this chapter answers: given a base model that was trained on the internet, how do I make it speak my team's language, on my hardware budget?

We will not train a model from scratch. Training from scratch is the GPU-readiness pain Cisco's AI Readiness Index quantifies — most enterprises do not have the power, cooling, or capital. We will fine-tune. Fine-tuning is the affordable cousin of pre-training and it works because of one trick called LoRA.

The GPU readiness gap, in concrete numbers

Cisco's 2024 AI Readiness Index reported that only 13% of companies were fully ready to deploy AI at scale, with infrastructure cited as the leading blocker. The Reddit and HN threads under that survey were full of network engineers saying the same thing in different words: the data center I run cannot power the GPUs my company just bought. The 8x H100 box you have read about draws roughly 10 kW continuous. A standard rack does 6-15 kW. Cooling at that density is often liquid, not air. The PDU is a 60-amp circuit, not a 30. Most colos cannot accept these boxes without electrical rework.
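The circuit arithmetic behind the 30-amp versus 60-amp remark can be sketched as follows. The 208 V three-phase supply, the 80% continuous-load derating, and the unity power factor are assumptions typical of North American facilities, not figures from the survey; substitute your own site's values.

```python
import math

def usable_kw(volts: float, amps: float, three_phase: bool = True,
              continuous_derate: float = 0.8) -> float:
    """Continuous power a circuit can carry, in kW (unity power factor assumed)."""
    phase_factor = math.sqrt(3) if three_phase else 1.0
    return volts * amps * phase_factor * continuous_derate / 1000.0

BOX_KW = 10.0  # the chapter's rough figure for an 8x H100 server

for amps in (30, 60):
    capacity = usable_kw(208, amps)  # 208 V is an assumed supply voltage
    verdict = "fits" if capacity >= BOX_KW else "does not fit"
    print(f"{amps} A circuit: {capacity:.1f} kW usable, one 10 kW box {verdict}")
```

Under these assumptions a 30-amp circuit falls short of a single box, which is the rework the paragraph above describes.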

This matters for the chapter you are reading right now because it shapes what you can realistically build. Three brackets are useful to keep in mind:

The internet-scale pre-training bracket is what builds GPT-5 or Claude Opus. Cost: hundreds of millions of dollars, thousands of H100s, months. You will not do this. Nobody at the network-engineer-learning-AI level does this. The economics work for maybe ten companies on Earth.

The full fine-tuning bracket takes the published weights of a smaller open model — Llama-3-8B, Qwen-2.5-7B — and re-trains every parameter on your data. Cost: thousands to tens of thousands of dollars per run, an 8x A100 box for hours to days, requires dataset hygiene and ML engineering. Some enterprises do this. Most don't, because the next bracket exists.

The LoRA bracket is the bracket you actually live in. LoRA — Low-Rank Adaptation — freezes the base model and inserts small adapter matrices that get trained. You modify a few million parameters instead of a few billion. A LoRA on Llama-3-8B fits in 50-100 MB. It trains on a single consumer GPU (T4, 3090, 4090) in minutes to a couple of hours. It runs on free Colab. This is what you will do in the notebook.
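A back-of-envelope check on the 50-100 MB figure, as a sketch. The layer shapes are the published Llama-3-8B dimensions as I understand them (32 blocks, hidden size 4096, 1024-wide key/value projections, 14336-wide MLP); the rank of 16 and 16-bit storage are illustrative assumptions, and a different rank or set of target layers changes the answer proportionally.

```python
# (in_features, out_features) for each linear layer in one transformer block.
# Shapes are assumed from the published Llama-3-8B configuration.
ATTENTION = {"q_proj": (4096, 4096), "k_proj": (4096, 1024),
             "v_proj": (4096, 1024), "o_proj": (4096, 4096)}
MLP = {"gate_proj": (4096, 14336), "up_proj": (4096, 14336),
       "down_proj": (14336, 4096)}
NUM_LAYERS = 32

def lora_params(shapes: dict, rank: int, num_layers: int = NUM_LAYERS) -> int:
    """Each adapted layer adds a (rank x in) matrix and an (out x rank) matrix."""
    per_block = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers

def size_mb(params: int, bytes_per_param: int = 2) -> float:
    """Size on disk in megabytes, assuming 16-bit storage by default."""
    return params * bytes_per_param / 1e6

attention_only = lora_params(ATTENTION, rank=16)
all_linear = lora_params({**ATTENTION, **MLP}, rank=16)
print(f"attention only: {attention_only:,} params, {size_mb(attention_only):.0f} MB")
print(f"all linear:     {all_linear:,} params, {size_mb(all_linear):.0f} MB")
```

With these assumptions an adapter on every linear layer lands inside the 50-100 MB range, and an attention-only adapter is smaller still.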

The point of this chapter is: stop assuming "fine-tuning" requires the H100 cluster you don't have. The version of fine-tuning you actually need fits on hardware you already own — or on a free Colab session.

Three reasons to fine-tune at all

Before we get to LoRA, decide whether you should fine-tune in the first place. There are three legitimate reasons. Most beginner mistakes happen when people fine-tune for the wrong reason.

Reason one: vocabulary. Your domain has terms the base model has not seen often. Network protocols, vendor commands, product codes, internal hostnames — the rare tokens of your world. Fine-tuning shifts the model's embeddings of these terms so that, for example, BUM-traffic and broadcast-unknown-multicast end up close together. The model learns your dictionary. This is the cheapest, most common, most legitimate use of fine-tuning.

Reason two: style. You want outputs to follow a specific format — a particular runbook structure, a JSON schema for incident summaries, a Markdown layout for postmortems. The base model can be prompted to do this once but drifts over a long conversation. Fine-tuning lets you bake the format into the model's defaults. A 200-example dataset of (input → desired-format-output) pairs is usually enough.

Reason three: task. You want the model to perform a specific operation it does poorly out of the box. Classify a log line into one of five intent categories. Score a config diff for blast radius. Suggest the next command in a troubleshooting session. These are tasks where general fluency does not buy you reliability — the model needs to learn what good looks like for your task.

Three things fine-tuning is not for. It is not for adding knowledge — for that, use RAG (chapter 04), which is faster, cheaper, and easier to update. It is not for fixing safety problems — for that, you use guardrails at inference time, not training. It is not for making the model "better in general" — that's mostly imagined value, hard to measure, and you almost always regress on something else when you train.

Get the reason right before you spend hours on the dataset. Most teams that "tried fine-tuning and it didn't help" were fine-tuning for the wrong reason.

SFT: the simplest fine-tune you will ever do

Supervised fine-tuning is exactly what it sounds like. You build a dataset of (input, ideal_output) pairs. You show the model thousands of examples. The model's parameters shift to make ideal_output more likely whenever it sees something like input.

Here is the entire pseudocode of one training step:

1. Sample a batch of (input, ideal_output) pairs
2. Run the model on input. Get its predicted probability for each token in ideal_output.
3. Compute the loss: how wrong was the model, summed across all tokens?
4. Backpropagate: figure out which weights pushed the loss up, and which down.
5. Adjust those weights slightly in the direction that reduces loss.
6. Repeat for thousands of batches.

Everything else — learning rate schedules, gradient accumulation, mixed precision — is engineering polish that makes the above run faster or more stably. The math is one paragraph.
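The six steps above map onto code almost line for line. The sketch below is deliberately framework-free: a tiny bag-of-words softmax classifier stands in for the model so the gradient is visible, and the log lines and labels are invented for illustration. A real run would use PyTorch and a transformer, but the loop has the same shape.

```python
import math
import random

random.seed(0)

# Hypothetical toy dataset of (input, ideal_output) pairs.
DATA = [
    ("interface gi0/1 changed state to down", "link-state"),
    ("interface gi0/2 changed state to up", "link-state"),
    ("dhcp offer sent to client", "dhcp"),
    ("dhcp lease expired for client", "dhcp"),
    ("login failed for user admin", "auth"),
    ("login accepted for user ops", "auth"),
]
LABELS = sorted({label for _, label in DATA})
VOCAB = sorted({word for text, _ in DATA for word in text.split()})
# One weight per (label, word); this small table stands in for the model.
weights = {label: {word: 0.0 for word in VOCAB} for label in LABELS}

def predict_probs(text):
    """Forward pass: score each label, then softmax into probabilities."""
    words = [w for w in text.split() if w in VOCAB]
    scores = {label: sum(weights[label][w] for w in words) for label in LABELS}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}, words

def train_step(batch, lr=0.5):
    """One step: forward, loss, gradient, weight update. Returns mean loss."""
    loss = 0.0
    grads = {label: {} for label in LABELS}
    for text, target in batch:
        probs, words = predict_probs(text)             # step 2: forward pass
        loss += -math.log(probs[target])               # step 3: cross-entropy
        for label in LABELS:                           # step 4: gradient
            error = probs[label] - (1.0 if label == target else 0.0)
            for w in words:
                grads[label][w] = grads[label].get(w, 0.0) + error
    for label in LABELS:                               # step 5: small update
        for w, g in grads[label].items():
            weights[label][w] -= lr * g / len(batch)
    return loss / len(batch)

losses = []
for _ in range(50):                                    # step 6: repeat
    batch = random.sample(DATA, k=3)                   # step 1: sample a batch
    losses.append(train_step(batch))
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss equals the log of the number of classes, because an untrained model spreads its probability evenly; the later losses fall as the weights for discriminative words grow.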

For the network engineer, the leverage is your dataset. The model will only learn what you show it. Two hundred high-quality, varied examples will outperform two thousand repetitive ones. If your dataset has the same phrasing in every example, the model will overfit to that phrasing. If your dataset has labeling inconsistencies — same input labeled two different ways by two different people — the model averages them, which is usually worse than either pure version. Dataset quality is the entire game. Treat dataset construction with the same care you would treat ACL design.
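Two of the dataset problems named above, conflicting labels and repeated lines, are cheap to detect before any training. A minimal sketch, assuming the dataset is a list of (text, label) pairs; the function name and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def audit_dataset(rows):
    """Return (conflicts, duplicates) for a list of (text, label) pairs."""
    labels_by_text = defaultdict(set)
    counts = Counter()
    for text, label in rows:
        key = " ".join(text.lower().split())  # normalize case and whitespace
        labels_by_text[key].add(label)
        counts[key] += 1
    conflicts = {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}
    duplicates = {t: n for t, n in counts.items() if n > 1}
    return conflicts, duplicates

rows = [
    ("BGP neighbor 10.0.0.1 went down", "routing"),
    ("bgp neighbor 10.0.0.1 went  down", "anomaly"),  # same line, different label
    ("DHCP lease expired for client", "dhcp"),
]
conflicts, duplicates = audit_dataset(rows)
print("conflicting labels:", conflicts)
print("repeated lines:", duplicates)
```

Any entry in the conflicts report is a labeling decision the team has to settle by hand; the model cannot settle it for you.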

The notebook will walk you through a small SFT example: 100 synthetic log lines labeled into five intent categories. By the end, your tiny LoRA-on-DistilBERT classifier will outperform the base model. The numbers are not the point — the workflow is the point. The workflow is identical when you scale to 10,000 lines of real production data.

LoRA: the trick that made fine-tuning democratic

The full version of fine-tuning would adjust every weight in the model. A 7-billion-parameter model has 7 billion weights. Adjusting all of them means storing a gradient for every weight, and an optimizer such as Adam keeps two more running values per weight on top of that. Training memory ends up several times the size of the weights alone, which is why people talk about H100s with 80 GB of VRAM.

The LoRA insight, from a 2021 Microsoft paper, was this: the change you actually need is low-rank. When you fine-tune for vocabulary or style or a narrow task, you are not building a different model. You are nudging the existing model in a constrained direction. That nudge can be represented as a product of two skinny matrices — a down-projection and an up-projection — that together have a fraction of the parameters of the full weight matrix they sit alongside.

Concretely: instead of changing a 4096x4096 matrix (16.7 million parameters), you train a 4096x8 matrix and an 8x4096 matrix (65,536 parameters total). For a narrow task the effect on outputs is close to that of a full update, with 256x fewer parameters to optimize. At inference time, you can either keep the adapters separate (so you can swap them: vocabulary-A in the morning, vocabulary-B in the afternoon) or merge them back into the original weights (zero inference overhead, but you lose the swap-ability).
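The separate-versus-merged choice is easy to verify on a toy example. The sketch below uses plain Python lists and 6x6 matrices to stay dependency-free; real code would use PyTorch tensors. It also omits the scaling factor that LoRA implementations apply to the low-rank path, and it fills B with random values to imitate an adapter after training (a fresh adapter conventionally starts with B at zero).

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d, r = 6, 2               # toy sizes; the chapter's example is d=4096, r=8
W = rand_matrix(d, d)     # frozen base weight
A = rand_matrix(r, d)     # down-projection (trained)
B = rand_matrix(d, r)     # up-projection (trained)
x = rand_matrix(d, 1)     # one input column vector

# Adapter kept separate: base path plus low-rank path, swappable at runtime.
separate = add(matmul(W, x), matmul(B, matmul(A, x)))
# Adapter merged: fold B @ A into the weight once, then a single matmul.
W_merged = add(W, matmul(B, A))
merged = matmul(W_merged, x)

max_diff = max(abs(s[0] - m[0]) for s, m in zip(separate, merged))
print(f"max difference between separate and merged outputs: {max_diff:.2e}")

full, low_rank = 4096 * 4096, 4096 * 8 + 8 * 4096
print(f"full: {full:,}  low-rank: {low_rank:,}  ratio: {full // low_rank}x")
```

The two paths agree up to floating-point rounding, which is why merging costs nothing at inference time.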

The practical consequence: a LoRA training run on a small open model — DistilBERT for classification, Llama-3-8B or Qwen for generation — fits on a single T4 GPU. Free Colab gives you a T4. You can do this today, on hardware that costs zero, with code that fits on a screen. The notebook will demonstrate it.

This is the technique that closed the gap between "AI is for FAANG" and "I can run this on my laptop." Most of the open-source progress in 2024-2026 has been built on LoRA and its variants (QLoRA, which quantizes the frozen base model to 4-bit so it fits in less memory, and DoRA, which splits each weight matrix into a magnitude and a direction and applies the low-rank update to the direction). When you read "fine-tuned open-source model on consumer hardware" in 2026, you are almost always reading about LoRA.

DPO and the post-SFT phase

SFT teaches the model to imitate good outputs. It does not teach the model to prefer good outputs over bad ones. Two examples:

You SFT a model to write incident postmortems. The model now writes postmortems. Some of them are factually wrong in subtle ways. The model has no signal that distinguishes "correct postmortem" from "fluent but wrong postmortem" — it only saw the correct ones.

You SFT a model to suggest next-command in a troubleshooting flow. The model now suggests commands. Some are dangerous on production. The model has no signal that distinguishes "safe suggestion" from "wipes the routing table" — both were valid commands in some context, and your dataset cannot show every dangerous edge case.

The fix is preference learning: show the model pairs of outputs and tell it which one is better. RLHF — Reinforcement Learning from Human Feedback — was the first famous version of this trick, used to train ChatGPT. RLHF is heavyweight: you train a separate reward model, then use reinforcement learning to train the language model against the reward model. Two models, two training runs, lots of engineering.

DPO (Direct Preference Optimization) is the simpler 2023 alternative that has displaced RLHF in many fine-tuning stacks. DPO collapses the two stages into one: given a pair (preferred_output, rejected_output), it directly adjusts the model's weights to make the preferred output more likely and the rejected output less likely, measured against a frozen copy of the starting model. Results are often comparable to RLHF, with no separate reward model, one training run, and far easier debugging.
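For a single preference pair, the DPO objective fits in a few lines. This sketch assumes you already have the summed token log-probabilities of each output under the model being trained and under a frozen reference copy of it; the function name, the beta value, and the example numbers are illustrative.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of outputs."""
    # How much more likely each output became relative to the reference model.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities for one pair.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # preferred up, rejected down
print(f"untrained: {untrained:.3f}  improved: {improved:.3f}")
```

The loss falls only when the preferred output gains probability relative to the rejected one, which is the signal SFT alone never provides.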

In 2024-2026 the landscape continued to simplify. GRPO (Group Relative Policy Optimization), RLVR (Reinforcement Learning with Verifiable Rewards) — these are refinements that matter for production teams but not for the network engineer learning the field. You need to know what DPO does and when you'd reach for it. The variants are research-grade details you can absorb later.

For most network engineering use cases, you will not need DPO. SFT plus careful dataset curation gets you 90% of the way. Reach for DPO when SFT has stopped improving and you can articulate a specific preference the model is failing to learn. For example: "the model produces correct configs but they're not idempotent; safe ones (issuing the 'no' form first) should be preferred."

The decision tree: when do I fine-tune?

The hardest part of this chapter is not the training code. It is the decision before training. Here is the tree I use.

Step 1. Can a better prompt solve this? A clear system prompt with three examples will beat half of the fine-tuning runs people attempt. If you have not seriously prompt-engineered, do that first. Time investment: an afternoon. Cost: zero.

Step 2. Is this a knowledge problem (the model needs to know specific facts about your environment) or a vocabulary/style/task problem (the model needs to behave a certain way)? If knowledge, skip to RAG (chapter 04). RAG updates instantly when your data changes; fine-tuning requires a retraining run. For network operations where configs change daily, RAG almost always wins on the knowledge half.

Step 3. If it is vocabulary/style/task, do you have at least 100 high-quality examples? If no, build the dataset first. Do not start training with a thin dataset; you will overfit and conclude (wrongly) that fine-tuning doesn't work. The data is the model.

Step 4. Pick the smallest model that could plausibly solve your task. For classification: DistilBERT or RoBERTa. For generation: a 7-billion-parameter open model. For specialized tasks like log parsing: a 1-3-billion-parameter small model with a LoRA. Bigger is not better for fine-tuning — bigger models overfit faster on small datasets and cost more per inference.

Step 5. Run a baseline. Evaluate the base model on your test set before you fine-tune. Write down the number. Half the time the baseline is already good enough and you can ship without training anything. The other half the time, the gap between baseline and target tells you what to aim for.

Step 6. Train a LoRA. Evaluate against the same test set. Iterate.

The trap most people fall into is jumping straight from "I should use AI for X" to "I should fine-tune for X." Three out of four times, prompt engineering or RAG is the right answer. Fine-tune when you have a defensible reason — vocabulary, style, task — and not before.
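The baseline in Step 5 can start even simpler than running the base model: always predict the most common label. A minimal sketch with invented label lists; any model, fine-tuned or not, should beat this number before it earns a place in production.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return accuracy([most_common] * len(test_labels), test_labels)

# Hypothetical labels; in practice these come from your own train/test split.
train_labels = ["auth"] * 5 + ["dhcp"] * 3 + ["routing"] * 2
test_labels = ["auth", "dhcp", "routing", "auth", "anomaly",
               "link-state", "auth", "dhcp", "routing", "auth"]

print(f"majority baseline: {majority_baseline(train_labels, test_labels):.0%}")
```

Write this number down next to the base model's score, and evaluate every later run against the same test labels.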

Failure modes you will meet

Fine-tuning has well-known failure modes. The notebook will not let you experience all of them, but you should know their names.

Catastrophic forgetting. A model fine-tuned hard on a narrow task forgets unrelated abilities. Train a Llama-3 too aggressively on network log classification and it may lose general reasoning. Mitigation: shorter training (fewer epochs), lower learning rate, mix some general examples into your dataset. LoRA partially mitigates this because the base weights stay frozen.

Overfitting. Training too long on too small a dataset. The model memorizes the training set rather than generalizing. Symptom: training loss keeps going down, evaluation loss starts going up. Mitigation: stop training when eval loss plateaus, use early stopping, expand the dataset.
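The early-stopping rule is simple enough to sketch directly. The function name, the patience of two epochs, and the loss curve below are illustrative assumptions.

```python
def early_stop_epoch(eval_losses, patience=2, min_delta=0.0):
    """Return the index of the best epoch, scanning until patience runs out."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(eval_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical eval-loss curve: improves, bottoms out, then climbs.
eval_losses = [1.20, 0.85, 0.62, 0.58, 0.61, 0.66, 0.74]
best = early_stop_epoch(eval_losses)
print(f"keep the checkpoint from epoch {best} (eval loss {eval_losses[best]})")
```

In a HuggingFace Trainer, the transformers library's EarlyStoppingCallback plays this role, so you rarely need to write the loop yourself.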

Distribution shift. You fine-tune on yesterday's data, deploy today, and the world has moved. New product launches, new naming conventions, new attack patterns. The model is stale. Mitigation: retraining cadence (monthly is typical), or pair fine-tuning with RAG so the dynamic facts live in retrieval and the static patterns live in weights.

Reward hacking (DPO/RLHF specific). The model finds a cheap way to score high on your preference signal that has nothing to do with being actually better. Famous example: a chatbot trained on "be helpful" learned to be sycophantic. Mitigation: diverse preference data, sanity checks against an unrelated eval set, humans in the loop on a sample of outputs.

Cost surprise. At roughly $1 per A100-hour, a 4-hour training run on one A100 in the cloud costs about $4. A weeklong run on 8 A100s is 1,344 GPU-hours: about $1,350 at that same rate, and closer to $5,000 at on-demand rates near $4 per GPU-hour. The math compounds quickly. Mitigation: rigorous baselines (do you actually need 8 A100s for a week?), LoRA before full fine-tune, the smallest viable model.

None of these are dealbreakers. All of them are easier to navigate when you have done a small training run end-to-end first. That run is the notebook for this chapter.

What the notebook will give you

The notebook builds a five-class intent classifier for network log lines. Categories: link-state, dhcp, auth, routing, anomaly. We synthesize 100 labeled log lines (80 for training, 20 for evaluation), fine-tune a LoRA on DistilBERT, evaluate, and compare to the base model with no fine-tuning. The training run takes 30-60 seconds on a T4 GPU and 3-5 minutes on CPU. Either works.
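A synthetic dataset of this shape can be generated from templates. The sketch below is not the notebook's generator: the templates, field values, and function names are hypothetical, and the shuffle-then-slice split is not stratified, so class counts in the 20-line evaluation set will vary slightly.

```python
import random

TEMPLATES = {  # hypothetical templates, one list per intent category
    "link-state": ["Interface {intf} changed state to {updown}",
                   "Line protocol on {intf} is {updown}"],
    "dhcp": ["DHCPOFFER sent to {mac} on {intf}",
             "DHCP lease for {ip} expired"],
    "auth": ["Login failed for user {user} from {ip}",
             "User {user} authenticated from {ip}"],
    "routing": ["BGP neighbor {ip} went {updown}",
                "OSPF adjacency with {ip} is {updown}"],
    "anomaly": ["CPU utilization spiked on {intf} line card",
                "Unexpected traffic burst from {ip}"],
}

def fill(template, rng):
    """Fill a template with random field values; unused fields are ignored."""
    return template.format(
        intf=f"Gi0/{rng.randint(1, 48)}",
        updown=rng.choice(["up", "down"]),
        mac=":".join(f"{rng.randint(0, 255):02x}" for _ in range(6)),
        ip=f"10.{rng.randint(0, 255)}.{rng.randint(0, 255)}.{rng.randint(1, 254)}",
        user=rng.choice(["admin", "ops", "netmon"]),
    )

def make_dataset(per_class=20, seed=42):
    """Build per_class lines for each label, shuffle, and split 80/20."""
    rng = random.Random(seed)
    rows = [(fill(rng.choice(templates), rng), label)
            for label, templates in TEMPLATES.items()
            for _ in range(per_class)]
    rng.shuffle(rows)
    split = int(0.8 * len(rows))
    return rows[:split], rows[split:]

train, test = make_dataset()
print(len(train), "training lines,", len(test), "evaluation lines")
print(train[0])
```

Two templates per class is thin; more varied phrasing per category makes it harder for the classifier to memorize a template instead of learning the intent.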

By the end you will have seen:

- A real dataset constructed for a real task, with the labeling discipline that matters.
- A baseline evaluation of the base model (it will be bad: around 30-40% accuracy, not far above the 20% that random guessing gets on five classes).
- A LoRA training loop using HuggingFace transformers and the peft library.
- A second evaluation (it will be much better: 80-95% accuracy, depending on the random seed).
- The size of the resulting adapter (megabytes, not gigabytes).
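The peft side of that workflow is short. The sketch below is an assumption about how such a setup might look, not the notebook's actual code: the checkpoint name, rank, alpha, dropout, and choice of target modules are illustrative values. The imports sit inside the function so the file loads even where peft and transformers are not installed; calling it downloads the base model.

```python
LABELS = ["link-state", "dhcp", "auth", "routing", "anomaly"]

def build_lora_classifier(base_model="distilbert-base-uncased", rank=8):
    """Wrap a sequence classifier in a LoRA adapter (downloads the base model)."""
    # Imported lazily so this module loads without the heavy dependencies.
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=len(LABELS))
    config = LoraConfig(
        task_type=TaskType.SEQ_CLS,         # sequence classification task
        r=rank,                             # rank of the adapter matrices
        lora_alpha=16,                      # scaling factor (assumed value)
        lora_dropout=0.1,                   # assumed value
        target_modules=["q_lin", "v_lin"],  # DistilBERT query/value projections
    )
    return get_peft_model(model, config)
```

Calling build_lora_classifier().print_trainable_parameters() reports how small the trainable fraction is, and save_pretrained on the returned model writes only the adapter, which is where the megabyte-scale file comes from.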

You will not have built the world's best log classifier. You will have built the workflow. That workflow generalizes — to 5,000 real log lines, to 50 vendor command categories, to whatever shape your team's labeled data takes. The reason this chapter exists is to remove the intimidation that surrounds the word "fine-tuning" and replace it with a concrete, finished thing on your machine.

What comes next

Chapter 04 is RAG — retrieval-augmented generation. RAG is the alternative to fine-tuning for the knowledge problem we deferred above. Fine-tuning teaches the model how to behave; RAG gives the model what it needs to know. Most production AI systems in 2026 use both: a LoRA-tuned model for style/vocabulary/task and a RAG layer for the changing facts.

After chapter 04 you will have all three pieces — base behavior, geometry, knowledge — and the rest of the course is about composing them. Agents (chapter 05) are what you build when you wire those pieces to act. MCP and skills (chapter 06) are how you give them the tools to act on your systems. Claude Code (chapter 07) is the immediate practical container for many of these ideas in your daily work. Full-stack Python (chapter 08) is how you ship the whole thing.

For now: open the notebook. Train a tiny classifier. Watch the accuracy jump. Read the size of your adapter on disk. Internalize this: the thing you have just done would have required eight figures of investment ten years ago, an academic ML team five years ago, and a serious GPU budget two years ago. Today it ran on free Colab in 60 seconds.


Field exercise: find 200 real log lines from your environment. Label them by hand into 4-6 categories. Run the notebook on your data instead of the synthetic data. Compare accuracy. You may discover the most useful piece of automation your team will deploy this quarter.

Wrong way to use this chapter: assume that fine-tuning makes the model smarter. Right way: fine-tuning makes the model narrower — better at the thing you trained for, often slightly worse at unrelated things. Use that narrowing surgically.

