Chapter 04

LLM Apps

When the model needs to look things up

10,000 config files, no way to search them.

Hand-build a RAG pipeline over 20 device configs — chunking, embedding, retrieval, optional generation.

RAG · chunking · retrieval

The 10,000-config question

Your team manages 700 devices. Each device has a running config of roughly 800 lines. That's 560,000 lines of network configuration in your environment. You have a Confluence page that lists naming conventions. You have a Git repo of templates. You have a NetBox that knows roughly which device sits where. And you have an outage at 02:30 because someone changed a route map on a single PE router an hour ago and a downstream BGP session is now flapping every 120 seconds.

The question: can the model help me find the change?

Last chapter you would have answered, no — the model doesn't know my configs. You were right. Fine-tuning could bake the configs into the model's weights, but configs change daily and retraining nightly is absurd. Embeddings (chapter 02) could help find similar configs, but they don't give you answers.

The technique that combines them is called RAG (Retrieval-Augmented Generation). RAG is how most production AI assistants in 2026 actually work, including many that look like a single LLM call. By the end of this chapter you will know the three moving parts, will have built a tiny RAG over a synthetic config corpus, and will have seen the failure modes that bite in production.

The question this chapter answers: how do I make the model answer questions about data it has never seen, with citations to that data, without retraining?

Why RAG won, and where fine-tuning still wins

Chapter 03 ended with a decision tree that put RAG and fine-tuning on different branches. Now we can say why RAG won as the default for knowledge problems.

RAG is fresh. The retrieval step runs at query time, not at training time. New configs you pushed at 02:00 are searchable at 02:01. Fine-tuning, in contrast, bakes knowledge into weights at training time — to update, you retrain. For data that changes daily, the asymmetry is brutal.

RAG is auditable. Every answer comes with the chunks of source data that produced it. When the model says "the BGP session is configured with hold-time 30 seconds," you can click through to the exact line of the exact config it pulled. With fine-tuning, the model "knows" something but can't show you where it learned it.

RAG is cheap to update. Adding a new device to a RAG system is one embedding call per chunk and an insert into the vector store. Adding it to a fine-tuned model is a new training run. Three orders of magnitude in cost difference, possibly more.

RAG fails safely. When retrieval finds nothing, the model can be instructed to say "I don't have data on that," and a model that follows instructions well usually does. When fine-tuning fails to memorize something, the model confabulates plausibly because the next token has to come from somewhere. RAG fails loud; fine-tuning fails silent.

Fine-tuning wins in three specific places: when you want a particular style the model can't be prompted into (chapter 03 case), when you want a behavior like "always refuse to suggest destructive commands," and when you've measured that RAG isn't enough and you need both layers. In a real production stack, you usually have both: a slightly fine-tuned model for behavior, a RAG layer for facts. Most teams add RAG first because it is the lower-risk, higher-value move.

The three moving parts

A RAG system has three pieces. You should be able to name them and sketch what each does in two sentences.

Retrieval is the search piece. You have a corpus — configs, runbooks, tickets, vendor docs — chunked into searchable units. You embed each chunk into a vector (chapter 02 territory). When a query arrives, you embed the query, find the chunks whose embeddings are closest, and return those chunks. The output of retrieval is raw text excerpts, not answers.

Augmentation is the prompt-construction piece. You take the user's question, glue the retrieved chunks onto it as context, and form a single prompt that says, roughly: "You are an assistant. Use the context below to answer the user's question. If the context doesn't contain the answer, say so. Context: [retrieved chunks]. Question: [user question]." The augmentation is where you control how the model uses the context: does it have to cite? Can it ignore the context? Should it follow a specific output format?

Generation is the LLM call. The fully-constructed prompt goes to Claude, GPT, Llama, or whatever model you use. The model produces text. Crucially, the model has the context in its prompt, not in its weights — which means it sees data that was inserted seconds ago, not data it was trained on months ago.
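The three parts can be sketched as three small functions. This is a toy under stated assumptions, not the notebook's code: `retrieve` counts shared words instead of comparing embeddings, `generate` is a stub standing in for a real LLM call, and every function name here is hypothetical.

```python
import re

def tokenize(text):
    """Lowercase and split into word-like tokens; dots are kept so IPs stay whole."""
    return set(re.findall(r"[a-z0-9.\-]+", text.lower()))

def retrieve(query, chunks, k=2):
    """Retrieval: rank chunks by word overlap with the query (toy stand-in for embeddings)."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def augment(question, excerpts):
    """Augmentation: glue the retrieved excerpts and the question into one prompt."""
    context = "\n\n".join(excerpts)
    return (
        "Use the context below to answer the question. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def generate(prompt):
    """Generation: placeholder for the LLM call; it only reports the prompt size."""
    return f"[LLM would answer here; prompt is {len(prompt)} characters]"

chunks = [
    "router bgp 65001\n neighbor 10.0.0.1 remote-as 65010\n neighbor 10.0.0.1 timers 10 30",
    "interface GigabitEthernet0/1\n description uplink to core\n ip address 192.0.2.1 255.255.255.0",
    "ip access-list extended MGMT\n permit tcp any any eq 22",
]
question = "What timers are set on neighbor 10.0.0.1?"
top = retrieve(question, chunks, k=1)
prompt = augment(question, top)
print(generate(prompt))
```

Swapping the body of `retrieve` for an embedding search, or the body of `generate` for an API call, leaves the other two functions untouched. That separation is what makes each step debuggable on its own.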

Most beginners imagine that "the AI knows things." Practitioners know that the LLM is mostly a fluent reader. RAG is the trick that takes advantage of this: give the model the right thing to read, at the right time, and it will produce a useful answer. Without RAG, you're hoping the right thing was in the training data. With RAG, you put it there yourself.

Chunks: the choice nobody talks about but everyone gets wrong

The biggest practical question in RAG is how do I chunk my data? It sounds simple. It is not. Get this wrong and the rest of the system underperforms regardless of how good your embedding model is.

Three competing pressures push on chunk size.

Too small and you lose context. A chunk that contains "hold-time 30" but not "neighbor 10.0.0.1" is useless — the engineer asking "what's the hold-time on the ISP-A neighbor?" will get a chunk that matches the keyword but lacks the answer's referent. The model produces nonsense.

Too large and you lose precision. A chunk that contains an entire 800-line config is technically a match for any query about that device, but it dilutes the embedding — the vector encodes "this is a big network config" more than "this is the BGP section." Top-k retrieval returns a few of these massive chunks and the relevant signal gets buried.

Wrong boundaries and you split the meaning. Cutting a config in the middle of an ACL means one chunk has the permit lines without the deny lines, and another has the reverse. Cutting between an interface declaration and its description, ip address, and shutdown lines fragments the unit of meaning.

For network configurations the sane chunking strategy is stanza-aware. Cisco-style configs have clear boundaries: blank lines or ! separators, and indentation patterns that mark sub-blocks. Each stanza (one interface, one route-map, one ACL) becomes one chunk. Some stanzas are too long (a big route-map) and need a secondary split. Some are too short (a single hostname line) and should be merged with their siblings.
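A minimal sketch of a stanza-aware splitter, assuming IOS-style text where a non-indented line opens a stanza, indented lines belong to it, and `!` or blank lines separate stanzas. The function name `chunk_stanzas` and the `min_lines` merge rule are illustrative assumptions; the secondary split for very long stanzas is left out.

```python
def chunk_stanzas(config_text, min_lines=2):
    """Split an IOS-style config into stanza chunks.

    A non-indented line starts a new stanza; indented lines attach to the
    stanza above. '!' and blank lines act as separators. Runs of stanzas
    shorter than min_lines are merged into a single chunk.
    """
    stanzas, current = [], []
    for line in config_text.splitlines():
        if line.strip() in ("", "!"):
            if current:
                stanzas.append(current)
                current = []
        elif line[0].isspace():
            current.append(line)          # sub-command of the open stanza
        else:
            if current:
                stanzas.append(current)
            current = [line]              # new top-level stanza
    if current:
        stanzas.append(current)

    chunks, pending = [], []
    for stanza in stanzas:
        if len(stanza) < min_lines:
            pending.extend(stanza)        # collect short siblings
        else:
            if pending:
                chunks.append("\n".join(pending))
                pending = []
            chunks.append("\n".join(stanza))
    if pending:
        chunks.append("\n".join(pending))
    return chunks

config = """\
hostname pe1
ip domain-name example.net
!
interface GigabitEthernet0/0
 description to ISP-A
 ip address 192.0.2.1 255.255.255.252
!
router bgp 65001
 neighbor 192.0.2.2 remote-as 65010
 neighbor 192.0.2.2 timers 10 30
"""
chunks = chunk_stanzas(config)
for chunk in chunks:
    print(chunk)
    print("---")
```

On this sample the two single-line global commands merge into one chunk, and the interface and BGP stanzas each stay whole.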

For runbooks and prose, paragraph-level chunking with overlap (e.g., 500 tokens with 50-token overlap between adjacent chunks) is the conventional starting point. The overlap exists so that a fact that straddles a paragraph boundary isn't lost from both chunks.
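The overlap scheme can be sketched as a sliding window. The sketch below assumes the text has already been split into a list of tokens (plain strings stand in for real tokenizer output), and `chunk_with_overlap` is a hypothetical name.

```python
def chunk_with_overlap(tokens, size=500, overlap=50):
    """Return windows of `size` tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                          # the last window reached the end
    return chunks

words = [f"w{i}" for i in range(1200)]     # stand-in for a tokenized runbook
chunks = chunk_with_overlap(words, size=500, overlap=50)
print([len(c) for c in chunks])            # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])   # True: the shared 50 tokens
```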

For source code and structured documents, chunking by AST node or by document heading beats character-count chunking. The principle: respect the semantic unit. If your data has natural boundaries, use them. If it doesn't, use a moderate chunk size (300-600 tokens) with overlap.

The notebook will show stanza-aware chunking on a synthetic IOS config corpus. You'll see chunks where the meaning stays intact and chunks where it doesn't, and you'll feel the tradeoff in your hands.

Top-k retrieval and why k is itself an interesting knob

After chunks are embedded and stored, the retrieval step needs to pick which chunks to send to the model for each query. The standard approach: compute cosine similarity between the query embedding and every chunk embedding, take the top k matches.
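Brute-force top-k can be sketched in a few lines of plain Python. The three-dimensional vectors below are made up for illustration; real embeddings from a model such as all-MiniLM-L6-v2 have hundreds of dimensions, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, chunk_vecs, k=5):
    """Score every chunk against the query and return the k best as (index, score)."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

chunk_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.0, 0.0]
results = top_k(query_vec, chunk_vecs, k=2)
for index, score in results:
    print(index, round(score, 3))
```

Scanning every vector is fine for a few thousand chunks; a vector index only becomes necessary when the corpus is much larger.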

The choice of k matters more than people expect.

k=1 sends only the single best match to the model. Cheapest in tokens, but brittle: if retrieval ranks the right chunk even one position too low, the model has nothing to work with.

k=5 is the typical default. Five chunks usually contain the right one. The model sees enough context to triangulate. Cost is modest.

k=20 sends a lot of context. The model has to scan and decide which chunks are relevant before answering. Tokens cost more. Sometimes accuracy improves; sometimes it gets worse due to the "lost in the middle" problem (chapter 02 mentioned this) where the model's attention degrades on long contexts.

The optimal k depends on your data and your model. Some teams use adaptive k: return chunks in rank order until the similarity score drops below a threshold. Some use reranking: retrieve k=20 cheaply, then run a more expensive reranker model that scores each chunk for actual relevance, and keep the top 5 of those. The notebook stays simple at k=3 for clarity, but the hooks are there to scale up.
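One way to read adaptive k is as a score cutoff with a floor and a ceiling on how many chunks come back. The sketch below assumes that reading; the threshold of 0.5, the chunk IDs, and the name `adaptive_top_k` are illustrative, and a sensible threshold has to be measured on your own data.

```python
def adaptive_top_k(scored_chunks, threshold=0.5, min_k=1, max_k=20):
    """Keep chunks scoring at or above threshold, bounded by min_k and max_k.

    scored_chunks is a list of (chunk_id, similarity) pairs in any order.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    kept = [pair for pair in ranked if pair[1] >= threshold][:max_k]
    if len(kept) < min_k:
        kept = ranked[:min_k]              # always return at least min_k
    return kept

scored = [("pe2-acl", 0.41), ("pe1-bgp", 0.82), ("pe3-intf", 0.38), ("pe1-rm", 0.74)]
print(adaptive_top_k(scored, threshold=0.5))
print(adaptive_top_k(scored, threshold=0.9))   # nothing clears 0.9, so min_k applies
```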

The deeper point: retrieval is a search problem with the same shape as Google's. Recall (did the right chunk appear in the top k?) and precision (are the chunks we returned actually relevant?) trade off against each other. The same evaluation discipline applies: build a small labeled eval set (here are 20 questions, here are the chunks that contain the answers), measure recall@1, @5, @20, iterate.
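Measuring recall@k takes very little code once the eval set exists. In the sketch below, the queries, chunk IDs, and the canned `rankings` table (which stands in for a real retriever) are all invented for illustration.

```python
def recall_at_k(eval_set, retrieve, k):
    """Fraction of queries whose top-k results include at least one expected chunk."""
    hits = 0
    for query, expected_ids in eval_set:
        if any(chunk_id in expected_ids for chunk_id in retrieve(query)[:k]):
            hits += 1
    return hits / len(eval_set)

# Canned rankings standing in for a real retriever (assumption for the demo).
rankings = {
    "hold-time on ISP-A neighbor?": ["pe1-bgp", "pe2-bgp", "pe1-rm"],
    "which device has VRF red?": ["pe3-intf", "pe2-vrf", "pe1-bgp"],
    "ACL on the mgmt interface?": ["pe1-rm", "pe3-intf", "pe2-bgp"],
}
eval_set = [
    ("hold-time on ISP-A neighbor?", {"pe1-bgp"}),
    ("which device has VRF red?", {"pe2-vrf"}),
    ("ACL on the mgmt interface?", {"pe1-acl"}),
]

def fake_retrieve(query):
    return rankings[query]

for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(eval_set, fake_retrieve, k):.2f}")
```

The third query never finds its chunk at any k, which is the kind of miss that points at chunking or embedding rather than at the value of k.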

Hybrid retrieval: when embeddings alone aren't enough

Cosine-similarity-on-embeddings (called dense retrieval) is good at semantic matching: it finds chunks that mean similar things to the query. It is bad at exact-match retrieval: when the user types "BGP neighbor 10.0.0.5" they almost always want the chunk that contains literally that IP, and embeddings might rank it second behind a chunk that talks more generally about BGP.

The classical alternative is BM25, a term-frequency-based score that is good at exact matching and rare-token matching. BM25 is the default ranking function in Elasticsearch and Solr, and term-based scoring of this kind has powered search engines for decades. It's fast, deterministic, and complements dense retrieval well.
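For a feel of what BM25 computes, here is a compact sketch of the standard Okapi formula with the non-negative IDF variant. The tokenizer, the parameter defaults (`k1=1.5`, `b=0.75`), and the sample documents are assumptions chosen for illustration; a production system would use a search engine or a maintained library rather than this.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase tokens; dots, hyphens, and slashes are kept so IPs and names stay whole."""
    return re.findall(r"[a-z0-9.\-/]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in doc_tokens) / n_docs
    doc_freq = Counter()
    for toks in doc_tokens:
        doc_freq.update(set(toks))         # count documents, not occurrences
    query_terms = set(tokenize(query))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "router bgp 65001 neighbor 10.0.0.1 remote-as 65010",
    "router bgp 65001 neighbor 10.0.0.5 remote-as 65020",
    "interface Loopback0 ip address 10.0.0.50 255.255.255.255",
]
scores = bm25_scores("BGP neighbor 10.0.0.5", docs)
print([round(s, 3) for s in scores])
```

The second document wins because it is the only one containing the literal token `10.0.0.5`, and the rare token carries the highest IDF weight. The loopback config scores zero: `10.0.0.50` is a different token.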

The production-quality approach is hybrid retrieval: run both BM25 and dense retrieval in parallel, combine the rankings, then return the top k of the combined list. A simple weighted score is fine (say, 0.4 × BM25_score + 0.6 × cosine_score), provided both scores are first scaled to a common range, because raw BM25 scores are unbounded while cosine similarity is not. Hybrid retrieval beats either alone on most benchmarks. For network configs especially, where the relevant chunk often contains a specific IP, hostname, or VRF name that the user mentioned by name, hybrid is the default for any serious deployment.
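The weighted combination can be sketched as follows. One assumption is added here: both score lists are min-max scaled to the range 0 to 1 before weighting, since BM25 and cosine scores live on different scales. The input scores and function names are made up for the demo.

```python
def min_max(scores):
    """Scale scores to the range 0..1; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(bm25, cosine, w_bm25=0.4, w_dense=0.6, k=5):
    """Combine per-chunk BM25 and cosine scores; return the top k as (index, score)."""
    norm_bm25, norm_cos = min_max(bm25), min_max(cosine)
    combined = [w_bm25 * x + w_dense * y for x, y in zip(norm_bm25, norm_cos)]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return [(i, combined[i]) for i in order[:k]]

# Chunk 0 talks generally about BGP; chunk 1 contains the literal IP the user typed.
bm25 = [0.0, 3.2, 0.4]
cosine = [0.71, 0.64, 0.30]
ranked = hybrid_rank(bm25, cosine, k=3)
print([i for i, _ in ranked])              # [1, 0, 2]
```

Dense retrieval alone would rank chunk 0 first; the BM25 term is what lifts the exact-match chunk to the top.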

The notebook will not implement BM25 to keep the lesson focused. But you should know it exists and reach for it when dense retrieval misses on identifier-heavy queries.

Generation: the part where the prompt matters more than the model

The generation step is the LLM call. People obsess over which model to use here — Claude vs GPT vs Gemini vs a fine-tuned Llama. In practice, the prompt structure matters more than the model choice for RAG quality.

A bad RAG prompt:

"Here's some text and a question. Answer it."

A good RAG prompt:

"You are a network engineering assistant. Use ONLY the configuration excerpts below to answer the user's question. If the answer is not in the excerpts, say 'I don't have information on that.' Cite the excerpt number for any factual claim. Excerpts: [numbered excerpts]. Question: [user question]."

The differences matter. "Use ONLY" constrains the model from generating from its general knowledge. "Cite the excerpt number" forces grounded answers. "If the answer is not in the excerpts, say..." permits the model to fail safely. Explicit numbering of excerpts gives the model a referent it can use in citations.

For network configs specifically, you want to add: "When showing configuration commands, copy them verbatim from the excerpts. Do not invent syntax." This single line cuts hallucination on syntax-sensitive answers dramatically. The model now treats the excerpts as authoritative source rather than as suggestions.
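Assembled into code, the prompt is a string template plus a numbering step. The sketch below reuses the wording quoted above; the function name `build_rag_prompt` and the `[n]` numbering format are illustrative choices.

```python
def build_rag_prompt(question, excerpts):
    """Number each excerpt and wrap excerpts and question in the grounding instructions."""
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(excerpts, start=1)
    )
    return (
        "You are a network engineering assistant. "
        "Use ONLY the configuration excerpts below to answer the user's question. "
        "If the answer is not in the excerpts, say 'I don't have information on that.' "
        "Cite the excerpt number for any factual claim. "
        "When showing configuration commands, copy them verbatim from the excerpts. "
        "Do not invent syntax.\n\n"
        f"Excerpts:\n{numbered}\n\n"
        f"Question: {question}"
    )

excerpts = [
    "router bgp 65001\n neighbor 192.0.2.2 timers 10 30",
    "route-map ISP-A-IN permit 10\n set local-preference 200",
]
question = "What is the hold-time on the ISP-A neighbor?"
prompt = build_rag_prompt(question, excerpts)
print(prompt)
```

Printing the prompt before sending it anywhere is a cheap habit: most grounding failures are visible in the assembled text.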

For most tasks, the cheapest model that is good at instruction-following will produce a fine answer when the retrieval is correct. Claude Haiku, GPT-4.1-mini, Gemini Flash — these all work. Reach for the expensive model (Opus, GPT-5, Gemini Pro) only when the task itself requires reasoning, not just look-up.

Failure modes you will hit

RAG has its own zoo of failure modes. The notebook will let you reproduce a few.

Chunk-size sensitivity. Halve your chunk size and accuracy may drop 20 points. Double it and you may overflow the model's context window. This is the most common source of "RAG isn't working" complaints: what looks like a retrieval or model problem is almost always a chunking problem in disguise.

Embedding model mismatch. If you embed your chunks with model A and your queries with model B, similarity scores are meaningless. Less obviously, if your chunks were embedded six months ago with all-MiniLM-L6-v2 and you upgraded to all-mpnet-base-v2 for queries, every chunk in your store needs to be re-embedded. Track which embedding model produced which vectors. Treat the embedding model as part of the schema.
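Treating the embedding model as part of the schema can be as simple as storing its name and vector size next to the vectors and checking both on every write and query. The `VectorStore` class below is a hypothetical toy, not a real library; the dimensions used (384 for all-MiniLM-L6-v2, 768 for all-mpnet-base-v2) are the published output sizes of those models.

```python
from dataclasses import dataclass, field

@dataclass
class VectorStore:
    """Toy in-memory store that records which embedding model produced its vectors."""
    model_name: str
    dim: int
    rows: list = field(default_factory=list)

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(f"store holds {self.model_name} vectors, got {model_name}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, chunk_id, vector, model_name):
        self._check(vector, model_name)
        self.rows.append((chunk_id, vector))

    def check_query(self, vector, model_name):
        """Call before searching so a mismatched query fails loudly."""
        self._check(vector, model_name)

store = VectorStore(model_name="all-MiniLM-L6-v2", dim=384)
store.add("pe1-bgp", [0.0] * 384, "all-MiniLM-L6-v2")

try:
    store.check_query([0.0] * 768, "all-mpnet-base-v2")
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print("refused:", err)
```

The name check matters even when dimensions happen to match: two different models with the same vector size still produce incomparable spaces.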

Out-of-distribution queries. Your corpus is configs. The user asks "what's the company's PTO policy?" Retrieval returns the most similar three configs — none of which contain PTO information. If your prompt doesn't tell the model to refuse, it will confabulate an answer. Always include the "if not in context, say so" instruction.

Stale embeddings. Configs change. Embeddings don't auto-update. You need a process that re-embeds chunks when source data changes. Most teams batch this nightly. Production-grade systems do it on every write.
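One common way to find what needs re-embedding is to keep a content hash per chunk and compare it against the current source. This is a sketch under that assumption; the chunk IDs, the dictionary layout, and the function names are illustrative.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(current_chunks, stored_hashes):
    """Compare current chunk text against hashes recorded at embed time.

    Returns (ids to embed or re-embed, ids to delete from the store).
    """
    to_embed = [
        chunk_id for chunk_id, text in current_chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text)
    ]
    to_delete = [chunk_id for chunk_id in stored_hashes if chunk_id not in current_chunks]
    return sorted(to_embed), sorted(to_delete)

loopback = "interface Loopback0\n ip address 10.255.0.1 255.255.255.255"
current_chunks = {
    "pe1/bgp": "router bgp 65001\n neighbor 10.0.0.1 timers 10 30",   # timers changed
    "pe1/lo0": loopback,                                               # unchanged
    "pe1/rm-new": "route-map ISP-A-IN permit 10",                      # new stanza
}
stored_hashes = {
    "pe1/bgp": content_hash("router bgp 65001\n neighbor 10.0.0.1 timers 30 90"),
    "pe1/lo0": content_hash(loopback),
    "pe1/old-acl": content_hash("ip access-list extended OLD"),        # since removed
}
to_embed, to_delete = plan_reembedding(current_chunks, stored_hashes)
print("embed:", to_embed)
print("delete:", to_delete)
```

The same function serves a nightly batch and an on-write hook; only the trigger differs.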

Context bloat. Top-k=20 of 800-token chunks is 16,000 tokens of context per query. At scale, this is a real cost. Reranking lets you keep recall high while dropping tokens. Quantizing the embedding store (FAISS supports this) lets you fit more vectors in RAM.
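The arithmetic is worth making explicit, and a token budget is one simple guard. The sketch below assumes chunks arrive best-first with a known token count; the 4,000-token budget and the name `pack_to_budget` are illustrative.

```python
def pack_to_budget(ranked_chunks, max_tokens):
    """Keep the best-ranked chunks until the next one would exceed the budget.

    ranked_chunks is a list of (chunk_id, token_count), best first.
    """
    kept, used = [], 0
    for chunk_id, token_count in ranked_chunks:
        if used + token_count > max_tokens:
            break
        kept.append(chunk_id)
        used += token_count
    return kept, used

ranked = [(f"chunk-{i}", 800) for i in range(20)]
total = sum(tokens for _, tokens in ranked)
print(total)                                # 16000 tokens if all 20 are sent

kept, used = pack_to_budget(ranked, max_tokens=4000)
print(len(kept), used)                      # 5 4000
```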

Eval debt. Without an evaluation set, you cannot tell if a change to chunking, embedding, or prompt helps or hurts. Build the eval set early. Twenty hand-labeled (query, expected-chunk) pairs are enough to start; you'll add to it as the system matures.

None of these failures is catastrophic. All of them are tractable once you recognize them. The notebook will give you the recognition; production gives you the practice.

Why this is the chapter that unlocks the rest of the course

The remaining chapters of this course — agents, MCP & skills, Claude Code, full-stack Python — all build on what you learned here. An agent (chapter 05) is, in one framing, a RAG system that can call tools. An MCP server (chapter 06) is a way to make your operational systems queryable by RAG. Claude Code (chapter 07) is a RAG system you can build by writing a single CLAUDE.md file. A full-stack Python app with AI (chapter 08) is a UI around a RAG system.

Once you have built one RAG by hand (chunked, embedded, retrieved, augmented, generated), the abstractions in the later chapters will feel familiar. Without that hands-on experience, the later chapters feel like magic. With it, they feel like composition.

This is also the chapter where most network engineers, in my experience, have their click moment. The first time you query your own configs by natural language and get a correct answer with citations, the field stops feeling like science fiction and starts feeling like tooling you'd reach for at 2 AM. That click is worth more than any specific technique in this course.

What the notebook will give you

The notebook walks through five steps. Two are interesting; three are plumbing.

Step 1: build a corpus. Twenty synthetic Cisco IOS configs, one per device. Realistic stanzas — interfaces, ACLs, route-maps, BGP neighbors. Hardcoded so the notebook is self-contained.

Step 2: chunk. Stanza-aware splitter. You'll see the chunk boundaries land in sensible places, and a few places where they don't.

Step 3: embed. Same all-MiniLM-L6-v2 from chapter 02. Stored as a NumPy array (no FAISS dependency — we keep it minimal). You'll see the embedding matrix shape.

Step 4: query. Five sample queries that exercise different retrieval failure modes. "What's the hold-time on the ISP-A BGP neighbor?" should retrieve cleanly. "Which device has VRF 'red' configured?" should also retrieve cleanly. "What's the company's password policy?" should retrieve garbage and let us discuss out-of-distribution behavior.

Step 5: generate (optional). If you have an Anthropic API key, the final cell calls Claude with the retrieved chunks as context and shows the grounded answer with citations. If you don't, we print the augmented prompt — you can see exactly what would be sent.

By the end you will have hand-built the full RAG loop. The next time you read about Pinecone or Weaviate or LangChain or LlamaIndex, you'll know they are wrappers around the same five steps, with more knobs and more scale.

What comes next

Chapter 05 is agents. Agents are the layer that sits above RAG — the model not only reads context, it decides what context to fetch, then iterates. Where RAG is one lookup, an agent is a loop of lookups, tool calls, and decisions. The decision-making is where things get powerful and where they get dangerous, which is why blast-radius patterns and dry-run conventions matter.

For now: run the notebook. Watch the retrieval find your synthetic ISP-A config. Read the augmented prompt. Internalize that the LLM is the easy part — the data plumbing is the hard part, and you just did it. That asymmetry is the most useful frame for the rest of the course.


Field exercise: export a single switch's running config. Paste it into the notebook's device_configs variable as one entry. Run the rest of the cells unchanged. Ask the notebook five questions about your switch. You may discover the bones of your team's first internal AI tool.

Wrong way to use this chapter: treat RAG as a turnkey black box. Right way: see RAG as five steps you can debug one at a time, so that when something goes wrong you know which step is hurting your accuracy.


Pain anchored: T1 (alert/log sprawl) + T5 (tooling fragmentation, no unified query). RAG is the technique that gives you the "unified query" experience without forcing the underlying tools to standardize.