Deep Learning · vExpertAI Academy

The question that broke TF-IDF

Last chapter ended with a problem we couldn't fix. K-means and TF-IDF can group syslog lines that look alike. They cannot group lines that mean the same thing in different words.

Show your engineer two log lines:

%LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/0/3, changed state to down
%LINK-3-UPDOWN: Interface GigabitEthernet0/0/3, changed state to down

She knows these describe the same event. Same port, same time, different message format. To TF-IDF, the words are mostly different: LINEPROTO is not LINK, Gi0/0/3 is not GigabitEthernet0/0/3. The vectors end up far apart. The clustering algorithm puts them in different bins.
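To make the failure concrete, here is a minimal sketch of TF-IDF written with the standard library only. The first two lines are the ones above; lines C and D are made-up neighbours (same formats, a different port, state "up") added so that IDF has a small corpus to work with. The tokenization and the smoothed IDF formula are assumptions for illustration, not what any particular library does.

```python
import math
from collections import Counter

# A and B come from the text. C and D are hypothetical extra lines.
A = "%LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/0/3, changed state to down"
B = "%LINK-3-UPDOWN: Interface GigabitEthernet0/0/3, changed state to down"
C = "%LINK-3-UPDOWN: Interface GigabitEthernet0/0/7, changed state to up"
D = "%LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/0/7, changed state to up"
corpus = [A, B, C, D]

def words(line):
    # Lowercase, split on whitespace, trim commas and colons at the edges.
    return [w.strip(",:").lower() for w in line.split()]

def tfidf_vectors(lines):
    docs = [Counter(words(line)) for line in lines]
    n = len(docs)
    # Document frequency: in how many lines does each term appear?
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: count * idf[t] for t, count in doc.items()} for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

vec_a, vec_b, vec_c, vec_d = tfidf_vectors(corpus)
same_event = cosine(vec_a, vec_b)    # same port, same event, different format
same_format = cosine(vec_a, vec_d)   # different port, opposite state, same format
print(f"A vs B (same event):  {same_event:.2f}")
print(f"A vs D (same format): {same_format:.2f}")
```

On this toy corpus, line A scores closer to D (another port coming up) than to B (the same port going down). The format wins over the event, which is the behaviour described above.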

This chapter is about the family of models that fixed that.

The question it answers: how does a machine learn that "interface down" and "port flapping" mean the same thing — when those phrases share zero words?

The answer is a particular trick called embeddings, produced by a particular kind of model called a transformer. By the end of this chapter you will know what they are, why they work, and where they will still fail you. We won't train one. We'll look at one and watch it do its work.

The pain that pulled the field forward

Two pieces of news from the production world tell you why the industry now cares about transformers — beyond the chatbot hype.

The first is private and personal. Every network engineer who has tried to paste a real config into Claude or GPT and ask "what's wrong here?" has watched the model confidently invent a syntax that doesn't exist. interface description "load-balanced" becomes interface description "load-balance", an Arista command turns into a Cisco one, an ACL gets rewritten with a directive that compiles on no platform. The model is fluent. It is not correct. Engineers learn to distrust it for syntax-sensitive work and to use it only for prose.

The second is public. In July 2024, Cloudflare reported that AI training crawlers — GPTBot, Bytespider, ClaudeBot — were generating tens of billions of requests per day against origin servers. By 2025 some site operators were seeing 30-70% of their traffic come from these bots, often hammering pages no human visits. The networking team's job got harder overnight: distinguish real users from bots that look like real users. Pattern matching on user-agent strings stops working when bots forge them. You need something that recognizes intent, not syntax.

Both problems share one shape: the syntax is unreliable; the meaning is what matters. TF-IDF is a syntax tool. Transformers are a meaning tool. That's the whole reason this chapter exists.

The three ideas you need

A transformer is not one idea. It's three, stacked. You should be able to name them and roughly say what each does. The notebook lets you see them.

Tokens are how the model chops text. Not into words, but into sub-word fragments chosen by an algorithm such as BPE (byte-pair encoding). The word GigabitEthernet0/0/1 is not one token to a typical model. It is usually somewhere between six and nine, depending on the tokenizer. Each token gets a numeric ID. Text becomes a sequence of IDs.
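A rough sketch of what "chopping into fragments" looks like. The vocabulary below is invented, and the greedy longest-match lookup is a simplification (closer to how WordPiece applies a vocabulary than to how BPE applies its merge rules). It is only meant to show one word turning into a list of IDs.

```python
# Hypothetical vocabulary: fragment -> token ID.
# Real vocabularies hold tens of thousands of learned fragments.
VOCAB = {"Gig": 0, "abit": 1, "Ethernet": 2, "0": 3, "/": 4, "1": 5, "<unk>": 6}

def tokenize(text, vocab):
    """Greedy longest-match split of text into known fragments."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No fragment matched: emit an unknown marker and move on.
            tokens.append("<unk>")
            i += 1
    return tokens

tokens = tokenize("GigabitEthernet0/0/1", VOCAB)
ids = [VOCAB[t] for t in tokens]
print(tokens)
print(ids)
```

With this made-up vocabulary the interface name comes out as eight fragments. The model never sees "GigabitEthernet0/0/1" as a unit, only the ID sequence.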

Embeddings are what tokens become inside the model. Each token ID is looked up in a giant table that returns a vector of, say, 768 numbers. Similar tokens — synonyms, alternative phrasings, related terms — point in similar directions in this 768-dimensional space. The geometry encodes meaning. This is the core trick. "Interface" and "port" sit close. "Down" and "fail" sit close. "Cisco" and "Arista" sit close. The model learned these geometries by reading internet-scale text and being asked, billions of times, to predict missing words.
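The lookup itself is nothing more than indexing into a table. In the sketch below the table has four rows and four dimensions, and every number is hand-written for illustration. A real table is learned during training and is far larger.

```python
import math

# Hypothetical token IDs and a hypothetical 4-dimensional embedding table.
TOKEN_IDS = {"interface": 0, "port": 1, "down": 2, "banana": 3}
EMBEDDING_TABLE = [
    [0.9, 0.1, 0.0, 0.2],  # interface
    [0.8, 0.2, 0.1, 0.1],  # port
    [0.0, 0.9, 0.3, 0.0],  # down
    [0.1, 0.0, 0.1, 0.9],  # banana
]

def embed(token):
    # Token -> ID -> row of the table.
    return EMBEDDING_TABLE[TOKEN_IDS[token]]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

related = cosine(embed("interface"), embed("port"))
unrelated = cosine(embed("interface"), embed("banana"))
print(f"interface vs port:   {related:.2f}")
print(f"interface vs banana: {unrelated:.2f}")
```

The rows for "interface" and "port" were written to point the same way, so their cosine is high. In a trained model nobody writes those rows; the prediction task pushes them together.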

Attention is the mechanism that lets a token look at other tokens in the same sequence and decide which ones matter for it. When the model is processing the word "it" in "the router rejected the BGP session because it had a config error", attention is what lets the model figure out that "it" refers to "session", not "router". Each layer of a transformer is an attention block followed by a small feed-forward network. Modern models stack dozens to hundreds of these layers.
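Here is a small sketch of scaled dot-product attention for a single token, using only the standard library. The two-dimensional vectors are hand-picked so that the query for "it" lines up with the key for "session". In a real model the queries, keys, and values come from learned projections of the embeddings.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the weighted average of the value vectors.
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

tokens = ["router", "session", "it"]
# Hypothetical key and value vectors, one per token.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [0.0, 2.0]  # hypothetical query vector for the token "it"

weights, output = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:8s} {weight:.2f}")
```

The weights always sum to one, and here most of the weight lands on "session". That weighted mix is what gets passed up to the next layer as the new representation of "it".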

That's it. Tokens are how text enters. Embeddings are how text becomes geometry. Attention is how the geometry interacts with itself to produce understanding. Everything else — context length, fine-tuning, RLHF, mixture-of-experts — is variation on these three.

Why hallucination is mathematical, not moral

People talk about LLM hallucination like the model is being naughty. It isn't. The model has one job and it does it: predict the most likely next token, given everything before. Always. It cannot decline. It cannot admit it doesn't know. It can be trained to say "I don't know" — but at the base level, every token comes from a probability distribution over the whole vocabulary, and the model picks one.

If you ask a model for a Cisco command it has seen a thousand times in training, the most likely next tokens form the correct command. If you ask for a command it has seen ten times in training, scattered across documentation pages, blog posts, and forum discussions that disagree with each other, the model still has to produce something — and what it produces is a blend of those scattered examples, with the most common patterns winning. The model doesn't know that this blend is invalid syntax. It only knows that these tokens are the most likely given what came before.
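A sketch of that last step, with invented numbers. The scores (logits) below are hypothetical, and the picker is greedy (it always takes the top token), whereas deployed systems usually sample. The point it tries to show is that a flat distribution still yields an answer.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(logits_by_token):
    """Greedy decoding: return the most probable token and its probability."""
    tokens = list(logits_by_token)
    probs = softmax([logits_by_token[t] for t in tokens])
    best = max(range(len(tokens)), key=lambda i: probs[i])
    return tokens[best], probs[best]

# Hypothetical logits for a command seen a thousand times in training.
well_known = {"shutdown": 6.0, "shut": 2.0, "disable": 1.0, "halt": 0.5}
# Hypothetical logits for a command seen ten times, in sources that disagree.
rare = {"shutdown": 1.1, "shut": 1.0, "disable": 1.0, "halt": 0.9}

confident_token, confident_prob = next_token(well_known)
unsure_token, unsure_prob = next_token(rare)
print(confident_token, round(confident_prob, 2))
print(unsure_token, round(unsure_prob, 2))
```

Both calls return a token. In the second case the winner holds barely more than a quarter of the probability mass, but nothing in the output tells you that unless you go looking for it.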

This is why config hallucination happens on rare commands but not on common ones. The math is the same in both cases. The training data isn't. A model cannot know what it doesn't know. When we get to RAG (chapter 04) and fine-tuning (chapter 03), we'll see two different ways to fix this — but both fixes work by giving the model more data, not by giving it humility.

This matters for network engineers more than for most users. Network configurations are syntax-strict. A missing comma fails the compile. A typo on an ACL bricks the firewall. The whole class of problems the LLM is good at — writing fluent prose, summarizing, brainstorming — is the wrong shape for the work. The class of problems where it hallucinates — exact syntax, version-specific commands, vendor-specific gotchas — is exactly your work. Knowing this from the math (rather than learning it from a 2 AM outage) is the first defense.

Decoder-only, encoder-only, and the difference that matters

You will read about three transformer "shapes" in the literature. They are not separate technologies; they are the same math wired up differently.

Decoder-only is GPT-style. The model reads tokens left-to-right and predicts the next one. ChatGPT, Claude, Llama, Gemini — all decoder-only. They are the right tool when you want generation: write me a config, summarize this incident, draft this runbook.

Encoder-only is BERT-style. The model reads tokens in both directions and produces an embedding for each one. BERT, RoBERTa, and most models shipped through the sentence-transformers library are encoder-only. They are the right tool when you want understanding without generation: cluster these logs, find similar tickets, classify intent. They cannot write. They can only measure.

Encoder-decoder is T5-style and original-paper-style. Reads input, generates output. Used in translation, summarization, structured generation. You will rarely encounter it directly in 2026 — most production work has moved to decoder-only with system prompts.
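One way to picture "same math wired up differently" is the attention mask: a grid saying which positions each token is allowed to look at. The sketch below builds both kinds. The function names are made up for this example.

```python
def causal_mask(n):
    # Decoder-style: token i may attend to itself and to earlier tokens only.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # Encoder-style: every token may attend to every other token.
    return [[1] * n for _ in range(n)]

tokens = ["interface", "went", "down"]

print("decoder-only (left-to-right):")
for token, row in zip(tokens, causal_mask(len(tokens))):
    print(f"  {token:10s} {row}")

print("encoder-only (both directions):")
for token, row in zip(tokens, bidirectional_mask(len(tokens))):
    print(f"  {token:10s} {row}")
```

An encoder-decoder model uses both: the bidirectional grid on the input side and the causal grid on the output side.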

The practical takeaway: when you want meaning (find similar configs, group tickets by intent, detect that a log line is unusual), reach for a sentence-transformer encoder model. Cheap, fast, CPU-OK. When you want generation (write the runbook, suggest the next command), reach for Claude or GPT. Slow, expensive, hallucination-prone. The right model is the smallest one that solves your problem. Using GPT-5 to compute embeddings is like renting a crane to hammer a nail.

What you'll actually see in the notebook

The notebook walks you through three exercises. None of them require a GPU. None of them require an API key.

First, you'll tokenize Cisco IOS configs with three different tokenizers — GPT-2's, BERT's, and a llama-family tokenizer. You'll see that they all chop network-specific terms badly. GigabitEthernet0/0/1 becomes 6-9 tokens depending on which model. ip access-list extended ACL-BLOCK-RFC1918 becomes another dozen. The exercise makes a point: the model is reading your config as broken fragments, not as the structured object you wrote.

Second, you'll embed a list of paired log lines — "interface down" / "port flapping", "high CPU" / "router overloaded", "packet drops" / "loss observed" — and compute pairwise cosine similarities. You'll see that pairs the model considers synonymous sit at 0.7+ similarity, while random pairs sit at 0.3-0.5. This is the moment TF-IDF could not have given you. Two log lines with zero shared words now appear close in vector space.
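The mechanics of that second exercise can be sketched without a model. The three-dimensional vectors below are hand-written stand-ins for what an encoder would return (real ones have hundreds of dimensions), so the scores they produce mean nothing. Only the pairwise-and-sort pattern carries over.

```python
import math
from itertools import combinations

# Hypothetical embeddings. In the notebook these come from an encoder model.
embeddings = {
    "interface down":    [0.9, 0.1, 0.1],
    "port flapping":     [0.8, 0.2, 0.0],
    "high CPU":          [0.1, 0.9, 0.1],
    "router overloaded": [0.2, 0.8, 0.2],
    "packet drops":      [0.1, 0.1, 0.9],
    "loss observed":     [0.0, 0.2, 0.8],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Score every unordered pair, then sort with the most similar pair first.
ranked = sorted(
    ((cosine(embeddings[a], embeddings[b]), (a, b))
     for a, b in combinations(embeddings, 2)),
    reverse=True,
)

for score, (a, b) in ranked[:3]:
    print(f"{score:.2f}  {a!r} / {b!r}")
```

The same loop works unchanged on real embeddings: swap the dictionary for the model's output and read the top of the list.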

Third, you'll project the embeddings down to 2D with the same PCA trick from chapter 01, and plot them. Synonyms will cluster visibly. You'll see the geometry that everyone talks about, with your own eyes. That single plot is, I think, the most important thing in this chapter — because it makes the abstraction physical.

Where transformers will still fail you

I owe you the failure modes. The notebook will show some of them; field experience will show the rest.

Domain shift is brutal. A model trained on internet text understands "interface" as a UI concept three times out of four. It understands "router" as a wood tool one time out of ten. When you say "the GigE interface flapped after the WRED policy change," a base model has roughly the comprehension of a smart liberal-arts grad. Useful, but missing context. Fine-tuning (chapter 03) is how you fix this. So is RAG (chapter 04). Neither is a one-line solution.

Tokenizers are not your friend. As you'll see in the notebook, a model that splits your interface name into seven fragments cannot reason about it as one thing. This is why "AI for log parsing" demos look impressive but production deployments struggle — production logs are full of identifiers (IPs, MACs, UUIDs, port numbers) that fragment unhelpfully. Embedding models trained on code do better here than ones trained on prose, but the floor is still uncomfortable.

Context length is not infinite. When you read about a "200K context window," what you actually have is a window in which the model can attend, but attention quality is not uniform across it. Place an important fact 100K tokens deep in a 200K-token prompt and the model will miss it more often than if you place it at the top or at the very end. This is sometimes called the "lost in the middle" problem. For network work where you might want to feed in a 30-device topology, this matters: don't dump and pray. Structure.

Embeddings are direction-only. Cosine similarity measures the angle between two vectors, not the distance. "Big disaster" and "small disaster" point in roughly the same direction. Embeddings will not tell you which one happened. They tell you the topic. If you need magnitudes — actual error counts, ticket priorities, blast radius — you need classification or regression on top of embeddings, not embeddings alone.
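The angle-versus-distance point takes a few lines to check. The two vectors below are arbitrary: the second is the first multiplied by ten, so they point the same way but sit far apart.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

small = [1.0, 2.0]     # arbitrary example vector
big = [10.0, 20.0]     # same direction, ten times the length

angle_score = cosine(small, big)
distance = math.dist(small, big)  # Euclidean distance

print(f"cosine similarity:  {angle_score:.3f}")
print(f"euclidean distance: {distance:.3f}")
```

Cosine similarity reports the two as identical while the distance between them is large. Any information carried by vector length is invisible to a cosine comparison.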

Bias propagates. A model trained on Reddit, GitHub, and StackOverflow has the engineering culture's biases baked in. It will favor solutions popular among hobbyist sysadmins over enterprise practice. It will know more about consumer routers than about service provider gear. If your job is the boring half of the industry, the model is half-blind to your work. Be aware. Verify. Don't trust.

Two ways to use embeddings tomorrow

This chapter is not just background reading. Two concrete things you can build with sentence-transformer embeddings, in production, this week:

Semantic ticket deduplication. Take the title and first paragraph of every incoming ticket. Embed it. Compute cosine similarity against the last 30 days of tickets. If similarity > 0.85 against an open ticket, suggest the new one is a duplicate. If similarity > 0.85 against a closed ticket, suggest the resolution from history. The model that's good enough for this fits in 100 MB and runs on a Raspberry Pi. You don't need GPT-5.
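The decision logic can be sketched as below. The ticket IDs, the three-dimensional vectors, and the choice to act only on the single best match are all assumptions for illustration. The 0.85 threshold is the one from the text and should be tuned against your own queue.

```python
import math

DUPLICATE_THRESHOLD = 0.85  # from the text; tune on real tickets

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triage(new_vector, history):
    """Compare a new ticket against past tickets and suggest an action.

    history: list of dicts with keys "id", "status" ("open" or "closed"), "vector".
    """
    if not history:
        return ("new", None)
    best = max(history, key=lambda t: cosine(new_vector, t["vector"]))
    if cosine(new_vector, best["vector"]) <= DUPLICATE_THRESHOLD:
        return ("new", None)
    if best["status"] == "open":
        return ("possible duplicate", best["id"])
    return ("suggest past resolution", best["id"])

# Hypothetical last-30-days history with stand-in embeddings.
history = [
    {"id": "T-101", "status": "open",   "vector": [0.9, 0.1, 0.1]},
    {"id": "T-087", "status": "closed", "vector": [0.1, 0.9, 0.2]},
]

print(triage([0.8, 0.2, 0.0], history))
print(triage([0.2, 0.8, 0.2], history))
print(triage([0.1, 0.1, 0.9], history))
```

A suggestion is all this returns. Leaving the final call to the engineer keeps a bad threshold from silently closing real tickets.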

Runbook search. Embed every paragraph of every runbook in your org. Embed an engineer's natural-language query ("VPN keeps dropping every 3 hours"). Return the top 5 matching paragraphs. This is the beginning of RAG, which chapter 04 will make rigorous. Done crudely, it still beats keyword search on internal wikis — because nobody titles their runbook with the exact phrase someone searches for at 2 AM.
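The retrieval step is a top-k lookup by cosine similarity. In this sketch the runbook paragraphs and their vectors are invented, and k is set to 2 rather than 5 so the toy output stays readable.

```python
import heapq
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_vector, paragraphs, k=5):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = ((cosine(query_vector, vector), text) for text, vector in paragraphs)
    return heapq.nlargest(k, scored)

# Hypothetical runbook paragraphs with stand-in embeddings.
paragraphs = [
    ("VPN tunnel drops on a timer: compare rekey lifetimes on both peers.", [0.9, 0.1, 0.0]),
    ("Rotating shared admin credentials after staff changes.",              [0.0, 0.2, 0.9]),
    ("Tunnel flaps at fixed intervals: check keepalive settings.",          [0.8, 0.3, 0.1]),
    ("Replacing a failed power supply on the core chassis.",                [0.1, 0.9, 0.2]),
]

# Stand-in embedding for the query "VPN keeps dropping every 3 hours".
query_vector = [0.85, 0.2, 0.05]

results = top_k(query_vector, paragraphs, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

For a few thousand paragraphs, scoring every one on each query like this is usually fast enough. An index only becomes worth the trouble at much larger scale.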

Both of these are cheap. Both work. Both could be running in your environment by Friday. The reason most teams haven't built them is not technology (the libraries are mature, the models are free) but pattern recognition. Until you see embeddings work once, you don't reach for them. The notebook is where you see it once.

What comes next

This chapter taught you what transformers are and what they're for. The next two chapters are about making them yours.

Chapter 03 is fine-tuning: how to take a pretrained model and bend it toward your network's vocabulary, your team's conventions, your specific problem. It's where the GPU readiness gap shows up, and where you'll learn what hardware you actually need to participate.

Chapter 04 is LLM applications: how to combine retrieval with generation so the model has facts to ground its answers in. Retrieval-augmented generation (RAG) is the technical name for "stop the LLM from hallucinating about my network." When chapter 04 ends, you'll be able to explain why a retrieval-augmented system can answer a question about a config it has never seen, and what the failure modes of that system are.

For now: run the notebook. Watch the synonyms find each other. Notice how broken the tokenization looks. Sit with the discomfort that the most powerful tool in the industry sees your work as a sequence of fragments, not as the structured artifact you wrote. That discomfort is correct. Working around it is the rest of the course.

Field exercise: take 50 tickets from your real ticket queue. Embed them with the notebook's model. Compute pairwise cosine similarity. Sort by similarity, descending. Read the top 10 pairs. How many of them are duplicates that were closed independently? You may discover a productivity improvement worth shipping.

Wrong way to use this chapter: treat transformers as magic. Right way: notice that they are a particular trick (embeddings → attention) that gives the model a particular superpower (geometric understanding of meaning) with particular failure modes (hallucination, domain shift, broken tokenization on identifiers). Use the superpower where it fits. Avoid it where it doesn't.

Source for the crawler traffic claim: Cloudflare blog post on AI bots and request volume.