Chapter 01

Foundations

When the math is enough

It's 3 AM and 14,000 log lines.

Cluster 200 syslog lines into 5 groups; surface the anomalies with k-means + TF-IDF.

python, sklearn, TF-IDF, k-means

The 3 AM question

It's 03:14. PagerDuty fires. Severity 2. The customer says "something is broken in the network, please look." The SIEM dumps 14,728 syslog lines from the last hour into your terminal. You have one coffee.

You don't have time to read 14,728 lines. You don't even have time to grep, because you don't know what to grep for. The customer didn't tell you. The customer doesn't know either; they just see a slow website.

The question this chapter answers: can a piece of math read those 14,728 lines for me and point at the weird ones?

The answer is yes, and it doesn't need an LLM. Pre-2010 math is enough. That's what this chapter is — the math you should reach for before you reach for Claude.

Why not just ask the LLM

A reasonable first thought: paste the lines into Claude, ask "what's wrong here?" There are three reasons that's the wrong move.

It's slow. 14k lines is a long prompt. Claude takes seconds; classical clustering on the same data takes milliseconds. At 03:14, that gap is the difference between fixing it and writing a postmortem.

It's expensive. Every paged engineer doing this every shift adds up. Anomaly detection on log streams is a continuous job, not a one-off question. You want something that runs on the syslog server itself, not a remote API.

It hallucinates. An LLM will confidently invent a story to explain logs it doesn't recognize. A k-means cluster of 5,000 points has no story to invent. It groups, you look, you decide. The reasoning stays in your head, where it belongs.

Classical ML is the right hammer for this nail. The rest of this chapter is about which version of the hammer you reach for, and when.

The three pillars

Three families of classical ML cover most of what you'll need before chapter 03. You should be able to name them without thinking.

Clustering is unsupervised. You give it a pile of things, no labels, and it groups what looks similar. For our 14k syslog lines, this is the answer — you don't have labels, you just have lines. K-means and DBSCAN are the two you'll meet.

Classification is supervised. You give it a pile of labeled things ("this line is a security event, this one is a routine info message"), and it learns to label new things the same way. You'll use this when you've spent a quarter building a labeled dataset and now want to auto-triage.
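To make the supervised idea concrete, here is a minimal sketch, assuming scikit-learn is installed. The six sample lines, the two label names ("security" and "routine"), and the choice of logistic regression are all made up for illustration; a real labeled dataset would be far larger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled lines (illustrative only, not real data).
train_lines = [
    "SSH authentication failure for user admin from 203.0.113.7",
    "SSH authentication failure for user root from 198.51.100.4",
    "Login failed for user guest from 203.0.113.9",
    "Interface Gi0/0/1 changed state to up",
    "DHCPD assigned lease 10.1.0.15 to client",
    "Interface Gi0/0/2 changed state to up",
]
train_labels = ["security", "security", "security",
                "routine", "routine", "routine"]

# Text -> numeric features -> classifier, chained as one model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_lines, train_labels)

# A line the model has not seen; it gets one of the two known labels.
new_line = "SSH authentication failure for user admin from 192.0.2.55"
predicted = model.predict([new_line])[0]
print(predicted)
```

The point of the contrast with clustering: this only works because someone supplied train_labels first.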

Regression is supervised too, but instead of a label, it predicts a number. Useful for "given today's traffic pattern, what's the load at 9 AM tomorrow?" — capacity planning territory. We'll come back to this in the time-series chapter.

For tonight, the syslog flood, clustering wins.

K-means in sixty seconds

I'll explain k-means without math first, then as a four-step procedure, in that order. If you only read the next paragraph, you'll still get it.

Imagine you're sorting 200 marbles into 5 piles by color, blindfolded. You put 5 plates on the table at random spots. You pick up marbles one by one and place each on the plate closest to it. After all 200 marbles are placed, you slide each plate to sit at the center of its current pile. Then you re-sort — some marbles are now closer to a different plate, so they move. You re-center the plates again. Repeat until nothing moves. The plates are now sitting on the natural color centers.

That's k-means. The "5 plates" are centroids. The "color" is whatever set of numbers describes each marble. Math version:

1. Pick k initial centroids at random
2. Assign each point to its nearest centroid
3. Move each centroid to the mean of its assigned points
4. If anything moved, go to step 2

That's the whole algorithm. The only knob is k — how many piles you want. Choosing k is the hard part, and we'll come back to it.
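The four steps above can be sketched in a few lines of NumPy. This is a teaching sketch, not what scikit-learn does internally: the function name kmeans, the toy 2-D data, the iteration cap, and the handling of empty clusters are my assumptions.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means following the four steps (teaching sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no point changed its assignment.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two well-separated blobs of 2-D "marbles".
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(np.bincount(labels))   # size of each pile
print(centroids.round(2))    # where the plates ended up
```

The result depends on the random starting centroids, which is why library implementations usually run several restarts and keep the best one.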

The marble analogy hides one trick: in our case, the "marbles" are syslog lines. Syslog lines are text. K-means needs numbers. So we need to turn text into numbers first.

Turning text into numbers

This is where most beginners stumble. Words aren't math. You can't take the mean of "Interface GigabitEthernet0/0/1 is up" and "BGP neighbor 10.0.0.1 Down".

The trick is called TF-IDF: term frequency, inverse document frequency. It builds a vector for each line.

Think of every distinct word as a column. Every syslog line is a row. The number in each cell is "how often does this word appear in this line, weighted by how rare it is across all lines." Common words like "is" and "the" get small numbers. Rare words like "BGP" or "OSPF" get bigger numbers. Very rare words, like a specific IP that appears once, get the biggest numbers.

The result: each syslog line is now a vector of ~500 numbers (one per distinct word). K-means doesn't care that those numbers came from words. It just sees vectors. It clusters them.
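Here is a small sketch of that row-and-column picture using scikit-learn's TfidfVectorizer. The four sample lines are invented, and token_pattern=r"\S+" (split on whitespace only) is my choice so that interface names and IPs stay whole; the default pattern would break them apart.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample lines (illustrative only).
lines = [
    "Interface GigabitEthernet0/0/1 is up",
    "Interface GigabitEthernet0/0/2 is up",
    "Interface GigabitEthernet0/0/3 is down",
    "BGP neighbor 10.0.0.1 Down",
]

# Whitespace tokenization is an assumption of this sketch.
# Tokens are lowercased by default.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(lines)

print(X.shape)  # (number of lines, number of distinct tokens)

# Weights for the first line, one entry per vocabulary column.
first_row = X[0].toarray().ravel()
columns = vectorizer.vocabulary_  # maps token -> column index
for token in ["is", "up", "gigabitethernet0/0/1"]:
    print(f"{token:>22}  {first_row[columns[token]]:.3f}")
```

In this tiny corpus "is" appears in three lines, "up" in two, and the full interface name in one, so the weights in the first row rise in that order.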

The bet TF-IDF makes is that lines about the same event will share rare words — OSPF, Down, the same interface name — and end up close in vector space. Most of the time, that bet pays off.

It does not pay off when log lines are templated tightly (every line uses the same vocabulary) or when the rare tokens are IPs that change every minute. We'll see that breakage in the notebook.
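A tiny version of the changing-IP problem, as a sketch. The sample lines are invented, and masking IPs with a placeholder token is one common workaround that I am assuming for illustration; it is not necessarily what the notebook does.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lines: the first two describe the same event from different IPs.
lines = [
    "Failed password for admin from 203.0.113.7",
    "Failed password for admin from 203.0.113.9",
    "Interface Gi0/0/1 changed state to up",
    "Interface Gi0/0/2 changed state to down",
]

IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def similarity_of_first_two(texts):
    """Cosine similarity between the TF-IDF vectors of texts[0] and texts[1]."""
    X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(texts).toarray()
    # Rows are L2-normalized by default, so the dot product is the cosine.
    return float(X[0] @ X[1])

sim_raw = similarity_of_first_two(lines)

# Assumed workaround: replace every IPv4 address with one placeholder token.
masked = [IP_PATTERN.sub("<IP>", line) for line in lines]
sim_masked = similarity_of_first_two(masked)

print(f"raw IPs:    {sim_raw:.3f}")
print(f"masked IPs: {sim_masked:.3f}")
```

With raw IPs, each address is a one-off token with a high weight, so two lines about the same event drift apart. Once masked, the two lines become identical text.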

What the cluster will tell you

After clustering, you'll have something like:

Cluster 0:  8,215 lines - link up/down on access ports
Cluster 1:  3,902 lines - DHCP lease info
Cluster 2:  2,447 lines - routing protocol keepalives
Cluster 3:    156 lines - auth failures from one IP
Cluster 4:      8 lines - ???

You don't need to read 14,728 lines. You read 8. Cluster 4 is your anomaly. Maybe it's nothing. Maybe it's the customer's slow website. Either way, that's where your coffee goes.

This is the core move of unsupervised anomaly detection: size as signal. The smallest cluster is not guaranteed to be the problem, but it is often the thing nobody has seen before, so it is the cheapest place to start reading. The math doesn't know what's "normal," but it can tell you what's rare.
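The whole move, from raw lines to "read the smallest cluster", fits in a short sketch. It assumes scikit-learn; the 200 synthetic lines below are generated by me for illustration and are not the notebook's data. Which lines land in the smallest cluster depends on the data and on k, so treat the printout as something to inspect, not a guaranteed answer.

```python
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synthetic syslog: four routine templates plus three odd lines.
lines = (
    [f"LINK-3-UPDOWN Interface Gi0/0/{i % 24} changed state to "
     f"{'up' if i % 2 else 'down'}" for i in range(80)]
    + [f"DHCPD assigned lease 10.1.{i % 4}.{i} to client" for i in range(60)]
    + [f"OSPF hello keepalive received from neighbor 10.0.0.{i % 6}"
       for i in range(40)]
    + [f"SSH authentication failure for user admin from 203.0.113.7 "
       f"attempt {i}" for i in range(17)]
    + ["FAN-2-FAILURE fan tray 1 stopped",
       "ENVMON temperature critical sensor 3",
       "FAN-2-FAILURE fan tray 2 stopped"]
)

# Text -> TF-IDF vectors -> 5 clusters.
X = TfidfVectorizer().fit_transform(lines)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Size as signal: count lines per cluster, then read the smallest one.
sizes = Counter(labels.tolist())
for cluster_id, count in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"Cluster {cluster_id}: {count:4d} lines")

smallest = min(sizes, key=sizes.get)
smallest_lines = [ln for ln, lab in zip(lines, labels) if lab == smallest]
print(f"\nSmallest cluster ({smallest}):")
for ln in smallest_lines:
    print("  ", ln)
```

Every line belongs to exactly one cluster, so the sizes always add up to the number of input lines.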

Where this falls down

I owe you the failure modes, because the notebook will show them.

Choosing k is guesswork. Set k=3 and you might merge auth failures with routing keepalives. Set k=20 and you'll fragment normal traffic into useless sub-groups. There's a heuristic called the elbow method (plot how tight the clusters are vs. k, look for the bend), but it's eyeballing a graph, not a formula. DBSCAN avoids this by not asking for k at all; it asks for a density threshold instead: a neighborhood radius and a minimum number of points within it.
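The elbow method can be sketched as a loop over candidate values of k. "How tight the clusters are" is measured here with scikit-learn's inertia_ attribute (sum of squared distances from each point to its nearest centroid). The synthetic lines and the range of k are my assumptions; I print the numbers rather than plotting them to keep the sketch dependency-free.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synthetic syslog with four underlying templates.
lines = (
    [f"LINK-3-UPDOWN Interface Gi0/0/{i % 24} changed state to up"
     for i in range(40)]
    + [f"DHCPD assigned lease 10.1.0.{i} to client" for i in range(30)]
    + [f"OSPF hello received from neighbor 10.0.0.{i % 6}" for i in range(20)]
    + [f"SSH authentication failure for user admin attempt {i}"
       for i in range(10)]
)
X = TfidfVectorizer().fit_transform(lines)

ks = list(range(2, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # lower means tighter clusters

# Look for the k after which the drop in inertia flattens out.
for k, inertia in zip(ks, inertias):
    print(f"k={k}  inertia={inertia:.2f}")
```

Inertia keeps shrinking as k grows, so the lowest value is never the answer by itself; the judgment call is where the improvement stops being worth another cluster.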

High dimensions hurt. TF-IDF on 14k lines might give you 5,000 unique words. K-means in 5,000-dimensional space behaves badly: distances between points become nearly uniform and stop being meaningful (this is called the curse of dimensionality). You'll need to either prune the vocabulary or project down with PCA (or truncated SVD, which works directly on a sparse TF-IDF matrix).
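Both remedies, pruning and projecting, can be sketched with scikit-learn. Two assumptions are mine: min_df=2 as the pruning rule (drop tokens that appear in only one line), and TruncatedSVD in place of PCA, because it accepts the sparse TF-IDF matrix directly. The synthetic lines and n_components=5 are illustrative values, not recommendations.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synthetic syslog; the DHCP lines carry many one-off tokens.
lines = (
    [f"LINK-3-UPDOWN Interface Gi0/0/{i % 24} changed state to up"
     for i in range(40)]
    + [f"DHCPD assigned lease 10.1.0.{i} to client" for i in range(10, 40)]
    + [f"OSPF hello received from neighbor 10.0.0.{i % 6}" for i in range(20)]
    + [f"SSH authentication failure for user admin attempt {i}"
       for i in range(10)]
)

# Baseline: every token becomes a column.
X_full = TfidfVectorizer().fit_transform(lines)

# Option 1, prune: keep only tokens that occur in at least two lines.
X_pruned = TfidfVectorizer(min_df=2).fit_transform(lines)

# Option 2, project: squeeze the remaining columns into 5 dense components.
svd = TruncatedSVD(n_components=5, random_state=0)
X_reduced = svd.fit_transform(X_pruned)

print("full:   ", X_full.shape)
print("pruned: ", X_pruned.shape)
print("reduced:", X_reduced.shape)
```

The reduced matrix can be passed to k-means in place of the raw TF-IDF matrix.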

TF-IDF doesn't understand structure. It treats "GigabitEthernet0/0/1" and "Gi0/0/1" as totally unrelated tokens, and it does the same with "interface" and "int". To it, "error rate 0.001%" and "error rate 99%" look almost identical: same words, different number. This is what transformers fix in chapter 02.
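This blind spot is easy to reproduce. In the sketch below (invented lines, whitespace tokenization as my assumption), two lines with opposite meanings score as more similar than two lines that describe the same event with different abbreviations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical lines (illustrative only).
lines = [
    "GigabitEthernet0/0/1 input error rate 0.001%",  # healthy
    "GigabitEthernet0/0/1 input error rate 99%",     # badly broken
    "interface GigabitEthernet0/0/2 is down",        # same event...
    "int Gi0/0/2 is down",                           # ...abbreviated
]

X = TfidfVectorizer(token_pattern=r"\S+").fit_transform(lines).toarray()

# Rows are L2-normalized by default, so a dot product is a cosine similarity.
sim_rates = float(X[0] @ X[1])  # opposite meaning, nearly the same words
sim_names = float(X[2] @ X[3])  # same meaning, different spellings

print(f"0.001% vs 99%:        {sim_rates:.3f}")
print(f"full vs abbreviated:  {sim_names:.3f}")
```

The vectorizer only counts matching tokens, so "0.001%" and "99%" are just two different strings, and so are "GigabitEthernet0/0/2" and "Gi0/0/2".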

Where you go next

When the math is enough, use it. Most of the day-to-day pattern-finding in network operations does not need an LLM. K-means and TF-IDF together cover an enormous amount of ground — anomaly hunting, log triage, ticket grouping, alert deduplication.

When the math is not enough, when meaning matters, when you need to know that "interface is down" and "port flapping" are the same thing, you need something that understands language, not just word frequency. That's chapter 02.

For now: open the notebook. Run it. Replace my synthetic syslog with real lines from your environment. Watch where the 5 clusters land. You'll see immediately whether the assumption "small cluster = anomaly" holds for your traffic. If it does, you have a tool for the rest of your career. If it doesn't, you'll know exactly why — and the why is what gets you to chapter 02.


Field exercise: export an hour of syslog from one of your switches. Paste 100-200 lines into the notebook's synthetic_logs cell. Run the rest unchanged. Read the smallest cluster. Was it noise or signal? Either answer is information.

Wrong way to use this chapter: memorize k-means. Right way: notice that you have a problem (too many logs, no time to read them), reach for the simplest tool that solves it (k-means + TF-IDF), and accept that the tool will be wrong sometimes and you'll learn to read those wrongs.

