Python for Network Engineers
When the keyboard isn't enough
What you'll build
- 50-device config-backup tool with retry logic + Git integration
- Multi-vendor CLI parser (Cisco IOS/NX-OS, Junos, Arista EOS) using only string methods
- Streaming syslog histogram from a 100 MB file, read line by line so the whole file is never loaded into memory
- Resilient BGP-neighbor state extractor across 20 routers
- Subnet calculator toolkit on top of Python's `ipaddress` library
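As an illustration of the streaming syslog histogram in the list above, here is a minimal sketch that counts severities one line at a time. It assumes each line begins with a `<PRI>` field (severity is PRI modulo 8, as in the syslog RFCs); many on-disk logs drop that field, so treat the parsing and the name `severity_histogram` as hypothetical, not as the course's implementation.

```python
from collections import Counter

def severity_histogram(lines):
    """Count syslog severities from any iterable of lines, one line at a time."""
    counts = Counter()
    for line in lines:
        # Assumption: the line starts with "<PRI>", where PRI = facility * 8 + severity.
        if line.startswith("<") and ">" in line:
            pri = line[1:line.index(">")]
            if pri.isdigit():
                counts[int(pri) % 8] += 1
    return counts

sample = [
    "<34>Oct 11 22:14:15 r1 su: failed login",
    "<165>Oct 11 22:14:16 r2 bgp: neighbor down",
    "<34>Oct 11 22:14:17 r1 su: failed login",
    "line without a PRI field",
]
histogram = severity_histogram(sample)
```

Because a file object yields lines lazily, passing an open file handle instead of `sample` keeps memory use roughly constant regardless of file size.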
Pain we solve
"I can't SSH into 50 boxes one at a time anymore." Bash loops break on the 14th device. This module replaces them with proper Python automation that handles vendor mix, timeouts, retries, and version control.
Data Wrangling & Exploration
When the counter is the truth
What you'll build
- BGP flap detector — 30 days of state-change logs, 1-hour rolling windows, ranked offenders
- NetFlow analysis: top-N talkers, per-protocol traffic matrix, flow-count histograms
- Cleaning pipeline for a dataset with 5% missing, 3% duplicate, and 2% counter-rollover rows
- Multi-panel latency report — per-region, with incident annotations, pod-to-pod heatmap
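The counter-rollover rows mentioned in the cleaning pipeline above come from fixed-width counters wrapping back to zero. A minimal sketch of the correction, assuming a 32-bit counter and at most one wrap between polls; the function name is hypothetical, and a device reboot (which also resets the counter) cannot be told apart from a wrap using these two values alone.

```python
COUNTER_MAX_32 = 2 ** 32  # a 32-bit counter wraps to 0 after 4294967295

def counter_delta(previous, current, modulus=COUNTER_MAX_32):
    """Return the amount counted between two polls, assuming at most one wrap."""
    if current >= previous:
        return current - previous
    # The counter wrapped: count up to the modulus, then from 0 to current.
    return current + modulus - previous

normal = counter_delta(1_000, 1_500)          # no wrap
wrapped = counter_delta(4_294_967_000, 500)   # wrapped once between polls
```

A naive `current - previous` on the second pair would give a large negative number, which is the kind of value that makes a dashboard misreport traffic.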
Pain we solve
"My 32-bit interface counter rolled over and the dashboard lied." Real operational data is dirty, has gaps, and contains rare-but-real edge cases. This module teaches you to model it correctly and tell the truth.
ML Foundations (Classical)
When the model has to defend itself
What you'll build
- Logistic regression trained by hand — log-loss computed from scratch, matched to sklearn within 0.001
- XGBoost severity classifier on 50K incidents with isotonic calibration
- HDBSCAN clustering on 200K cleaned NetFlow rows — identifying DDoS-style outliers
- SHAP + LIME + PDP explainers on 3 P1 incidents — defensible feature attribution
- Causal analysis: Did WAN optimization actually reduce latency? Difference-in-differences on 200 sites × 6 months
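For the "log-loss computed from scratch" item in the list above, a minimal standard-library sketch of binary log-loss follows. The clipping constant `eps` is an assumption made here to avoid `log(0)`; library implementations may clip differently, so small differences against sklearn are possible.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy for labels in {0, 1} and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

coin_flip = log_loss([1, 0], [0.5, 0.5])    # equals ln(2)
confident = log_loss([1, 0], [0.9, 0.1])    # equals -ln(0.9)
```

A confidently wrong prediction is penalized heavily, which is why log-loss is a useful lens on the calibration problem described in this module.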
Pain we solve
"The model says P1 with 95% confidence and it's wrong half the time." This module teaches calibration, interpretability, and causal reasoning — so the model defends its own outputs to the change-management board.
Deep Learning & NLP/LLMs
When the model needs the runbook
What you'll build
- 3-layer PyTorch classifier on 200K flows — beats your XGBoost baseline
- Graph Attention Network for link-failure prediction on a 500-node ISP topology
- SetFit fine-tune on 200 hand-labeled incident summaries — better than zero-shot Llama-3
- NetOps RAG assistant over 500 runbooks + 6 months of postmortems — MRR + LLM-as-judge eval
- GraphRAG for multi-hop root-cause analysis — beats vector RAG on 10 hand-crafted RCA queries
- Federated learning across 5 simulated regions — non-IID data
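The RAG assistant in the list above is evaluated with MRR (mean reciprocal rank). A minimal sketch of that metric, assuming each query has a ranked list of retrieved document IDs and a set of relevant IDs; the function name and the runbook IDs are made up for illustration.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant document per query (0 if none found)."""
    total = 0.0
    for ranking, gold in zip(ranked_results, relevant):
        for position, doc_id in enumerate(ranking, start=1):
            if doc_id in gold:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(ranked_results)

rankings = [
    ["rb-bgp", "rb-ospf", "rb-mpls"],   # relevant doc at rank 1
    ["rb-dns", "rb-ntp", "rb-bgp"],     # relevant doc at rank 3
    ["rb-ntp", "rb-dns"],               # relevant doc not retrieved
]
gold = [{"rb-bgp"}, {"rb-bgp"}, {"rb-mpls"}]
mrr = mean_reciprocal_rank(rankings, gold)
```

MRR only rewards the position of the first relevant hit, so it is usually paired with a second measure of answer quality such as the LLM-as-judge evaluation named in the list.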
Pain we solve
"The LLM doesn't know my topology and won't admit it." This module turns generic LLMs into ones that understand your runbooks, your topology, and your incident history.
Agentic AI & Frameworks
When the LLM has to act
What you'll build
- 3-agent NetOps team in CrewAI — Analyzer + Planner + Executor — solving 10 incident scenarios
- ReAct troubleshooting agent from scratch — no framework — then ported to CrewAI and LangGraph
- Multi-agent guardrail layer: dry-run, confirmation token, HITL on state-changing actions
- MCP integration: 3-agent crew calls tools through a Model Context Protocol server
- A2A (Agent-to-Agent) protocol for cross-framework agent communication
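The guardrail layer in the list above (dry-run, confirmation token, human-in-the-loop) can be sketched without any agent framework. Everything here is an assumption for illustration: the class name `GuardedExecutor`, the single-use token scheme, and the `run` callback that stands in for whatever actually touches a device.

```python
import secrets

class GuardedExecutor:
    """Hypothetical guardrail: state-changing commands need a dry run plus a token."""

    def __init__(self):
        self._pending = {}  # token -> command awaiting human approval

    def propose(self, command):
        """Dry run: record the command and return a one-time confirmation token."""
        token = secrets.token_hex(8)
        self._pending[token] = command
        return {"dry_run": True, "command": command, "token": token}

    def execute(self, command, token, run):
        """Run the command only if the token matches the proposed command."""
        approved = self._pending.pop(token, None)  # tokens are single-use
        if approved is None or approved != command:
            raise PermissionError("no matching approved dry run for this command")
        return run(command)

gate = GuardedExecutor()
plan = gate.propose("clear ip bgp 10.0.0.1 soft")
# A human reviews `plan` and hands the token back before anything runs.
result = gate.execute(plan["command"], plan["token"], run=lambda cmd: f"ran {cmd}")
```

Binding the token to the exact command text means an agent cannot obtain approval for one action and then execute a different one.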
Pain we solve
"I want an agent to fix BGP without rebooting the wrong router." This module teaches multi-agent architecture with explicit safety patterns — no autonomous production actions without human-in-the-loop gates.
Production Deployment & MLOps
When the demo meets 100 operators
What you'll build
- FastAPI gateway in front of the agent crew from the Agentic AI module (M5): auth, rate limiting, tiered routing
- Redis prompt cache + PostgresSaver for conversation state across restarts
- 100-concurrent-operator load test: measure p50/p95/p99 latency and cost per query
- Drift detection on a 1000-query stream — catch shifts within 50 requests
- Production-grade evaluation: LLM-as-judge + golden context dataset + offline benchmarks
- Observability: Opik traces from gateway → agent → MCP → backend
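The p50/p95/p99 figures from the load test in the list above are percentiles of per-request measurements. A minimal sketch using the nearest-rank method, which is one of several percentile definitions and is chosen here only for simplicity; the function names are hypothetical and the input must be non-empty.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceil(pct * n / 100)
    return ordered[rank - 1]

def latency_summary(latencies_ms):
    """Summarize a list of request latencies as p50, p95 and p99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}

# Synthetic latencies of 1..100 ms, purely for illustration.
summary = latency_summary(list(range(1, 101)))
```

Tail percentiles matter more than the mean in a load test, since a handful of slow requests is what operators actually notice.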
Pain we solve
"The notebook worked. The 100-user load test didn't." This module is the bridge from "I built it" to "I run it" — cost discipline, observability, eval gates, and incident response built in.