Software Engineering Roadmap
Author: Syed Muhammad Ali Haidry (@AliHaidry5)
From good coder to great engineer.
Six modules. Each builds on the last. Technical skill gets you in the room — the full stack of skills here determines what you do once you're there.
Five Core Principles — apply from day one
01
Think before typing
Five minutes of design saves fifty of debugging.
02
Tradeoffs everywhere
Every decision has a cost. Name it explicitly.
03
Make it work first
Correct before fast. Simple before clever.
04
Observe everything
You cannot improve what you cannot measure.
05
Communication is leverage
A doc touches 50 engineers. Code touches 5.
Before you design systems or deploy to the cloud, you need bedrock. These six areas aren't just prerequisites — they're the cognitive tools you'll reach for every day for the rest of your career.
📍 Module Overview
| Property | Value |
|---|---|
| Phase | Months 1–3 |
| Topics | 6 |
| Primary Tools | Python or TypeScript · PostgreSQL · Linux · Git |
| Key Mindset Shift | "How do I code this?" → "What is the shape of this problem?" |
01 · Data Structures & Algorithms
DSA is the grammar of engineering — not for interviews, but for pattern recognition. When you truly internalize it, you stop asking "how do I code this?" and start asking "what is the shape of this problem?"
💡 The Right Mindset: Every data structure is a tradeoff between speed, memory, and flexibility. Ask: "What operations do I need to be fast?" The answer determines the structure.
Big-O: Cost Before Code
Rule of thumb: if your algorithm has nested loops, it's probably O(n²) or worse. Estimate the cost before you write the code.
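As an illustration of that rule (function names are mine, not the roadmap's): the same question — "does this list contain a duplicate?" — answered in O(n²) with nested loops and in O(n) with a hash set.

```python
def has_duplicate_quadratic(items: list[int]) -> bool:
    # Nested loops -> O(n^2): fine for n = 100, painful for n = 1_000_000
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items: list[int]) -> bool:
    # One pass with a hash set -> O(n) time, O(n) extra memory
    seen: set[int] = set()
    for x in items:
        if x in seen:
            return True
        seen.add(x)
    return False
```

Same answer, wildly different cost curve — that tradeoff (time bought with memory) is exactly what the table below makes explicit.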
Core Data Structures
| Structure | Lookup | Insert | Delete | Use When |
|---|---|---|---|---|
| Array | O(1) | O(n) | O(n) | Indexed access, iteration, known size |
| Hash Map | O(1) avg | O(1) avg | O(1) avg | Fast key lookup, counting, deduplication |
| Stack | O(n) | O(1) | O(1) | LIFO — undo, call stack, bracket matching |
| Queue | O(n) | O(1) | O(1) | FIFO — job queues, BFS, event processing |
| Balanced BST | O(log n) | O(log n) | O(log n) | Sorted data, range queries |
| Heap | O(1) min | O(log n) | O(log n) | Priority queues, scheduling, top-K |
| Graph | O(V+E) | O(1) | O(E) | Relationships, shortest path, dependencies |
| Trie | O(m) | O(m) | O(m) | Autocomplete, prefix search, dictionaries |
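To make the heap row concrete — a minimal top-K sketch using Python's built-in `heapq` module (the function name is illustrative):

```python
import heapq

def top_k_scores(scores: list[int], k: int) -> list[int]:
    """Return the k largest scores in descending order, O(n log k)."""
    # heapq.nlargest maintains a size-k min-heap internally, so the
    # full list is never sorted — exactly the top-K use case above.
    return heapq.nlargest(k, scores)
```

Note `heapq` is a min-heap; for "k smallest" the mirror call is `heapq.nsmallest`.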
Algorithm Patterns
Don't memorize algorithms. Recognize patterns. Almost every problem maps to one of these six.
# Two Pointers — O(n) instead of naive O(n²)
def has_pair_with_sum(arr: list[int], target: int):
    """Find if any two numbers sum to target. Requires sorted input."""
    left, right = 0, len(arr) - 1
    while left < right:
        total = arr[left] + arr[right]
        if total == target:
            return (left, right)
        elif total < target:
            left += 1
        else:
            right -= 1
    return None
💡 Study Tip: Don't grind 200 LeetCode problems. Solve 50 well. After each: write down the pattern, explain the complexity, solve it again from scratch. Depth beats breadth.
02 · Computational Thinking
The 7-Step Problem-Solving Framework
| Step | Action | Why |
|---|---|---|
| 01 | Understand, don't assume | Restate the problem. What are the inputs? Outputs? Edge cases? |
| 02 | Explore concrete examples | Trace 2–3 examples by hand. Simple cases first, then edge cases. |
| 03 | Identify the pattern | Name the category before reaching for code. |
| 04 | Design before typing | 5 minutes of pseudocode saves 50 minutes of debugging. |
| 05 | Implement incrementally | Simplest working version first. Slow-correct beats fast-wrong. |
| 06 | Verify and stress-test | Empty input? n=1? Enormous n? |
| 07 | Analyze and reflect | State the Big-O. What would you change? Reflection compounds. |
⚠️ Common Mistake: Jumping straight to code is the #1 mistake junior engineers make. The keyboard is not where thinking happens.
03 · Programming Depth
Pick Python or TypeScript. Go deep, not wide. Master one — the rest follow much faster.
# ❌ Opaque — reader must decode every symbol
def f(d, t):
    r = []
    for x in d:
        if x['ts'] > t:
            r.append(x)
    return r

# ✓ Expressive — reads like a sentence
def filter_recent_events(
    events: list[Event],
    cutoff_timestamp: int,
) -> list[Event]:
    """Return events newer than the cutoff timestamp."""
    return [e for e in events if e.timestamp > cutoff_timestamp]
04 · Databases & Storage
💡 Core Philosophy: Design backwards from your queries. What reads are most frequent? What writes? This determines schema, indexes, and whether you need SQL, NoSQL, or a mix.
CAP Theorem
Consistency (C): every read receives the most recent write or an error. All nodes see the same data at the same time.
Availability (A): every request receives a response (not necessarily the latest data). The system keeps responding even when nodes fail.
Partition tolerance (P): the system continues operating despite network partitions between nodes. In practice, P is always required in distributed systems.
CP: when a partition occurs, the system rejects requests rather than returning stale data. Best for financial systems, inventory.
AP: the system stays up and accepts requests even during partitions, but may return stale data. Best for social feeds, caches, DNS.
CA: only possible without network partitions — i.e., a single-node or tightly-coupled cluster. Not viable for distributed systems.
When Not to Use SQL
| Store Type | Examples | Use When |
|---|---|---|
| Document | MongoDB, CouchDB | Flexible, nested, schema-less data |
| Key-Value | Redis, DynamoDB | Sessions, caching, simple lookups |
| Column-family | Cassandra, HBase | Time-series, massive write throughput |
| Graph | Neo4j, Neptune | Relationship-heavy traversal queries |
| Vector | Pinecone, pgvector | Semantic search, ML embeddings |
-- Design for your most common queries first
WITH recent_orders AS (
    SELECT customer_id, SUM(total) AS revenue
    FROM orders
    WHERE created_at >= NOW() - INTERVAL '90 days'
      AND status = 'completed'
    GROUP BY customer_id
)
SELECT c.name, c.email, ro.revenue
FROM recent_orders ro
JOIN customers c ON c.id = ro.customer_id
ORDER BY ro.revenue DESC
LIMIT 5;
-- Then: EXPLAIN ANALYZE this. Is it using the index on created_at?
05 · Networking Fundamentals
What happens when you type a URL and press Enter?
1. Browser checks DNS cache for IP address
2. DNS resolution: recursive resolver → root → TLD → authoritative → IP
3. TCP 3-way handshake (SYN → SYN-ACK → ACK)
4. TLS handshake: negotiate cipher, exchange certs, derive session keys
5. Browser sends HTTP GET request
6. Request hits load balancer → routes to app server
7. App server processes, queries DB, builds response
8. HTTP response sent back (200 OK + body)
9. Browser renders / client parses the response
10. Connection kept alive (HTTP/1.1 keep-alive, HTTP/2 multiplexing)
Know each step. Every one can fail. Every one can be optimized.
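Steps 1–2 can be poked at directly from Python — a small sketch using the standard library's resolver (which consults the OS cache and DNS exactly as the browser's first hop does):

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP string
    return sorted({info[4][0] for info in infos})
```

Try it on a host you own; everything after this point in the list (TCP, TLS, HTTP) builds on the address this step returns.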
06 · Linux & the Command Line
⚠️ The Scenario: SSH into a broken production server at 2am. No GUI. No IDE. Just a terminal and your knowledge. Prepare now.
# Log investigation — most critical production skill
tail -f /var/log/app/error.log # live log stream
journalctl -u myapp -n 100 # last 100 lines
grep -A5 -B5 "Exception" app.log # context around errors
# Disk & network
df -h # full disk = silent failures
ss -tlnp # what ports are listening?
curl -v https://api.example.com/health # verbose HTTP check
ls -la
find . -name "*.log"
cd -
ps aux | grep python
htop
kill -9 PID
tail -f app.log
grep -r "ERROR" /var/log/
awk '{print $1}' access.log | sort | uniq -c
curl -v https://api.example.com
ss -tlnp
traceroute google.com
🛠 Practice Projects
Project 1 — DSA Task Scheduler: Build a scheduler using a heap, hash map, and queue. Implement MinHeap from scratch. Write tests that verify Big-O behavior with timing.
Project 2 — CLI Tool: Build something useful (CSV processor, log analyzer). Full type hints, >80% test coverage, every error path handled. Write a README as if a stranger will use it.
Project 3 — REST API + Database: CRUD API with PostgreSQL. Correct HTTP verbs and status codes. EXPLAIN ANALYZE your most important queries.
Project 4 — Linux Server: Spin up a free VM. Install Nginx. Create a non-root user, disable root SSH. Write a cron job for disk monitoring. Simulate a failure and diagnose from logs.
✅ Mastery Checklist
System design is engineering's most transferable skill. It's how you think at scale — anticipating failure, bottlenecks, and tradeoffs before writing a single line of code. Every decision has a cost. A great engineer knows what cost they're paying.
📍 Module Overview
| Property | Value |
|---|---|
| Phase | Months 3–6 |
| Prerequisite | Module 01 — Engineering Foundations |
| Core Topics | 6 |
| Key Mindset Shift | "Make it work" → "Make it work at 100x scale with graceful failure" |
🧠 The Systems Thinker's Mindset
JUNIOR ENGINEER sees: "I need to store user data."
SYSTEMS ENGINEER sees: "What's the read/write ratio? How many users? In 2 years? What's acceptable latency? What happens when the DB is slow?"
Every production system eventually receives more traffic than it was designed for, runs on hardware that fails, is used in ways nobody anticipated, and is maintained by engineers who weren't there when it was built. Design for that reality from the start.
01 · The 7-Step Design Framework
Back-of-envelope estimation
URL SHORTENER — estimating scale
Assumptions: 100M DAU, read:write ratio = 10:1
Writes: 10M/day → 116 writes/sec → 350 peak
Reads: 100M/day → 1,160 reads/sec → 3,500 peak ← hot path
Storage: 18.25B records × 500B = 9TB over 5 years
Conclusions:
→ Reads >> Writes — optimise read path aggressively
→ Cache hot URLs in Redis (expect 90% hit rate)
→ Single DB handles writes; read replicas needed
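The requests-per-second figures above come from one conversion — daily volume ÷ 86,400 seconds — plus a peak multiplier. A sketch, assuming the same ~3× peak factor used above:

```python
SECONDS_PER_DAY = 86_400

def estimate_qps(daily_requests: int, peak_multiplier: float = 3.0) -> dict:
    """Back-of-envelope average and peak requests/sec for a daily volume."""
    avg = daily_requests / SECONDS_PER_DAY
    return {
        "avg_per_sec": round(avg),
        "peak_per_sec": round(avg * peak_multiplier),
    }
```

For 10M writes/day this gives ~116/sec average, ~350 peak — matching the figures above. The multiplier is a convention, not a law; measure your own traffic shape.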
02 · Core Tradeoffs
Strong engineers don't find "the right answer" — they find the right answer for this context.
⚠️ After every design decision: "By choosing X, I'm accepting Y." If you can't complete that sentence, you don't understand the tradeoff yet.
Non-Functional Requirements
| NFR | Measured By | Typical Target | How to Achieve |
|---|---|---|---|
| Availability | Uptime % | 99.9% = 8.7h/yr downtime | Redundancy, health checks, auto-restart |
| Latency | p50/p95/p99 ms | p99 < 200ms for APIs | Caching, CDN, async processing |
| Throughput | Requests/sec | Depends on scale | Horizontal scaling, load balancing |
| Durability | Data loss tolerance | 0 data loss for financial | Replication, backups, WAL |
| Scalability | Growth headroom | 10× current load | Stateless services, sharding |
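The p50/p95/p99 figures in the latency row are just ranked observations. A nearest-rank sketch (the function name and method are illustrative; production systems usually estimate percentiles from histograms):

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(latencies_ms)
    # rank of the observation at or above the p-th percentile
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

On a set of 100 latencies, p99 returns the second-worst-or-worst observation — which is exactly why p99 targets catch the tail latency that averages hide.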
03 · Distributed Systems
⚠️ The 8 Fallacies of Distributed Computing: The network is reliable · Latency is zero · Bandwidth is infinite · The network is secure · Topology doesn't change · There is one administrator · Transport cost is zero · The network is homogeneous. Design around every one of these.
# The 3 questions for every distributed operation:

# 1. What if the network drops mid-request?
def transfer_right(from_acc, to_acc, amount):
    txn_id = generate_uuid()
    with transaction(txn_id):
        debit(from_acc, amount)
        credit(to_acc, amount)
    # Safe to retry — duplicate txn_id is a no-op

# 2. What if two nodes do the same thing simultaneously?
# Optimistic lock: the UPDATE only succeeds if the version matches:
#   UPDATE inventory
#   SET qty = :qty, version = version + 1
#   WHERE id = :id AND version = :expected_version

# 3. What if a downstream service is slow?
circuit_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30,
    fallback=lambda: cached_response(),
)
04 · Caching
💡 Phil Karlton: "There are only two hard things in Computer Science: cache invalidation and naming things." This is funny because it's true.
# Redis caching with stampede protection + TTL jitter
def get_user_profile(user_id: str) -> dict:
    cached = redis.get(f"user:{user_id}")
    if cached:
        return json.loads(cached)
    # Only one caller rebuilds the cache; everyone else waits and retries
    lock = redis.set(f"lock:user:{user_id}", "1", nx=True, ex=5)
    if lock:
        try:
            profile = db.query("SELECT * FROM users WHERE id = %s", user_id)
            ttl = 3600 + random.randint(-360, 360)  # jitter prevents mass expiry
            redis.setex(f"user:{user_id}", ttl, json.dumps(profile))
            return profile
        finally:
            redis.delete(f"lock:user:{user_id}")
    else:
        time.sleep(0.1)
        return get_user_profile(user_id)
05 · Security
⚠️ Security isn't a feature you add later — it's a perspective you embed from day one. Threat modelling happens at the whiteboard, not in a penetration test after launch.
| Authentication (AuthN) | Authorization (AuthZ) | |
|---|---|---|
| Question | "Who are you?" | "What can you do?" |
| Methods | Passwords, OAuth2, SSH keys, MFA | RBAC, ABAC, ACLs, policies |
| Order | Always first | Always second |
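The "AuthN always first, AuthZ always second" ordering can be sketched in a few lines (the session and permission stores here are illustrative dicts, not a real framework):

```python
class AuthError(Exception):
    """Raised with the HTTP-style status the caller should return."""

def handle_request(token: str, action: str, sessions: dict, permissions: dict) -> str:
    # AuthN first: who are you?
    user = sessions.get(token)
    if user is None:
        raise AuthError("401 Unauthorized")  # identity unknown — stop here
    # AuthZ second: what can you do?
    if action not in permissions.get(user, set()):
        raise AuthError("403 Forbidden")  # identity known, action not allowed
    return f"{user} performed {action}"
```

Note the status codes fall out of the ordering: 401 means "I don't know you", 403 means "I know you, and no."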
06 · Observability
💡 The Three Pillars: Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where the time went. You need all three.
The Four Golden Signals (Google SRE)
| Signal | What It Measures | Why It Matters |
|---|---|---|
| Latency | Time to serve a request (p50/p95/p99) | Directly impacts user experience |
| Traffic | Volume of demand (QPS) | Capacity planning, anomaly detection |
| Errors | Rate of failing requests (5xx) | Most direct user-impact signal |
| Saturation | How "full" the service is (CPU %, queue depth) | Leading indicator — alerts before breakage |
🛠 Design Exercises
Exercise 1 — URL Shortener: Focus on hash function choice, redirect latency, analytics counting without blocking redirects.
Exercise 2 — Twitter Feed: The fan-out problem. How do you handle celebrities with 100M followers? Push vs pull tradeoff.
Exercise 3 — Rate Limiter: Token bucket vs sliding window. Distributed rate limiting across multiple app servers.
Exercise 4 — Notification System: Multi-channel (email, SMS, push). Priority queues. Deduplication. Retry with backoff.
✅ Mastery Checklist
The cloud isn't a hosting provider — it's a way of building. Learn to think in terms of managed services, auto-scaling, pay-per-use, and infrastructure as code. Owning servers is legacy thinking.
📍 Module Overview
| Property | Value |
|---|---|
| Phase | Months 4–8 |
| Prerequisite | Modules 01–02 (Foundations + Systems Design) |
| Core Topics | 6 |
| Primary Tools | AWS / GCP / Azure · Docker · Kubernetes · Terraform · GitHub Actions |
| Key Mindset Shift | "I provision infrastructure" → "I declare desired state" |
🧠 The Cloud-First Mental Model
| Old Thinking | Cloud Thinking |
|---|---|
| "I need a server" | "I need compute for this workload" |
| "SSH in and configure it" | "Declare desired state as code" |
| "Deployment = FTP to a server" | "Deployment = triggering a pipeline" |
| "Scaling = bigger machine" | "Scaling = changing a number in config" |
| "DR = manual backup" | "DR = terraform apply" |
01 · Cloud Fundamentals
| Model | You Manage | Provider Manages | Examples |
|---|---|---|---|
| IaaS | OS, runtime, app, data | Hardware, network | EC2, GCE, Azure VMs |
| PaaS | App code and data | OS, runtime, scaling | Heroku, App Engine |
| SaaS | Configuration only | Everything | Salesforce, GitHub |
💡 Which to learn first? AWS — largest market share, most job requirements, most learning resources. Core concepts transfer to GCP/Azure within days.
02 · Containers & Orchestration
Before containers: "it works on my machine." After containers: "the machine is part of the package."
FROM python:3.12-slim
WORKDIR /app
# Copy deps FIRST — layer caching makes builds 10× faster
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
# Non-root user — security principle of least privilege
RUN adduser --disabled-password appuser && chown -R appuser /app
USER appuser
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
# deployment.yaml — declare desired state, K8s makes it real
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: api
          image: myregistry/api:v2.4.1  # never use :latest in production
          resources:
            requests: { memory: "128Mi", cpu: "250m" }
            limits: { memory: "512Mi", cpu: "500m" }
          readinessProbe:
            httpGet: { path: /health, port: 8000 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /health, port: 8000 }
            failureThreshold: 3
| K8s Resource | What It Does | When You Use It |
|---|---|---|
| Pod | Smallest deployable unit (1+ containers) | Never directly; use Deployment |
| Deployment | Manages replicas, rolling updates, rollbacks | Every stateless service |
| Service | Stable DNS + load balancing across pods | Every service receiving traffic |
| Ingress | HTTP routing, TLS termination | Exposing services to the internet |
| ConfigMap | Non-sensitive config | Feature flags, app config |
| Secret | Sensitive config (base64 + encrypted at rest) | API keys, DB passwords, TLS certs |
| HPA | Horizontal Pod Autoscaler | Variable traffic workloads |
03 · Infrastructure as Code
"If you can't reproduce your environment from code, you don't have infrastructure — you have hope."
# main.tf — declare your desired cloud infrastructure
terraform {
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_autoscaling_group" "api" {
  min_size                  = 2
  max_size                  = 10
  desired_capacity          = 3
  health_check_type         = "ELB"
  health_check_grace_period = 300
}

resource "aws_budgets_budget" "monthly" {
  budget_type  = "COST"
  limit_amount = "500"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"
  notification {
    comparison_operator = "GREATER_THAN"
    threshold           = 80
    threshold_type      = "PERCENTAGE"
    notification_type   = "ACTUAL"
  }
}
terraform init # download providers, initialize backend
terraform plan # preview changes (always review this!)
terraform apply # make it so
terraform destroy # tear down (careful in prod!)
04 · CI/CD Pipelines
# .github/workflows/deploy.yml
name: Test, Build & Deploy
on:
  push: { branches: [main] }
  pull_request: { branches: [main] }
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check src/ tests/
      - run: mypy src/
      - run: pytest tests/unit/ --cov=src
  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production  # requires manual approval in GitHub
    steps:
      - run: |
          kubectl set image deployment/api \
            api=ghcr.io/${{ github.repository }}:${{ github.sha }}
          kubectl rollout status deployment/api --timeout=5m
      - run: curl --fail https://api.myapp.com/health
| Strategy | How it works | Risk | Use When |
|---|---|---|---|
| Rolling | Replace pods one by one | Medium | Standard production deploys |
| Blue/Green | Two envs, swap DNS | Low (instant rollback) | Zero-downtime critical deploys |
| Canary | Route 5% to new version | Very low | High-risk changes, large user base |
| Feature Flags | Deploy dark, enable per-user | Minimal | A/B testing, progressive rollout |
05 · Serverless & Edge
# AWS Lambda — event-driven, scales to zero
import json, boto3
def handler(event, context):
bucket = event['Records'][0]['s3']['bucket']['name']
key = event['Records'][0]['s3']['object']['key']
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=key)
data = json.loads(obj['Body'].read())
result = process(data)
s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=json.dumps(result))
return {'statusCode': 200}
Cold start mitigation: Provisioned concurrency · Minimize package size · Avoid Java/C# runtimes · Move initialization outside handler · Use VPC only if required.
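The "move initialization outside the handler" tip in miniature: module-level objects survive across warm invocations, so expensive construction is paid once per container, not once per request. The class below is a stand-in for something like a boto3 client, not a real AWS API:

```python
import time

class ExpensiveClient:
    """Stand-in for a slow-to-construct client (e.g. an SDK client)."""
    def __init__(self):
        time.sleep(0.01)  # simulate slow construction
        self.calls = 0

    def fetch(self, key: str) -> str:
        self.calls += 1
        return f"value:{key}"

client = ExpensiveClient()  # module scope — paid once, at cold start

def handler(event, context=None):
    # Per-request work only; warm invocations reuse the same client
    return client.fetch(event["key"])
```

Had `ExpensiveClient()` lived inside `handler`, every invocation would pay the construction cost — the single biggest self-inflicted cold-start tax.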
When to use edge: Geolocation routing, A/B testing headers, auth token validation, rate limiting, static content. Tools: Cloudflare Workers, Lambda@Edge, Vercel Edge Functions.
06 · FinOps — Cost as Engineering
⚠️ Unmonitored cloud spend is the silent tax on fast-moving teams. A misconfigured NAT Gateway can quietly rack up thousands per month.
| Resource | Common Mistake | Fix |
|---|---|---|
| EC2 | Oversized instances, always running | Right-size + Spot instances |
| Data Transfer | Cross-region egress | Co-locate services, use CDN |
| NAT Gateway | All private subnet traffic | VPC endpoints for S3/DynamoDB |
| RDS | Multi-AZ dev databases | Single AZ for dev/staging |
| Unattached EBS | Orphaned volumes | Tag + audit regularly |
✅ Mastery Checklist
- Containerised a real application with a production-grade Dockerfile
- Deployed to Kubernetes with readiness/liveness probes and resource limits
- Defined a complete environment in Terraform with remote state
- Built a CI/CD pipeline that runs tests, builds images, and deploys on merge
- Implemented blue/green or canary deployment for a real service
- Set up CloudWatch alarms for the 4 Golden Signals on a deployed service
- Zero hardcoded secrets anywhere — all via Secrets Manager / SSM
- Reduced a cloud bill by at least 20% through right-sizing or Reserved Instances
- Tagged all resources; generated a cost-by-project report
- Written a runbook for a common operational task (deploy, rollback, scale)
You don't need to be a researcher. You need to understand AI as a building block — when to use it, what it costs, how it fails, and how to integrate it reliably into production systems. AI is infrastructure now. Treat it like infrastructure.
📍 Module Overview
| Property | Value |
|---|---|
| Phase | Months 6–12 |
| Prerequisite | Modules 01–03 (Foundations, Systems Design, Cloud) |
| Primary Tools | Python · PyTorch · scikit-learn · OpenAI/Anthropic APIs · Pinecone · MLflow |
| Key Mindset Shift | "AI is magic" → "AI is a probabilistic function with known failure modes" |
🧠 The Engineer's Mental Model for AI
A machine learning model is a function:
f(input) → output + confidence
It was created by:
optimizer(loss_function(f(training_data), labels)) → parameters
At inference, it's just matrix multiplications + non-linearities.
Three questions before using AI:
1. IS AI THE RIGHT TOOL?
Could a deterministic function solve this instead?
Spend one week trying rules/heuristics first.
2. WHAT DOES FAILURE LOOK LIKE?
How does this model fail? How often?
What happens downstream when it fails?
3. HOW DO I MEASURE SUCCESS?
What's my evaluation metric?
How do I know if the model degrades in production?
01 · ML Fundamentals
SUPERVISED: (input, label) pairs → learn f(input) → label
UNSUPERVISED: inputs only → find structure / patterns
REINFORCEMENT: agent + environment + rewards → learn policy
SELF-SUPERVISED: labels from the data itself → GPT, BERT, CLIP
# Every ML training loop is this, at its core:
for epoch in range(num_epochs):
    for batch in dataloader:
        predictions = model(batch.inputs)
        loss = loss_function(predictions, batch.labels)
        loss.backward()        # which params caused the error?
        optimizer.step()       # step in direction that reduces loss
        optimizer.zero_grad()
# BatchNorm, dropout, attention — all improvements on this.
02 · LLMs & Prompt Engineering
LLMs are next-token predictors. They're not retrieving facts — they're generating plausible continuations. This means they hallucinate, are sensitive to phrasing, and have a hard context window limit.
# ❌ Vague prompt — you get vague output
response = llm.complete("Tell me about this code")

# ✓ Structured prompt — clear contract
# (f-string: JSON braces are doubled so only {code} is interpolated)
response = llm.complete(f"""
You are a senior engineer reviewing a pull request.
Review for: correctness, edge cases, performance, readability.
Output ONLY as JSON:
{{ "verdict": "approve|request_changes",
   "issues": [{{"severity": "high|medium|low", "description": "..."}}],
   "summary": "..." }}
Code: {code}
""")
| Pattern | Use When | Example Instruction | Why It Works |
|---|---|---|---|
| Zero-shot | Simple, clear tasks | "Classify this review as positive/negative." | Start here. Often enough. |
| Few-shot | Specific format or edge cases | "Here are 3 examples of the format I want: [...] Now do this one:" | Show, don't just tell. |
| Chain-of-Thought | Reasoning, multi-step problems | "Think step by step before answering." | Improves accuracy on hard problems. |
| Structured Output | Downstream processing | "Respond only as JSON with keys: verdict, issues, summary." | Essential for production integrations. |
RAG — Retrieval-Augmented Generation
# RAG pipeline in ~30 lines
def rag_query(question: str) -> str:
    # 1. Embed the question
    embedding = openai.embeddings.create(
        input=question, model="text-embedding-3-small"
    ).data[0].embedding
    # 2. Find semantically similar chunks
    results = pinecone_index.query(vector=embedding, top_k=5, include_metadata=True)
    context = "\n\n".join(r.metadata["text"] for r in results.matches)
    # 3. Generate grounded answer
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based ONLY on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    ).choices[0].message.content
⚠️ Agent Warning: Agents that take real-world actions (sending emails, running code, calling APIs) are powerful but dangerous. Always implement human-in-the-loop checkpoints for irreversible actions.
03 · Vector Databases & Embeddings
An embedding is a numerical representation of meaning. Similar meanings → similar vectors. "king" − "man" + "woman" ≈ "queen" — this is actual vector arithmetic.
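A toy sketch of that arithmetic, with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these numbers are invented for illustration):

```python
import math

# Invented toy "embeddings" — NOT from a real model
king  = [0.9, 0.8, 0.1]
man   = [0.5, 0.9, 0.0]
woman = [0.5, 0.1, 0.0]
queen = [0.9, 0.0, 0.1]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction (same meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman lands (here, exactly) on queen
analogy = [k - m + w for k, m, w in zip(king, man, woman)]
```

Vector databases run the same cosine (or dot-product) comparison, just over millions of stored vectors with an approximate-nearest-neighbour index instead of a loop.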
04 · MLOps — AI in Production
NOTEBOOK MODEL: runs on your laptop · uses a static dataset · not versioned · not monitored · not reproducible.
PRODUCTION ML SYSTEM: containerised, deployed, versioned · served at low latency · automatically retrained on fresh data · monitored for drift and degradation · rollback capability.
# Detect data drift — the silent model killer
from scipy import stats

def detect_drift(training_dist, production_dist, threshold=0.05):
    _, p_value = stats.ks_2samp(training_dist, production_dist)
    return {
        "drift_detected": p_value < threshold,
        "p_value": p_value,
        "action": "Investigate and consider retraining" if p_value < threshold else "OK",
    }
| Category | Tools | What It Solves |
|---|---|---|
| Experiment Tracking | MLflow, Weights & Biases, Neptune | Reproducibility — track params, metrics, artefacts per run |
| Model Registry | MLflow, Hugging Face Hub, SageMaker | Versioned model storage, promotion workflows (staging → prod) |
| Feature Store | Feast, Tecton, Vertex AI Feature Store | Consistent features between training and serving |
| Model Serving | FastAPI + Docker, BentoML, Seldon, TorchServe | Low-latency inference endpoints with scaling |
| Pipeline Orchestration | Airflow, Kubeflow, Prefect, ZenML | Automated data→train→evaluate→deploy pipelines |
| Monitoring | Evidently, WhyLabs, Arize, Prometheus | Drift detection, data quality, performance degradation |
| CI/CD for ML | GitHub Actions + DVC, CML | Automated testing, model evaluation on every commit |
05 · AI Safety & Responsible Engineering
AI safety failures are production incidents. Bias in a loan model causes financial harm. Hallucinations in a medical assistant can be dangerous. These are engineering requirements.
OWASP Top 5 for LLMs:
1. PROMPT INJECTION — input hijacks model behaviour
2. INSECURE OUTPUT — LLM output executed without validation
3. TRAINING DATA POISONING — malicious training data corrupts model
4. MODEL DoS — crafted inputs cause excessive compute
5. EXCESSIVE AGENCY — LLM given too many real-world permissions
Fix #5: Principle of least privilege + human-in-the-loop
for any irreversible action.
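The fix for #5 can be sketched as an allow-list plus an approval gate (tool names are illustrative; this is not a real agent framework):

```python
ALLOWED_TOOLS = {"search_docs", "read_file"}      # least privilege: reversible only
NEEDS_APPROVAL = {"send_email", "delete_record"}  # irreversible: human-in-the-loop

def execute_tool(tool: str, approved: bool = False) -> str:
    """Gate every LLM-requested action before it touches the real world."""
    if tool in ALLOWED_TOOLS:
        return f"ran {tool}"
    if tool in NEEDS_APPROVAL:
        if not approved:
            return f"PENDING: {tool} requires human approval"
        return f"ran {tool} (human-approved)"
    # Default-deny: anything not explicitly listed is refused
    return f"DENIED: {tool} is not on the allow-list"
```

The key design choice is default-deny: the model can only request actions you enumerated, and irreversible ones stall until a human says yes.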
06 · Applied AI — When to Use What
Does the problem have clear patterns in historical data?
Decision: Prompt → RAG → Fine-tune
Does a well-crafted prompt solve it? YES → Ship it.
Does the model need external knowledge? YES → Add RAG.
Need specific style / vocabulary? YES → Consider fine-tuning.
Latency / cost at massive scale? YES → Fine-tune a smaller model.
Otherwise → Re-examine if you need ML at all.
✅ Mastery Checklist
Code is read far more than it is written. On a living codebase, the same lines are read hundreds of times — by teammates, future-you, and engineers who haven't joined yet. Write for the reader, not the compiler.
📍 Module Overview
| Property | Value |
|---|---|
| Phase | Ongoing — apply from day one |
| Prerequisite | Modules 01–02 (Foundations, Systems Design) |
| Core Topics | 7 |
| Key Mindset Shift | "Does it work?" → "Will this be easy to change in 6 months?" |
🧠 The Craft Mindset
Code that WORKS: Code that is CRAFTED:
✓ Correct behaviour ✓ Correct behaviour
? Understandable ✓ Reads like well-written prose
? Testable ✓ Has a test for every behaviour
? Changeable ✓ Easy to extend, hard to break
? Maintainable ✓ Deletes itself as requirements change
💡 The Craftsperson's Rule: Leave the code better than you found it. Every time you touch a file, improve one thing. Rename one variable. Extract one function. These compound into a dramatically cleaner codebase over months.
01 · Code as Communication
# ❌ Opaque — reader must decode every symbol
def proc(d, f, t):
    r = []
    for x in d:
        if x[f] > t:
            r.append(x)
    return r

# ✓ Expressive — reads like a sentence
def filter_records_above_threshold(
    records: list[dict],
    field: str,
    threshold: float,
) -> list[dict]:
    return [r for r in records if r[field] > threshold]
# ❌ One function doing five things — impossible to test or reuse
def process_order(order):
    if not order.items:
        raise ValueError("Empty order")
    if order.user.is_premium:
        order.total *= 0.9
    charge = stripe.charge(order.total, order.user.card)
    send_email(order.user.email, f"Order {order.id} confirmed")
    db.save(order)
    return order

# ✓ Each function has one job — independently testable
def process_order(order: Order) -> Order:
    validate_order(order)
    apply_discount(order)
    charge_payment(order)
    notify_user(order)
    persist_order(order)
    return order
02 · SOLID Principles
SOLID isn't a checklist — it's five lenses for evaluating whether your code will be easy to change, test, and understand six months from now.
💡 How to use SOLID: Apply it when you feel pain — a class that keeps changing, a test that needs a dozen mocks, a function you're afraid to touch. The pain is the signal; SOLID is the diagnosis tool.
Every class, function, and module should do exactly one thing and do it well. If a class handles both business logic and database access, changes to the database force changes to the business logic — coupling you never wanted.
# ❌ One class, three unrelated reasons to change
class User:
    def save_to_db(self):            # DB concern
        db.execute("INSERT ...")
    def send_welcome_email(self):    # Email concern
        smtp.send(self.email, ...)
    def to_json(self):               # Serialisation concern
        return json.dumps(...)

# ✓ One responsibility per class
class User:
    pass  # just the domain object

class UserRepository:                # DB concern
    def save(self, user: User): ...

class UserNotifier:                  # Email concern
    def send_welcome(self, user): ...

class UserSerializer:                # Serialisation concern
    def to_json(self, user): ...

03 · Design Patterns
Design patterns are a shared vocabulary for recurring design problems. When a senior engineer says "use a Strategy here", the whole team instantly understands the shape of the solution.
⚠️ Don't Pattern-Match Prematurely: Write the simplest code first. Extract a pattern when you see the same problem recurring — not to seem sophisticated.
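For instance, the Strategy pattern mentioned above — interchangeable algorithms behind one stable interface — can be as small as this (the pricing functions are illustrative):

```python
from typing import Callable

# Two interchangeable pricing strategies (illustrative numbers)
def standard_shipping(weight_kg: float) -> float:
    return 5.0 + 1.2 * weight_kg

def express_shipping(weight_kg: float) -> float:
    return 15.0 + 2.5 * weight_kg

def quote(weight_kg: float, strategy: Callable[[float], float]) -> float:
    # The caller picks the algorithm; quote() never changes when
    # a new shipping option is added — that's the pattern's point.
    return round(strategy(weight_kg), 2)
```

Adding a third option means writing one new function, not editing `quote` — the Open/Closed principle in miniature.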
class NotificationFactory:
    @staticmethod
    def create(type: str) -> Notification:
        match type:
            case "email": return EmailNotification()
            case "sms": return SMSNotification()
            case "push": return PushNotification()

04 · Architecture
Architecture is the set of decisions that are hard to change later.
| Architecture | Best For | Core Idea | Complexity |
|---|---|---|---|
| Monolith | Startups, small teams | Single deployable unit | Low |
| Modular Monolith | Growing teams | Monolith with enforced boundaries | Medium |
| Hexagonal / Clean | Domain-heavy apps | Business logic isolated from infra | Medium |
| Microservices | Large orgs, independent scaling | Services split by domain | High |
⚠️ The Monolith Rule: Start with a monolith. Always. Microservices solve organisational and scaling problems you don't have yet. Premature decomposition is the most expensive architectural mistake in the industry.
Dependencies point inward. The domain layer knows nothing about databases, HTTP, or frameworks. This makes your domain logic independently testable and infrastructure replaceable.
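A minimal sketch of "dependencies point inward" (all names are illustrative): the domain defines the port; infrastructure supplies the adapter.

```python
from typing import Protocol

# Port: defined BY the domain — it knows nothing about Postgres or HTTP
class OrderRepository(Protocol):
    def save(self, order_id: str) -> None: ...

# Domain logic depends only on the port
def place_order(order_id: str, repo: OrderRepository) -> str:
    repo.save(order_id)
    return f"order {order_id} placed"

# Adapter: infrastructure implements the port (in-memory here;
# a Postgres-backed class would slot in without touching place_order)
class InMemoryOrderRepository:
    def __init__(self):
        self.saved: list[str] = []

    def save(self, order_id: str) -> None:
        self.saved.append(order_id)
```

Tests use the in-memory adapter; production wires in the database one. The domain function never changes either way.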
05 · Testing Philosophy
Tests are not just verification — they are executable documentation, design feedback, and a safety net all at once.
💡 Tests as Design Feedback: If your code is hard to test, that's not a testing problem — it's a design problem. Tight coupling, missing abstractions, mixed concerns all show up as pain in tests first.
```python
# TDD: write the test FIRST — design emerges from tests
import pytest

# Step 1 (RED): write failing test
def test_calculate_tax_raises_on_negative():
    with pytest.raises(ValueError):
        calculate_tax(-100, 0.2)  # function doesn't exist yet

# Step 2 (GREEN): write minimum code to pass
def calculate_tax(amount: float, rate: float) -> float:
    if amount < 0:
        raise ValueError("Amount cannot be negative")
    return amount * rate

# Step 3 (REFACTOR): clean up while keeping tests green
```
06 · Refactoring & Technical Debt
Refactoring is changing the internal structure of code without changing its external behaviour. It's not rewriting. It's disciplined, incremental improvement — with tests as your safety net.
```python
# ❌ Arrow anti-pattern — hard to reason about
def process_payment(user, order):
    if user is not None:
        if user.is_active:
            if order.total > 0:
                if user.has_payment_method:
                    charge(user, order)

# ✓ Guard clauses — happy path is unobscured
def process_payment(user, order):
    if user is None: raise UserNotFound()
    if not user.is_active: raise InactiveUser()
    if order.total <= 0: raise EmptyOrder()
    if not user.has_payment_method: raise NoPaymentMethod()
    charge(user, order)
```
| Type | How It Happens | Strategy |
|---|---|---|
| Deliberate / Prudent | Conscious shortcut for a deadline | Document it. Schedule paydown. |
| Deliberate / Reckless | "We'll clean it up later" | Never. Short-termism kills codebases. |
| Inadvertent / Prudent | Learned better patterns after writing | Refactor when you revisit the area. |
| Inadvertent / Reckless | Didn't know better at the time | Invest in team education + code review culture. |
07 · API Design
Your API is a promise. A public API is a permanent promise — once consumers depend on it, changing it has a cost.
REST Principles:
Use nouns, not verbs: /users not /getUsers
HTTP verbs carry meaning: GET=read, POST=create, PATCH=update, DELETE=remove
Pluralise consistently: /users, /orders, /products
Version from day one: /v1/users
Return correct status: 201 Created, 404 Not Found, 422 Unprocessable
Well-designed error response:
HTTP 422

```json
{
  "error": {
    "code": "VALIDATION_FAILED",
    "message": "Request validation failed",
    "details": [
      {
        "field": "email",
        "code": "INVALID_FORMAT",
        "message": "Must be a valid email address"
      }
    ],
    "request_id": "req_abc123"
  }
}
```
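A small helper keeps every endpoint's errors in that shape — a sketch, with field names taken from the example above:

```python
import json

def error_response(code: str, message: str,
                   details: list[dict], request_id: str) -> str:
    """Serialise a consistent error envelope for any failing endpoint."""
    return json.dumps({
        "error": {
            "code": code,
            "message": message,
            "details": details,
            "request_id": request_id,
        }
    })

body = error_response(
    "VALIDATION_FAILED",
    "Request validation failed",
    [{"field": "email", "code": "INVALID_FORMAT",
      "message": "Must be a valid email address"}],
    "req_abc123",
)
```

Centralising the envelope means a consumer can write one error handler for your whole API instead of one per endpoint.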
🛠 Practice: The Gilded Rose Kata
Clone github.com/emilybache/GildedRose-Refactoring-Kata. First write a characterisation test suite capturing current behaviour. Then refactor ruthlessly — tests are your safety net. Finally, add the Conjured items feature cleanly.
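A characterisation test pins down existing behaviour before you touch it: whatever the code does today becomes the expected value, quirks included. The `legacy_quality` function below is a stand-in for illustration, not Gilded Rose itself:

```python
def legacy_quality(quality: int, sell_in: int) -> int:
    """Stand-in legacy function with quirky rules we don't yet understand."""
    if sell_in < 0:
        quality -= 2
    else:
        quality -= 1
    return max(quality, 0)

# Characterisation: record what the code ACTUALLY does across inputs,
# then assert it — any refactor that changes a value breaks the test.
SNAPSHOT = {(10, 5): 9, (10, -1): 8, (0, 5): 0, (1, -1): 0}

def test_characterisation():
    for (quality, sell_in), expected in SNAPSHOT.items():
        assert legacy_quality(quality, sell_in) == expected

test_characterisation()
print("behaviour pinned")
```

Only once this net is in place do you start restructuring; the kata is an exercise in exactly that discipline.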
✅ Mastery Checklist
Technical skill gets you in the room. Communication determines what you do once you're there. The engineers who communicate well get to work on the most important problems, lead the most impactful projects, and shape the direction of the systems they build.
📍 Module Overview
| Property | Value |
|---|---|
| Phase | Ongoing — practice from day one |
| Prerequisite | None — start immediately alongside technical modules |
| Core Topics | 6 |
| Primary Outputs | Design docs · RFCs · Code reviews · Presentations · ADRs |
| Key Mindset Shift | "I just write code" → "I build shared understanding" |
🧠 Communication is Leverage
Your code will be read by ~5 engineers over its lifetime.
Your design doc will shape decisions for 50 engineers.
Your RFC will influence architecture used by 500 engineers.
Your talk will change how 5,000 people think about a problem.
Communication has leverage. Code has limits.
01 · Technical Writing
💡 The Test: Write a one-paragraph plain English description of your system. If you can't, go back to the design. The writing problem is a design problem.
Writing a great RFC
```markdown
# RFC-042: Replace synchronous email sending with async queue

## Status
PROPOSED | 2024-01-15 | Author: @yourname

## Problem
Our checkout endpoint has a p99 latency of 2,400ms.
Profiling shows 1,800ms is waiting for SendGrid API calls.

## Proposed Solution
Replace synchronous email sending with a job queue:
1. Checkout writes email job to Redis queue (< 5ms)
2. Worker pool processes jobs asynchronously
3. Retry with exponential backoff on failure

## Alternatives Considered
A. Use SendGrid's batch API — rejected: doesn't fix latency issue
B. Move emails to Lambda — rejected: adds cold start complexity

## Impact
- Expected p99 improvement: 2,400ms → 600ms
- Email delivery delay: 0ms → < 5 seconds (acceptable per PM)
- Risk: email failures become silent — need alerting

## Open Questions
1. Should we track email job status in the DB for support queries?
```
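The retry-with-exponential-backoff step in that RFC might look like the sketch below. `send_email`, the base delay, and the attempt count are all placeholders, not a real worker implementation:

```python
import time

def backoff_schedule(max_attempts: int, base: float = 0.5) -> list[float]:
    """Delays before each retry: 0.5s, 1s, 2s, 4s, ... (exponential)."""
    return [base * (2 ** i) for i in range(max_attempts)]

def process_with_retry(job: dict, send_email, max_attempts: int = 4) -> bool:
    """Try to send; on failure, wait per the schedule and try again."""
    for delay in backoff_schedule(max_attempts):
        try:
            send_email(job)
            return True
        except Exception:
            time.sleep(delay)  # in production: re-enqueue with a delay instead
    return False  # exhausted — alert, as the RFC's risk section notes

print(backoff_schedule(4))  # [0.5, 1.0, 2.0, 4.0]
```

Note the last line of the RFC's Impact section reappearing in code: exhausting retries must not be silent.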
Writing a great ADR
```markdown
# ADR-017: Use PostgreSQL for primary data store

## Date: 2024-01-10 | Status: Accepted

## Context
Relational data (users → orders → items), complex reporting,
small team (3 engineers), financial operations requiring ACID.

## Decision
PostgreSQL 16 as primary data store.

## Rationale
- Relational data — joins are frequent, not an afterthought
- Team has strong SQL expertise — no learning curve
- JSONB covers flexible-schema edge cases

## Revisit When
- User count exceeds 5M AND write QPS exceeds 5,000
```
02 · Presenting Technical Work
BEFORE WRITING A SINGLE SLIDE, ANSWER:
Who is in the room?
Engineers only → go deep on implementation
Mixed → lead with impact, details later
Executives → business outcome, skip implementation
What do you need them to DO after?
Approve a proposal → give them the decision, not the journey
Understand a system → give them the mental model, not the code
```mermaid
sequenceDiagram
    participant U as User
    participant A as API Gateway
    participant S as Auth Service
    participant D as Database
    U->>A: POST /checkout (JWT token)
    A->>S: Validate token
    S-->>A: User ID + permissions
    A->>D: Begin transaction
    A->>D: Deduct inventory + Create order
    D-->>A: Order ID
    A-->>U: 201 Created {order_id}
```
03 · Code Review Culture
Code review is not gatekeeping — it's teaching. The purpose: catch bugs, share knowledge, maintain consistency, create an audit trail. It is not to prove you're smarter than the author.
BLOCKING: Must change before merge — bug, security issue, contract break
SUGGESTION: Strong recommendation — SRP violation, missing test coverage
NIT: Take it or leave it — style, naming preference
Always label your feedback. The author needs to know the stakes.
04 · Stakeholder Management
| Technical Reality | Business Translation | Why It Lands |
|---|---|---|
| p99 latency increased from 200ms to 2s | The slowest 1% of checkout requests now take 10× longer — this directly impacts conversion rate | Revenue is the universal language |
| We have no test coverage on the payment module | One change to payments could silently break billing with no automated safety net | Risk is the executive's concern |
| The codebase has significant technical debt | We're spending ~40% of engineer time working around old decisions instead of building new features | Opportunity cost resonates |
| We need to migrate to a new auth library | The current library has an unpatched CVE — we're exposed to credential theft until we upgrade | Security risk is always a priority |
| We should refactor the monolith into services | Right now, all teams deploy together — one bug can block everyone. Splitting services lets teams ship independently | Team velocity is a business metric |
THE THREE THINGS STAKEHOLDERS ALWAYS WANT TO KNOW:
1. WHAT IS THE IMPACT?
"We reduced checkout latency by 75%"
→ "This is expected to improve conversion by ~2%, worth ~$400K/year"
2. WHAT ARE THE RISKS?
Be specific about worst case. Quantify it.
3. WHAT DO YOU NEED FROM ME?
"I need your approval to proceed by Friday."
Never leave the meeting without a clear ask.
Surfacing bad news early:
GOOD: "We discovered a migration issue that will delay launch by one week.
Here are three options: A, B, C. We recommend B. Here's why."
BAD: [Silence until two days before launch]
"We have a problem. We need two more weeks."
THE RULE: Surface problems the day you discover them.
Come with options, not just problems.
05 · Disagreement & Alignment
THE LADDER OF DISAGREEMENT:
1. Raise the concern clearly, once, with evidence
2. Make sure you're heard: "Do you understand my concern about X?"
3. Propose an alternative with reasoning
4. Accept the decision and commit fully
WHAT NOT TO DO:
✗ Passive resistance after the decision is made
✗ Endless rehashing
✗ "I told you so" when it goes wrong
✗ Going around the decision-maker
06 · Mentoring & Teaching
💡 The Rule of Three: If you explain something more than twice, write it down. The third time someone asks, send the link.
FORMAT FOR A 45-MIN LUNCH & LEARN:
5 min — Why this matters (motivation, not just content)
10 min — Core concept with concrete example
10 min — Live demo or code walkthrough
10 min — "What would you do if..." scenarios
10 min — Q&A / discussion
RULES:
No slides without code or diagrams
Pause every 10 minutes and ask a question
End with: "What's one thing you'll try this week?"
Write a 1-page summary and share it afterwards
🗺️ Communication Practice Plan
Daily (5 min): Before sending a Slack message, ask: "Am I being clear about what I need?" After every meeting, write one sentence: "The decision was X because Y."
Weekly (30 min): Write one internal Slack post sharing something you learned. Review one of your own PRs 24 hours after submitting — read it as a reviewer.
Monthly (2 hours): Write one RFC or design doc for something you're working on. Give a 10-minute lightning talk to your team.
Quarterly: Write a post-mortem for any incident. Get feedback on your writing from a senior engineer you respect.
✅ Mastery Checklist
📚 Resources
| Resource | Type | Why |
|---|---|---|
| The Staff Engineer's Path (Tanya Reilly) | Book | Best book on engineering communication and influence |
| Writing for Software Developers (Philip Kiely) | Book | Technical writing specifically for engineers |
| An Elegant Puzzle (Will Larson) | Book | Engineering management and communication patterns |
| The Pyramid Principle (Barbara Minto) | Book | Structured communication from McKinsey |