DevOps

DevOps Rules

CI/CD Pipelines

Fail fast: run linting and unit tests before expensive integration tests — saves time and compute
Cache dependencies between runs — npm install on every build wastes minutes
Pin action versions with SHA, not tags — actions/checkout@v3 can change, SHA is immutable
Secrets in environment variables, never in code or logs — mask them in CI output
Parallel jobs for independent steps — test, lint, and build can run simultaneously

Deployment Strategies

Blue-green: run new version alongside old, switch traffic atomically — instant rollback by switching back
Canary: route percentage of traffic to new version — catch issues before full rollout
Rolling: update instances incrementally — balance between speed and risk
Always have rollback plan before deploying — know exactly how to revert
Deploy the same artifact to all environments — build once, promote through stages

Infrastructure as Code

Version control all infrastructure — terraform, ansible, cloudformation in git
Never apply changes without plan/diff review — terraform plan before apply
State files contain secrets — store remotely with encryption, never in git
Modules for reusable components — don't copy-paste infrastructure definitions
Separate environments with workspaces or directories — dev changes shouldn't affect prod

Containers

One process per container — containers are not VMs
Health checks are mandatory — orchestrators need them for routing and restarts
Don't run as root — use non-root USER in Dockerfile
Immutable images: config via environment, not baked in — same image in all environments
Tag images with git SHA, not just latest — know exactly what's deployed

Secrets Management

Never store secrets in environment files committed to git — use vault, sealed secrets, or CI secret storage
Rotate secrets regularly — automation makes rotation painless
Different secrets per environment — dev leak shouldn't compromise prod
Audit secret access — know who accessed what and when
Secrets in memory, not disk when possible — temp files persist longer than expected

Monitoring & Alerting

Four golden signals: latency, traffic, errors, saturation — start here
Alert on symptoms, not causes — "users seeing errors" not "CPU high"
Every alert must be actionable — if you can't do anything, it's noise
Dashboard per service with key metrics — one glance shows health
Structured logs (JSON) for machine parsing — grep works, but queries are better

Reliability

Define SLOs before building alerting — what does "healthy" mean for this service?
Error budgets: some failures are acceptable — 99.9% means 8 hours downtime/year is OK
Chaos engineering in staging — break things intentionally before prod breaks accidentally
Runbooks for common incidents — 3am is not the time to figure out recovery steps
Post-mortems without blame — focus on systems, not people

Common Mistakes

SSH into prod to fix things — all changes through automation, or you'll forget what you did
No staging environment — "works on my machine" doesn't mean works in prod
Ignoring flaky tests — they erode trust in CI, either fix or delete
Manual steps in deployment — if it's not automated, it'll be done wrong eventually
Monitoring only happy paths — check error rates and edge cases too

Networking

Internal services don't need public IPs — use private subnets, expose only load balancers
TLS everywhere, including internal traffic — zero trust, even behind firewall
DNS for service discovery — hardcoded IPs break when things move
Load balancer health checks separate from app health — LB needs fast response, app health can be thorough
Firewall default deny — explicitly allow what's needed, block everything else