DevOps

Automate deployments, manage infrastructure, and build reliable CI/CD pipelines.

Instalar
$clawhub install devops

DevOps Rules

CI/CD Pipelines

  • Fail fast: run linting and unit tests before expensive integration tests — saves time and compute

  • Cache dependencies between runs — npm install on every build wastes minutes

  • Pin action versions with SHA, not tags — actions/checkout@v3 can change, SHA is immutable

  • Secrets in environment variables, never in code or logs — mask them in CI output

  • Parallel jobs for independent steps — test, lint, and build can run simultaneously

Deployment Strategies

  • Blue-green: run new version alongside old, switch traffic atomically — instant rollback by switching back

  • Canary: route percentage of traffic to new version — catch issues before full rollout

  • Rolling: update instances incrementally — balance between speed and risk

  • Always have rollback plan before deploying — know exactly how to revert

  • Deploy the same artifact to all environments — build once, promote through stages

Infrastructure as Code

  • Version control all infrastructure — terraform, ansible, cloudformation in git

  • Never apply changes without plan/diff review — terraform plan before apply

  • State files contain secrets — store remotely with encryption, never in git

  • Modules for reusable components — don't copy-paste infrastructure definitions

  • Separate environments with workspaces or directories — dev changes shouldn't affect prod

Containers

  • One process per container — containers are not VMs

  • Health checks are mandatory — orchestrators need them for routing and restarts

  • Don't run as root — use non-root USER in Dockerfile

  • Immutable images: config via environment, not baked in — same image in all environments

  • Tag images with git SHA, not just latest — know exactly what's deployed

Secrets Management

  • Never store secrets in environment files committed to git — use vault, sealed secrets, or CI secret storage

  • Rotate secrets regularly — automation makes rotation painless

  • Different secrets per environment — dev leak shouldn't compromise prod

  • Audit secret access — know who accessed what and when

  • Secrets in memory, not disk when possible — temp files persist longer than expected

Monitoring & Alerting

  • Four golden signals: latency, traffic, errors, saturation — start here

  • Alert on symptoms, not causes — "users seeing errors" not "CPU high"

  • Every alert must be actionable — if you can't do anything, it's noise

  • Dashboard per service with key metrics — one glance shows health

  • Structured logs (JSON) for machine parsing — grep works, but queries are better

Reliability

  • Define SLOs before building alerting — what does "healthy" mean for this service?

  • Error budgets: some failures are acceptable — 99.9% means 8 hours downtime/year is OK

  • Chaos engineering in staging — break things intentionally before prod breaks accidentally

  • Runbooks for common incidents — 3am is not the time to figure out recovery steps

  • Post-mortems without blame — focus on systems, not people

Common Mistakes

  • SSH into prod to fix things — all changes through automation, or you'll forget what you did

  • No staging environment — "works on my machine" doesn't mean works in prod

  • Ignoring flaky tests — they erode trust in CI, either fix or delete

  • Manual steps in deployment — if it's not automated, it'll be done wrong eventually

  • Monitoring only happy paths — check error rates and edge cases too

Networking

  • Internal services don't need public IPs — use private subnets, expose only load balancers

  • TLS everywhere, including internal traffic — zero trust, even behind firewall

  • DNS for service discovery — hardcoded IPs break when things move

  • Load balancer health checks separate from app health — LB needs fast response, app health can be thorough

  • Firewall default deny — explicitly allow what's needed, block everything else