Error Handling
Never expose stack traces to clients—log internally, return generic message
Structured error responses: code, message, request ID—enables debugging without leaking
Fail fast on bad input—validate at entry point, not deep in business logic
Unexpected errors: 500 + alert—expected errors: appropriate 4xx
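A minimal sketch of the rules above, assuming a framework-agnostic handler. The names ClientError and error_response, and the shape of the error body, are illustrative assumptions rather than any particular framework's API.

```python
import json
import logging
import traceback
import uuid

logger = logging.getLogger("api.errors")


class ClientError(Exception):
    """Expected error caused by the caller; maps to a 4xx status."""

    def __init__(self, status, code, message):
        super().__init__(message)
        self.status = status
        self.code = code
        self.message = message


def error_response(exc, request_id=None):
    """Return (status, JSON body) for an exception without leaking internals."""
    request_id = request_id or str(uuid.uuid4())
    if isinstance(exc, ClientError):
        status, code, message = exc.status, exc.code, exc.message
    else:
        # Unexpected: the stack trace goes to the logs, the client gets a generic message.
        trace = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
        logger.error("unhandled error request_id=%s\n%s", request_id, trace)
        status, code, message = 500, "internal_error", "Internal server error"
    body = json.dumps(
        {"error": {"code": code, "message": message, "request_id": request_id}}
    )
    return status, body
```

The request ID appears in both the log line and the response, so a client report can be matched to the internal trace.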
Input Validation
Validate everything from outside—query params, headers, body, path params
Whitelist valid input, don't blacklist bad—reject unknown fields
Validate early, before any processing—save resources, clearer errors
Size limits on all inputs—prevent memory exhaustion attacks
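One way to apply these rules at the entry point, sketched with the standard library only. The schema (email, age), the 64 KiB limit, and the function name validate_signup are assumptions for illustration.

```python
import json

MAX_BODY_BYTES = 64 * 1024                   # assumed size limit
ALLOWED_FIELDS = {"email": str, "age": int}  # hypothetical schema


def validate_signup(raw_body: bytes) -> dict:
    """Validate a JSON request body; raise ValueError on any problem."""
    # Check size before parsing so oversized input is never loaded.
    if len(raw_body) > MAX_BODY_BYTES:
        raise ValueError("body too large")
    try:
        data = json.loads(raw_body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        raise ValueError("body is not valid JSON")
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    # Allowlist: anything not explicitly expected is rejected.
    unknown = set(data) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    for name, expected in ALLOWED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"field {name} must be {expected.__name__}")
    if "@" not in data["email"] or len(data["email"]) > 254:
        raise ValueError("invalid email")
    if not 0 <= data["age"] <= 150:
        raise ValueError("age out of range")
    return data
```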
Timeouts Everywhere
Database queries: set timeout, typically 5-30s
External HTTP calls: connect timeout + read timeout—don't wait forever
Overall request timeout—gateway or middleware level
Background jobs: max execution time—prevent zombie processes
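A sketch of an overall request deadline at the middleware level, assuming an asyncio-based service. The helper with_deadline and the 504 response shape are hypothetical; database and HTTP client timeouts would still be set on their own clients.

```python
import asyncio


async def with_deadline(coro, seconds):
    """Await coro, but give up and return a 504-style result after the deadline."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising.
        return {"status": 504, "error": "request timed out"}


async def slow_handler(delay):
    """Stand-in for a handler that waits on a slow dependency."""
    await asyncio.sleep(delay)
    return {"status": 200}
```

For example, asyncio.run(with_deadline(slow_handler(5), 1.0)) returns the 504 result after about one second instead of waiting the full five.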
Retry Patterns
Exponential backoff: 1s, 2s, 4s, 8s... up to a cap; gives a failing dependency time to recover
Add jitter: randomize delay; prevents synchronized retries (thundering herd)
Idempotency keys for non-idempotent operations—safe to retry
Circuit breaker for failing dependencies—stop hammering, fail fast
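A sketch of retrying with capped exponential backoff and jitter. It uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling), which is one common choice and an assumption here. The function names and the default set of retryable exceptions are also illustrative; the circuit breaker is not shown.

```python
import random
import time


def backoff_delays(base=1.0, cap=30.0, retries=5, rng=random.random):
    """Yield one jittered delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        yield rng() * min(cap, base * 2 ** n)


def retry(fn, retry_on=(ConnectionError, TimeoutError), sleep=time.sleep, **backoff):
    """Call fn; on a retryable error, wait and try again until retries run out."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)
```

Only wrap operations that are safe to repeat, either naturally idempotent or protected by an idempotency key.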
Database Practices
Connection pooling: reuse connections—creating is expensive
Keep transaction scope minimal: hold locks briefly
Read replicas for read-heavy workloads—separate read/write traffic
Prepared statements always—SQL injection prevention, query plan cache
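A small illustration of parameter binding and a short transaction, using the standard-library sqlite3 module and an in-memory database. The users table and helper names are made up for the example; other drivers use different placeholder syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def add_user(conn, name):
    # "with conn" commits on success and rolls back on error,
    # so the transaction covers only this one statement.
    with conn:
        cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    return cur.lastrowid


def find_users(conn, name):
    # The value is bound as data, never spliced into the SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A hostile value such as x' OR '1'='1 is stored and compared as an ordinary string because it never becomes part of the statement.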
Caching Strategy
Cache invalidation strategy decided upfront—TTL, event-based, or both
Cache at right layer: query result, computed value, HTTP response
Cache stampede prevention—lock or probabilistic early expiration
Monitor hit rate—low hit rate = wasted resources
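A sketch of TTL caching with lock-based stampede prevention in a single process. TTLCache is a hypothetical class; a multi-instance deployment would need a shared lock (for example in the cache server) rather than threading.Lock. The hit and miss counters are approximate under heavy concurrency.

```python
import threading
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}    # key -> (expires_at, value)
        self._locks = {}   # key -> lock guarding recomputation of that key
        self._guard = threading.Lock()
        self.hits = 0
        self.misses = 0

    def _fresh(self, key):
        entry = self._data.get(key)
        return entry if entry and entry[0] > self._clock() else None

    def get_or_compute(self, key, compute):
        entry = self._fresh(key)
        if entry:
            self.hits += 1
            return entry[1]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the entry while we waited.
            entry = self._fresh(key)
            if entry:
                self.hits += 1
                return entry[1]
            self.misses += 1
            value = compute()
            self._data[key] = (self._clock() + self._ttl, value)
            return value
```

When many threads miss on the same key at once, only the first one runs compute; the rest wait on the lock and read the stored value.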
Rate Limiting
Per-user/IP limits on expensive operations—login, signup, search
Different limits for different operations—read vs write
Return 429 with a Retry-After header: tell clients when to retry
Rate limit early in request pipeline—save resources
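A sketch of per-client limiting with a token bucket, kept in process memory for brevity. The token bucket is one common algorithm, chosen here as an assumption; the names TokenBucket and check_rate_limit are illustrative, and a multi-instance service would keep the counters in a shared store.

```python
import math
import time


class TokenBucket:
    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Return (allowed, seconds until one token is available)."""
        now = self.clock()
        # Refill for the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0
        return False, math.ceil((1 - self.tokens) / self.rate)


buckets = {}  # client id (user or IP) -> TokenBucket


def check_rate_limit(client_id, rate_per_sec=1.0, capacity=5, clock=time.monotonic):
    """Return (status, headers); intended to run before any expensive work."""
    if client_id not in buckets:
        buckets[client_id] = TokenBucket(rate_per_sec, capacity, clock)
    allowed, retry_after = buckets[client_id].try_acquire()
    if allowed:
        return 200, {}
    return 429, {"Retry-After": str(retry_after)}
```

Different operations can get their own limits by keying the bucket on both client and operation, for example "user42:login".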
Health Checks
Liveness: is process running—restart if fails
Readiness: can handle traffic—remove from load balancer if fails
Startup probe for slow-starting services—don't kill during init
Health checks fast and cheap—don't hit database on every probe
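A sketch of separate liveness and readiness checks in which the dependency check is cached for a few seconds so probes stay cheap. HealthState, its started flag, and the 5-second cache window are assumptions for illustration.

```python
import time


class HealthState:
    def __init__(self, check_dependency, cache_seconds=5.0, clock=time.monotonic):
        self._check = check_dependency   # callable returning True when healthy
        self._cache_seconds = cache_seconds
        self._clock = clock
        self._last = None                # (checked_at, ok)
        self.started = False             # set True once initialization finishes

    def liveness(self):
        # The process is up and can answer; no dependency calls here.
        return 200

    def readiness(self):
        if not self.started:
            return 503
        now = self._clock()
        if self._last is None or now - self._last[0] >= self._cache_seconds:
            try:
                ok = bool(self._check())
            except Exception:
                ok = False
            self._last = (now, ok)
        return 200 if self._last[1] else 503
```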
Graceful Shutdown
Stop accepting new requests first—drain load balancer
Wait for in-flight requests to complete—with timeout
Close database connections cleanly—prevent connection leaks
SIGTERM handling: graceful; SIGKILL after timeout
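A sketch of the drain sequence: stop accepting, wait for in-flight requests with a timeout, then let the caller close resources. GracefulServer and install_sigterm_handler are hypothetical names; a real server would call begin_request and end_request around each handler.

```python
import signal
import threading


class GracefulServer:
    def __init__(self, drain_timeout=10.0):
        self.drain_timeout = drain_timeout
        self.accepting = True
        self._in_flight = 0
        self._cond = threading.Condition()

    def begin_request(self):
        """Return False once shutdown has started, so the caller can answer 503."""
        with self._cond:
            if not self.accepting:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self):
        """Stop accepting, then wait for in-flight work; True if fully drained."""
        with self._cond:
            self.accepting = False
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=self.drain_timeout
            )


def install_sigterm_handler(server, on_drained):
    """Register from the main thread; on_drained(ok) can close DB connections and exit."""
    def handle(signum, frame):
        # Drain in a normal thread so the signal handler returns immediately.
        threading.Thread(target=lambda: on_drained(server.shutdown())).start()
    signal.signal(signal.SIGTERM, handle)
```

The drain timeout should be shorter than the orchestrator's grace period, so cleanup finishes before SIGKILL arrives.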
Logging
Structured logs (JSON)—parseable by log aggregators
Request ID in every log—trace request across services
Log level appropriate: debug for dev, info/error for prod
Sensitive data never logged—passwords, tokens, PII
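A sketch of JSON logs with a request ID and key-based redaction, built on the standard logging module. The formatter name, the fields convention passed through extra, and the list of sensitive keys are assumptions. Redacting by key name is a safety net, not a substitute for keeping secrets out of log calls.

```python
import json
import logging

SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}  # assumed list


class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        # Extra structured fields arrive via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload, default=str)
```

Usage, assuming a handler has been given this formatter: logger.info("login ok", extra={"request_id": "req-1", "fields": {"user": "alice"}}).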
API Design
Versioning strategy from day one—path (/v1/) or header
Pagination for list endpoints: cursor or offset; include total count where it is cheap to compute
Consistent response format—same envelope everywhere
Meaningful status codes—201 for create, 204 for delete, 404 for not found
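A sketch of cursor pagination with a consistent envelope, over an in-memory list standing in for a query. The envelope keys (data, pagination, next_cursor) and the cursor encoding are illustrative assumptions. IDs are assumed to be positive integers sorted ascending.

```python
import base64
import json


def encode_cursor(last_id):
    """Opaque cursor: clients pass it back without interpreting it."""
    return base64.urlsafe_b64encode(json.dumps({"after": last_id}).encode()).decode()


def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after"]


def list_items(items, limit=2, cursor=None):
    """Return one page of items (dicts with an integer 'id') in a fixed envelope."""
    after = decode_cursor(cursor) if cursor else 0
    # Fetch one extra row to learn whether another page exists.
    page = [it for it in items if it["id"] > after][: limit + 1]
    has_more = len(page) > limit
    page = page[:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return {"data": page, "pagination": {"next_cursor": next_cursor, "limit": limit}}
```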
Security Hygiene
Secrets from environment or vault—never in code or config files
Dependencies updated regularly—automated with Dependabot/Renovate
Principle of least privilege—service accounts with minimal permissions
Authentication and authorization separated—who you are vs what you can do
Observability
Metrics: request rate, error rate, latency percentiles (duration); the RED method
Distributed tracing for microservices—follow request across services
Alerting on symptoms, not causes—high error rate, not CPU usage
Dashboards for operational visibility—know normal to spot abnormal
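A sketch of computing rate, error, and duration figures from raw request samples, using nearest-rank percentiles. In practice a metrics library would aggregate these with histograms; the function names and the (status, duration) sample format are assumptions for illustration.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list (p in 1..100)."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """requests: list of (status_code, duration_ms) seen in the window."""
    durations = sorted(duration for _, duration in requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_per_sec": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        "p50_ms": percentile(durations, 50) if durations else None,
        "p99_ms": percentile(durations, 99) if durations else None,
    }
```

Percentiles expose slow outliers that an average would hide, which is why latency is tracked as p50 and p99 rather than a mean.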