I've seen too many teams deploy to Kubernetes and hit the same problems over and over. Resources getting killed unexpectedly. Pods not scheduling. Services not accessible.
After managing hundreds of Kubernetes deployments, I've learned that most issues are preventable. Here's the checklist I use before any production deployment.
The Pre-Deployment Checklist
1. Resource Limits and Requests (This One is Critical)
Kubernetes needs to know how much CPU and memory your pods need. Without resource requests and limits, your pods can:
- Get killed when nodes run out of resources
- Starve other pods on the same node
- Fail to schedule when nodes don't have enough capacity
Always set both requests and limits:
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"
Requests tell the scheduler how much your pod needs, so it only lands on nodes with enough capacity. Limits cap what the pod can actually consume: CPU above the limit gets throttled, and memory above the limit gets the container OOM-killed. Set both.
2. Health Checks (Liveness and Readiness Probes)
Without health checks, Kubernetes doesn't know whether your pod is actually working. I've seen pods that appear to be running but are completely broken.
Always set both liveness and readiness probes:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Liveness probe: if this fails, Kubernetes restarts the container. Readiness probe: if this fails, Kubernetes stops sending traffic to the pod.
3. Pod Disruption Budgets (PDBs)
When Kubernetes nodes need maintenance (node drains, cluster upgrades), pods get evicted. Without PDBs, all your pods might be evicted at once, causing downtime.
Always set a Pod Disruption Budget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
This tells Kubernetes: never voluntarily evict a pod if that would leave fewer than 2 running. Your application stays available during node maintenance and cluster upgrades.
4. Replica Counts
Running one replica is asking for downtime. Running too many replicas is wasting money.
For production, I recommend:
- At least 2 replicas (for high availability)
- More replicas if you expect traffic spikes
- Use Horizontal Pod Autoscaler (HPA) for automatic scaling
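For the autoscaling piece, here's a minimal HPA sketch that keeps at least 2 replicas and scales on CPU utilization. The names and thresholds are illustrative, so adjust them to your workload:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                   # illustrative Deployment name
  minReplicas: 2                   # keeps the high-availability floor from the point above
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out when average CPU passes 70% of requests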
5. Namespace Isolation
Don't put everything in the default namespace. Use namespaces to:
- Isolate environments (dev, staging, prod)
- Organize resources
- Apply different resource quotas (see the sketch after this list)
- Control access with RBAC
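As a sketch of the isolation and quota points above, here's a dedicated production namespace with its own resource quota. The names and numbers are illustrative:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app-prod
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: my-app-prod-quota
  namespace: my-app-prod
spec:
  hard:
    requests.cpu: "4"        # total CPU requests allowed in this namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi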
6. Network Policies
By default, all pods in Kubernetes can talk to each other. That's a security risk. Use Network Policies to restrict traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-app-netpol
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  # Note: listing Egress with no egress rules denies all outbound traffic
  # from these pods (including DNS). Add egress rules, or drop Egress from
  # policyTypes if that isn't what you want.
7. Secrets Management
Don't hardcode passwords, API keys, or other secrets in your YAML files. Use Kubernetes Secrets or external secret management (like AWS Secrets Manager, HashiCorp Vault).
Never commit secrets to Git. Ever. I've seen teams do this. Don't be that team.
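Here's a minimal sketch of what that looks like in the pod spec, assuming a Secret named my-app-secrets has already been created out of band (with kubectl, a Vault operator, External Secrets, or similar) and never committed to Git:

# Container spec fragment: pull the value from a Secret instead of hardcoding it
env:
- name: DATABASE_PASSWORD
  valueFrom:
    secretKeyRef:
      name: my-app-secrets        # illustrative Secret name, created outside of version control
      key: database-password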
8. Persistent Volume Claims
If your pods need storage, use PersistentVolumeClaims (PVCs). Don't rely on local storage on nodes - nodes can be replaced, and your data will be lost.
Set appropriate storage classes and retention policies.
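A minimal PVC sketch; the storage class name is illustrative, so use one that actually exists in your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3            # illustrative; pick a class with the retention behavior you need
  resources:
    requests:
      storage: 10Gi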
9. Image Pull Policies
Always use specific image tags, not "latest". With "latest" you never know which version is actually running and you can't roll back reliably. It also changes the default image pull policy: Kubernetes defaults to Always for "latest", so the image can change underneath you on every pod restart, while pinned tags default to IfNotPresent:
# Good
image: my-app:v1.2.3
# Bad
image: my-app:latest
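A minimal container spec sketch that pins the tag and makes the pull policy explicit (names are illustrative):

containers:
- name: my-app
  image: my-app:v1.2.3
  imagePullPolicy: IfNotPresent    # explicit here; also the default for non-"latest" tags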
10. Service Accounts and RBAC
Don't use the default service account. Create service accounts for your applications with least-privilege access:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
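The Role does nothing on its own: it has to be bound to the service account, and the Deployment's pod spec has to opt in with serviceAccountName. A minimal sketch of the binding (the namespace is illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-app-role
subjects:
- kind: ServiceAccount
  name: my-app-sa
  namespace: my-app-prod           # the namespace where my-app-sa lives
# In the Deployment's pod spec:
#   serviceAccountName: my-app-sa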
What I Check Before Every Production Deployment
- ✓ Resource requests and limits are set
- ✓ Health checks are configured
- ✓ Pod Disruption Budgets are in place
- ✓ At least 2 replicas are running
- ✓ Namespaces are properly organized
- ✓ Network policies restrict traffic
- ✓ Secrets are not hardcoded
- ✓ Persistent volumes are used (if needed)
- ✓ Image tags are specific (not "latest")
- ✓ Service accounts have least-privilege access
Common Mistakes I See
Here's what I see teams do wrong most often:
- No resource limits: Pods consume all available resources, causing node instability
- No health checks: Broken pods keep getting traffic
- Single replica: Any pod restart causes downtime
- Using "latest" tags: Can't reproduce issues, can't roll back
- Hardcoded secrets: Security risk, can't rotate credentials
Start Small, Build Up
Don't try to implement everything at once. Start with the basics:
- Set resource limits (prevents most stability issues)
- Add health checks (ensures traffic only goes to healthy pods)
- Use at least 2 replicas (provides high availability)
Then add the rest as you go. Security policies, PDBs, network policies - these are important, but the basics matter more.
Kubernetes is powerful, but it's not magic. You still need to configure it properly. Use this checklist, and you'll avoid most of the common issues I see teams hit.
What Kubernetes issues have you run into? What would you add to this checklist? I'd love to hear what's worked (or hasn't worked) for your deployments.