I've seen too many teams deploy to Kubernetes and hit the same problems over and over. Resources getting killed unexpectedly. Pods not scheduling. Services not accessible.
After managing hundreds of Kubernetes deployments, I've learned that most issues are preventable. Here's the checklist I use before any production deployment.
The Pre-Deployment Checklist
1. Resource Limits and Requests (This One is Critical)
Kubernetes needs to know how much CPU and memory your pods need. Without resource requests and limits, your pods can:
- Get killed when nodes run out of resources
- Starve other pods on the same node
- Fail to schedule when nodes don't have enough capacity
Always set both requests and limits:
resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    cpu: "200m"
Requests tell the scheduler how much your pod needs, so it only lands on nodes with enough capacity. Limits cap what the pod can actually consume: CPU above the limit gets throttled, and memory above the limit gets the container OOM-killed. Set both.
2. Health Checks (Liveness and Readiness Probes)
Without health checks, Kubernetes doesn't know whether your pod is actually working. I've seen pods that appear to be running but are completely broken.
Always set both liveness and readiness probes:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Liveness probe: if this fails, Kubernetes restarts the container. Readiness probe: if this fails, Kubernetes stops sending traffic to the pod.
3. Pod Disruption Budgets (PDBs)
When Kubernetes nodes need maintenance (node drains, cluster upgrades), pods get evicted. Without PDBs, all your pods might be evicted at once, causing downtime.
Always set a Pod Disruption Budget:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
This tells Kubernetes: never voluntarily evict a pod if that would leave fewer than 2 running. Your application stays available during node maintenance and cluster upgrades.
4. Replica Counts
Running one replica is asking for downtime. Running too many replicas is wasting money.
For production, I recommend:
- At least 2 replicas (for high availability)
- More replicas if you expect traffic spikes
- Use Horizontal Pod Autoscaler (HPA) for automatic scaling
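For the autoscaling piece, here's a minimal HPA sketch that keeps at least 2 replicas and scales on CPU utilization. The names and thresholds are illustrative, so adjust them to your workload:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                   # illustrative Deployment name
  minReplicas: 2                   # keeps the high-availability floor from the point above
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out when average CPU passes 70% of requests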
5. Namespace Isolation
Don't put everything in the default namespace. Use namespaces to:
- Isolate environments (dev, staging, prod)
- Organize resources
- Apply different resource quotas (see the sketch after this list)
- Control access with RBAC
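As a sketch of the isolation and quota points above, here's a dedicated production namespace with its own resource quota. The names and numbers are illustrative:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app-prod
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: my-app-prod-quota
  namespace: my-app-prod
spec:
  hard:
    requests.cpu: "4"        # total CPU requests allowed in this namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi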
6. Network Policies
By default, all pods in Kubernetes can talk to each other. That's a security risk. Use Network Policies to restrict traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-app-netpol
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  # Note: listing Egress with no egress rules denies all outbound traffic
  # from these pods (including DNS). Add egress rules, or drop Egress from
  # policyTypes if that isn't what you want.
7. Secrets Management
Don't hardcode passwords, API keys, or other secrets in your YAML files. Use Kubernetes Secrets or external secret management (like AWS Secrets Manager, HashiCorp Vault).
Never commit secrets to Git. Ever. I've seen teams do this. Don't be that team.
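Here's a minimal sketch of what that looks like in the pod spec, assuming a Secret named my-app-secrets has already been created out of band (with kubectl, a Vault operator, External Secrets, or similar) and never committed to Git:

# Container spec fragment: pull the value from a Secret instead of hardcoding it
env:
- name: DATABASE_PASSWORD
  valueFrom:
    secretKeyRef:
      name: my-app-secrets        # illustrative Secret name, created outside of version control
      key: database-password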
8. Persistent Volume Claims
If your pods need storage, use PersistentVolumeClaims (PVCs). Don't rely on local storage on nodes - nodes can be replaced, and your data will be lost.
Set appropriate storage classes and retention policies.
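A minimal PVC sketch; the storage class name is illustrative, so use one that actually exists in your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3            # illustrative; pick a class with the retention behavior you need
  resources:
    requests:
      storage: 10Gi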
9. Image Pull Policies
Always use specific image tags, not "latest". With "latest" you never know which version is actually running and you can't roll back reliably. It also changes the default image pull policy: Kubernetes defaults to Always for "latest", so the image can change underneath you on every pod restart, while pinned tags default to IfNotPresent:
# Good
image: my-app:v1.2.3
# Bad
image: my-app:latest
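A minimal container spec sketch that pins the tag and makes the pull policy explicit (names are illustrative):

containers:
- name: my-app
  image: my-app:v1.2.3
  imagePullPolicy: IfNotPresent    # explicit here; also the default for non-"latest" tags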
10. Service Accounts and RBAC
Don't use the default service account. Create service accounts for your applications with least-privilege access:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: my-app-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
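The Role does nothing on its own: it has to be bound to the service account, and the Deployment's pod spec has to opt in with serviceAccountName. A minimal sketch of the binding (the namespace is illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: my-app-role
subjects:
- kind: ServiceAccount
  name: my-app-sa
  namespace: my-app-prod           # the namespace where my-app-sa lives
# In the Deployment's pod spec:
#   serviceAccountName: my-app-sa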
What I Check Before Every Production Deployment
- ✓ Resource requests and limits are set
- ✓ Health checks are configured
- ✓ Pod Disruption Budgets are in place
- ✓ At least 2 replicas are running
- ✓ Namespaces are properly organized
- ✓ Network policies restrict traffic
- ✓ Secrets are not hardcoded
- ✓ Persistent volumes are used (if needed)
- ✓ Image tags are specific (not "latest")
- ✓ Service accounts have least-privilege access
Common Mistakes I See
Here's what I see teams do wrong most often:
- No resource limits: Pods consume all available resources, causing node instability
- No health checks: Broken pods keep getting traffic
- Single replica: Any pod restart causes downtime
- Using "latest" tags: Can't reproduce issues, can't roll back
- Hardcoded secrets: Security risk, can't rotate credentials
Start Small, Build Up
Don't try to implement everything at once. Start with the basics:
- Set resource limits (prevents most stability issues)
- Add health checks (ensures traffic only goes to healthy pods)
- Use at least 2 replicas (provides high availability)
Then add the rest as you go. Security policies, PDBs, network policies - these are important, but the basics matter more.
Kubernetes is powerful, but it's not magic. You still need to configure it properly. Use this checklist, and you'll avoid most of the common issues I see teams hit.
What Kubernetes issues have you run into? What would you add to this checklist? I'd love to hear what's worked (or hasn't worked) for your deployments.