Kubernetes Pod Troubleshooting
This skill provides an interactive, step-by-step approach to troubleshooting common Kubernetes Pod issues in Azure Kubernetes Service (AKS), including restarts, OOM kills, and status problems.
Interactive Troubleshooting Workflow
When a user reports a Pod issue, follow this sequential workflow. Execute each step in order, analyze the output, and proceed based on findings. Continue until all diagnostic information is collected and downloaded.
Step 1: Establish Access to AKS Cluster
Execute the following commands in sequence to set up access:
- Authenticate with Azure:
  az login
  Wait for successful login confirmation.
- List available AKS clusters:
  az aks list --output table
  Identify the target cluster name and resource group.
- Get cluster credentials (replace the placeholders with actual values):
  az aks get-credentials --resource-group <resource-group-name> --name <cluster-name>
- Verify the connection:
  kubectl cluster-info
  kubectl config current-context
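As a concrete illustration, a run against a hypothetical cluster (resource group rg-aks-prod and cluster aks-prod-01 are made-up names for this sketch; --overwrite-existing simply refreshes an existing kubeconfig entry):
  az login
  az aks list --output table
  az aks get-credentials --resource-group rg-aks-prod --name aks-prod-01 --overwrite-existing
  kubectl config current-context
  kubectl cluster-info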
Step 2: Identify Target Pod
- List all namespaces:
  kubectl get namespaces
- List pods in the suspected namespace (or in all namespaces):
  kubectl get pods -n <namespace>
  OR
  kubectl get pods --all-namespaces
- Identify problematic pods by looking for:
  - High restart counts
  - Status: CrashLoopBackOff, Error, Pending
  - Not in Running state
- Note the pod details:
  - Pod name
  - Namespace
  - Current status
  - Restart count
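Two hedged one-liners can surface problem pods quickly: the first filters the STATUS column that kubectl prints, the second sorts by the first container's restart count so the worst offenders appear at the bottom of the list.
  kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'
  kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'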
Step 3: Gather Pod Status Information
Execute these commands for the identified pod:
- Get a detailed pod description:
  kubectl describe pod <pod-name> -n <namespace>
  Analyze the output for status, containers, events, and volumes.
- Check pod-specific events:
  kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp -n <namespace>
  Look for resource constraints, image pull failures, network issues, or scheduling problems.
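If the pod-specific filter returns nothing useful, a broader sketch that lists only Warning events for the whole namespace (a standard kubectl field selector) often reveals scheduling or eviction problems:
  kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.metadata.creationTimestamp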
Step 4: Examine Pod Logs
Execute log retrieval commands:
- Get current container logs:
  kubectl logs <pod-name> -n <namespace>
  Analyze for error messages, stack traces, and application issues.
- If the pod has restarted, get the previous container's logs:
  kubectl logs <pod-name> --previous -n <namespace>
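For multi-container pods or very verbose applications, these standard kubectl flags narrow the output; the container name is a placeholder you can discover with the first command:
  kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
  kubectl logs <pod-name> -c <container-name> -n <namespace> --tail=200 --since=1h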
Step 4.5: Generate Thread Dump (for Java applications)
If the pod runs a Java application and you need to analyze thread states:
- Enter the pod container:
  kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
- Generate a thread dump inside the container:
  jstack 1 > thread_dump.txt
  Note: PID 1 is typically the Java process in containers.
- Exit the pod shell:
  exit
- Copy the thread dump file to the local machine:
  kubectl cp <pod-name>:thread_dump.txt -n <namespace> ./thread_dump.txt
- Analyze the thread dump for:
  - Deadlocks
  - Blocked threads
  - High CPU usage threads
  - Thread state analysis
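If the image ships without a shell (distroless base images, for example), a hedged non-interactive alternative streams the dump straight to a local file, assuming jstack is on the container's PATH and PID 1 is the JVM:
  kubectl exec <pod-name> -n <namespace> -- jstack 1 > thread_dump.txt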
Step 4.6: Generate Heap Dump (for Java applications)
If the pod runs a Java application and you need to analyze memory usage:
- Enter the pod container:
  kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
- Generate a heap dump inside the container:
  jmap -dump:live,format=b,file=heapdump.hprof 1
  Note: PID 1 is typically the Java process in containers. This creates a binary heap dump file.
- Exit the pod shell:
  exit
- Copy the heap dump file to the local machine:
  kubectl cp <pod-name>:heapdump.hprof -n <namespace> ./heapdump.hprof
- Analyze the heap dump using tools such as:
  - Eclipse Memory Analyzer (MAT)
  - VisualVM
  - jhat (Java Heap Analysis Tool)
  Look for memory leaks, large objects, and memory usage patterns.
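On JDK 9 and later, jcmd is the recommended replacement for jmap; a non-interactive sketch, assuming jcmd is present in the image, PID 1 is the JVM, and /tmp is writable:
  kubectl exec <pod-name> -n <namespace> -- jcmd 1 GC.heap_dump /tmp/heapdump.hprof
  kubectl cp <pod-name>:/tmp/heapdump.hprof -n <namespace> ./heapdump.hprof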
Step 5: Check Resource Usage (if applicable)
If resource issues are suspected:
- Check resource usage:
  kubectl top pods -n <namespace>
- Look for OOMKilled in the describe output or for high memory/CPU usage.
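To confirm an OOM kill directly from the pod status, a JSONPath sketch over standard pod status fields; it prints OOMKilled when the last container termination was memory-related:
  kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'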
Step 6: Download All Diagnostic Information
Save all collected information to local files for analysis:
- Download the pod description:
  kubectl describe pod <pod-name> -n <namespace> > pod_describe.txt
- Download current logs:
  kubectl logs <pod-name> -n <namespace> > pod_logs_current.txt
- Download previous logs (if applicable):
  kubectl logs <pod-name> --previous -n <namespace> > pod_logs_previous.txt
- Download events:
  kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp -n <namespace> > pod_events.txt
- Download the thread dump (for Java applications):
  kubectl cp <pod-name>:thread_dump.txt -n <namespace> ./thread_dump.txt
- Download the heap dump (for Java applications):
  kubectl cp <pod-name>:heapdump.hprof -n <namespace> ./heapdump.hprof
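The same collection can be scripted; a minimal sketch, assuming the POD and NS variables are set to the pod and namespace identified in Step 2 (the values shown are hypothetical):
  #!/usr/bin/env bash
  # Collect the Step 6 diagnostics into one timestamped directory.
  set -euo pipefail
  POD="my-app-7d4b9c5f8-x2k9p"   # hypothetical pod name
  NS="production"                # hypothetical namespace
  OUT="diag-${POD}-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "${OUT}"
  kubectl describe pod "${POD}" -n "${NS}" > "${OUT}/pod_describe.txt"
  kubectl logs "${POD}" -n "${NS}" > "${OUT}/pod_logs_current.txt"
  # Previous logs exist only after a restart, so do not fail if they are absent.
  kubectl logs "${POD}" --previous -n "${NS}" > "${OUT}/pod_logs_previous.txt" || true
  kubectl get events --field-selector involvedObject.name="${POD}" --sort-by=.metadata.creationTimestamp -n "${NS}" > "${OUT}/pod_events.txt"
  echo "Diagnostics written to ${OUT}/"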
Step 7: Analyze and Provide Solutions
Based on the collected information, identify the issue and provide solutions:
- OOMKilled: Increase memory limits or reduce the application's memory usage
- CrashLoopBackOff: Check logs for application errors, verify configuration and startup commands
- Pending: Check node resources, node selectors, and taints/tolerations
- ImagePullBackOff: Verify the image name and tag, registry credentials, and network connectivity
- Network issues: Check Services, Endpoints, and NetworkPolicies
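For the OOMKilled case specifically, one hedged way to raise the memory limit on the owning Deployment without hand-editing YAML (the deployment name and sizes are placeholders; choose values based on the usage observed in Step 5, then watch the pod for further restarts):
  kubectl set resources deployment <deployment-name> -n <namespace> --requests=memory=512Mi --limits=memory=1Gi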
Common Commands Reference
- kubectl get pods -n <namespace>: List pods
- kubectl describe pod <pod-name> -n <namespace>: Detailed pod info
- kubectl logs <pod-name> -n <namespace>: Current logs
- kubectl logs <pod-name> --previous -n <namespace>: Previous logs
- kubectl get events -n <namespace>: Events
- kubectl top pods -n <namespace>: Resource usage
- kubectl exec -it <pod-name> -n <namespace> -- /bin/bash: Debug shell
- kubectl cp <pod-name>:<remote-path> -n <namespace> <local-path>: Copy files from the pod to the local machine
Best Practices
- Execute commands sequentially and analyze each output before proceeding
- Always download logs for complete diagnostic information
- Check events first for quick diagnosis
- Use the --previous flag to retrieve logs from before the most recent restart
- Monitor resource usage proactively
- Generate thread dumps for Java applications when investigating performance issues or deadlocks
- Generate heap dumps for Java applications when investigating memory leaks or high memory usage