Kubernetes Pod Troubleshooting

This skill provides an interactive, step-by-step approach to troubleshooting common Kubernetes Pod issues in Azure Kubernetes Service (AKS), including restarts, OOM kills, and status problems.

Interactive Troubleshooting Workflow

When a user reports a Pod issue, follow this sequential workflow. Execute each step in order, analyze the output, and proceed based on findings. Continue until all diagnostic information is collected and downloaded.

Step 1: Establish Access to AKS Cluster

Execute the following commands in sequence to set up access:

  1. Authenticate with Azure:

    az login
    
    Wait for successful login confirmation.

  2. List available AKS clusters:

    az aks list --output table
    
    Identify the target cluster name and resource group.

  3. Get cluster credentials (replace with actual values):

    az aks get-credentials --resource-group <resource-group-name> --name <cluster-name>
    

  4. Verify connection:

    kubectl cluster-info
    kubectl config current-context
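
For example, with a hypothetical resource group rg-demo and cluster aks-demo (names are illustrative), the whole sequence looks like this; the --overwrite-existing flag simply refreshes a stale kubeconfig entry if one already exists:

    az login
    az aks list --output table
    az aks get-credentials --resource-group rg-demo --name aks-demo --overwrite-existing
    kubectl cluster-info
    kubectl config current-context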
    

Step 2: Identify Target Pod

  1. List all namespaces:

    kubectl get namespaces
    

  2. List pods in the suspected namespace (or all namespaces):

    kubectl get pods -n <namespace>
    
     or, across all namespaces:
    
    kubectl get pods --all-namespaces
    

  3. Identify problematic pods by looking for (example filter commands follow this list):

     • High restart counts
     • Status: CrashLoopBackOff, Error, or Pending
     • Pods not in the Running state

  4. Note the pod details:

     • Pod name
     • Namespace
     • Current status
     • Restart count
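
To speed this up, kubectl can do some of the filtering and sorting itself. A minimal sketch (the restart-count sort assumes single-container pods; note that a pod in CrashLoopBackOff may still report phase Running, so check restart counts as well):

    # Pods whose phase is not Running (Pending, Failed, Unknown, ...)
    kubectl get pods --all-namespaces --field-selector=status.phase!=Running
    
    # Pods sorted by restart count of their first container (highest last)
    kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'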

Step 3: Gather Pod Status Information

Execute these commands for the identified pod:

  1. Get detailed pod description:

    kubectl describe pod <pod-name> -n <namespace>
    
     Review the status, container states, events, and volumes sections of the output.

  2. Check pod-specific events:

    kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp -n <namespace>
    
    Look for resource constraints, image pull failures, network issues, or scheduling problems.
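
If it is still unclear which pod is at fault, scanning every Warning event in the namespace is a useful extra step; this uses a standard field selector:

    kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.metadata.creationTimestamp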

Step 4: Examine Pod Logs

Execute log retrieval commands:

  1. Get current container logs:

    kubectl logs <pod-name> -n <namespace>
    
    Analyze for error messages, stack traces, and application issues.

  2. If the pod has restarted, get previous container logs:

    kubectl logs <pod-name> --previous -n <namespace>
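
For multi-container pods, or when logs are very large, these optional flags narrow the output (the container name is whatever kubectl describe listed):

    # Logs from one specific container in the pod
    kubectl logs <pod-name> -c <container-name> -n <namespace>
    
    # Only the last 200 lines, or only the last hour
    kubectl logs <pod-name> -n <namespace> --tail=200
    kubectl logs <pod-name> -n <namespace> --since=1h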
    

Step 4.5: Generate Thread Dump (for Java applications)

If the pod runs a Java application and you need to analyze thread states (a single-command alternative is sketched after these steps):

  1. Enter the pod container:

    kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
    

  2. Generate thread dump in the container:

    jstack 1 > thread_dump.txt
    
    Note: 1 is typically the Java process PID in containers.

  3. Exit the pod shell:

    exit
    

  4. Copy thread dump file to local machine:

    kubectl cp <pod-name>:thread_dump.txt -n <namespace> ./thread_dump.txt
    

  5. Analyze the thread dump for:

     • Deadlocks
     • Blocked threads
     • Threads with high CPU usage
     • Overall thread state distribution
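
As an alternative to steps 1-4, the dump can be captured in one non-interactive command by redirecting the output straight to a local file. This assumes jstack is on the container's PATH, i.e. the image ships a JDK rather than a bare JRE:

    kubectl exec <pod-name> -n <namespace> -- jstack 1 > thread_dump.txt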

Step 4.6: Generate Heap Dump (for Java applications)

If the pod runs a Java application and you need to analyze memory usage (a non-interactive variant is sketched after these steps):

  1. Enter the pod container:

    kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
    

  2. Generate heap dump in the container:

    jmap -dump:live,format=b,file=heapdump.hprof 1
    
    Note: 1 is typically the Java process PID in containers. This creates a binary heap dump file.

  3. Exit the pod shell:

    exit
    

  4. Copy heap dump file to local machine:

    kubectl cp <pod-name>:heapdump.hprof -n <namespace> ./heapdump.hprof
    

  5. Analyze the heap dump using tools such as:

     • Eclipse Memory Analyzer (MAT)
     • VisualVM
     • jhat (Java Heap Analysis Tool)

     Look for memory leaks, large objects, and memory usage patterns.
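
Unlike a thread dump, jmap writes its output to a file inside the container, but the same flow works non-interactively; writing to /tmp avoids working-directory surprises when copying, and kubectl cp requires tar to be present in the image (paths below are illustrative):

    kubectl exec <pod-name> -n <namespace> -- jmap -dump:live,format=b,file=/tmp/heapdump.hprof 1
    kubectl exec <pod-name> -n <namespace> -- ls -lh /tmp/heapdump.hprof
    kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof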

Step 5: Check Resource Usage (if applicable)

If resource issues are suspected:

  1. Check resource usage:

    kubectl top pods -n <namespace>
    

  2. Look for OOMKilled in describe output or high memory/CPU usage.
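
kubectl top relies on metrics-server, which AKS installs by default. To confirm an OOM kill directly from the API, the last termination reason of each container can be read with JSONPath:

    kubectl get pod <pod-name> -n <namespace> \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'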

Step 6: Download All Diagnostic Information

Save all collected information to local files for analysis:

  1. Download pod description:

    kubectl describe pod <pod-name> -n <namespace> > pod_describe.txt
    

  2. Download current logs:

    kubectl logs <pod-name> -n <namespace> > pod_logs_current.txt
    

  3. Download previous logs (if applicable):

    kubectl logs <pod-name> --previous -n <namespace> > pod_logs_previous.txt
    

  4. Download events:

    kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp -n <namespace> > pod_events.txt
    

  5. Download thread dump (for Java applications):

    kubectl cp <pod-name>:thread_dump.txt -n <namespace> ./thread_dump.txt
    

  6. Download heap dump (for Java applications):

    kubectl cp <pod-name>:heapdump.hprof -n <namespace> ./heapdump.hprof
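
The whole collection can also be scripted in one pass; a minimal sketch, assuming POD and NS are set to the pod name and namespace (the Java dump steps are omitted here):

    POD=<pod-name>
    NS=<namespace>
    kubectl describe pod "$POD" -n "$NS" > pod_describe.txt
    kubectl logs "$POD" -n "$NS" > pod_logs_current.txt
    kubectl logs "$POD" --previous -n "$NS" > pod_logs_previous.txt 2>/dev/null || true
    kubectl get events --field-selector involvedObject.name="$POD" \
      --sort-by=.metadata.creationTimestamp -n "$NS" > pod_events.txt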
    

Step 7: Analyze and Provide Solutions

Based on the collected information, identify the issue and provide solutions:

  • OOMKilled: Increase memory limits (an example change is sketched after this list) or reduce the application's memory usage
  • CrashLoopBackOff: Check logs for application errors, verify configuration
  • Pending: Check node resources, node selectors/affinity, and taints/tolerations
  • ImagePullBackOff: Verify the image name and tag, registry credentials, and network connectivity
  • Network Issues: Check Services, Endpoints, and NetworkPolicies
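
For the OOMKilled case, limits are normally changed on the owning workload rather than on the pod itself. A hedged example, assuming the pod is managed by a Deployment and using illustrative values:

    kubectl set resources deployment <deployment-name> -n <namespace> \
      --requests=memory=512Mi --limits=memory=1Gi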

Common Commands Reference

  • kubectl get pods -n <namespace>: List pods
  • kubectl describe pod <pod-name> -n <namespace>: Detailed pod info
  • kubectl logs <pod-name> -n <namespace>: Current logs
  • kubectl logs <pod-name> --previous -n <namespace>: Previous logs
  • kubectl get events -n <namespace>: Events
  • kubectl top pods -n <namespace>: Resource usage
  • kubectl exec -it <pod-name> -n <namespace> -- /bin/bash: Debug shell
  • kubectl cp <pod-name>:<remote-path> -n <namespace> <local-path>: Copy files from pod to local

Best Practices

  • Execute commands sequentially and analyze each output before proceeding
  • Always download logs for complete diagnostic information
  • Check events first for quick diagnosis
  • Use the --previous flag to retrieve logs from before a restart
  • Monitor resource usage proactively
  • Generate thread dumps for Java applications when investigating performance issues or deadlocks
  • Generate heap dumps for Java applications when investigating memory leaks or high memory usage