Kubernetes Pod Troubleshooting

This skill provides an interactive, step-by-step approach to troubleshooting common Kubernetes Pod issues in Azure Kubernetes Service (AKS), including restarts, OOM kills, and status problems.

Interactive Troubleshooting Workflow

When a user reports a Pod issue, follow this sequential workflow. Execute each step in order, analyze the output, and proceed based on findings. Continue until all diagnostic information is collected and downloaded.

Step 1: Establish Access to AKS Cluster

Execute the following commands in sequence to set up access:

  1. Authenticate with Azure:

    az login
    
    Wait for successful login confirmation.

  2. List available AKS clusters:

    az aks list --output table
    
    Identify the target cluster name and resource group.

  3. Get cluster credentials (replace with actual values):

    az aks get-credentials --resource-group <resource-group-name> --name <cluster-name>
    

  4. Verify connection:

    kubectl cluster-info
    kubectl config current-context
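
For example, with a hypothetical resource group rg-demo and cluster aks-demo (names are illustrative), the whole sequence looks like this; the --overwrite-existing flag simply refreshes a stale kubeconfig entry if one already exists:

    az login
    az aks list --output table
    az aks get-credentials --resource-group rg-demo --name aks-demo --overwrite-existing
    kubectl cluster-info
    kubectl config current-context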
    

Step 2: Identify Target Pod

  1. List all namespaces:

    kubectl get namespaces
    

  2. List pods in the suspected namespace (or all namespaces):

    kubectl get pods -n <namespace>
    
     or, across all namespaces:
    
    kubectl get pods --all-namespaces
    

  3. Identify problematic pods by looking for (example filter commands follow this list):

     • High restart counts
     • Status: CrashLoopBackOff, Error, or Pending
     • Pods not in the Running state

  4. Note the pod details:

     • Pod name
     • Namespace
     • Current status
     • Restart count
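
To speed this up, kubectl can do some of the filtering and sorting itself. A minimal sketch (the restart-count sort assumes single-container pods; note that a pod in CrashLoopBackOff may still report phase Running, so check restart counts as well):

    # Pods whose phase is not Running (Pending, Failed, Unknown, ...)
    kubectl get pods --all-namespaces --field-selector=status.phase!=Running
    
    # Pods sorted by restart count of their first container (highest last)
    kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'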

Step 3: Gather Pod Status Information

Execute these commands for the identified pod:

  1. Get detailed pod description:

    kubectl describe pod <pod-name> -n <namespace>
    
     Review the status, container states, events, and volumes sections of the output.

  2. Check pod-specific events:

    kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp -n <namespace>
    
    Look for resource constraints, image pull failures, network issues, or scheduling problems.
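
If it is still unclear which pod is at fault, scanning every Warning event in the namespace is a useful extra step; this uses a standard field selector:

    kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.metadata.creationTimestamp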

Step 4: Examine Pod Logs

Execute log retrieval commands:

  1. Get current container logs:

    kubectl logs <pod-name> -n <namespace>
    
    Analyze for error messages, stack traces, and application issues.

  2. If the pod has restarted, get previous container logs:

    kubectl logs <pod-name> --previous -n <namespace>
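
For multi-container pods, or when logs are very large, these optional flags narrow the output (the container name is whatever kubectl describe listed):

    # Logs from one specific container in the pod
    kubectl logs <pod-name> -c <container-name> -n <namespace>
    
    # Only the last 200 lines, or only the last hour
    kubectl logs <pod-name> -n <namespace> --tail=200
    kubectl logs <pod-name> -n <namespace> --since=1h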
    

Step 4.5: Generate Thread Dump (for Java applications)

If the pod runs a Java application and you need to analyze thread states (a single-command alternative is sketched after these steps):

  1. Enter the pod container:

    kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
    

  2. Generate thread dump in the container:

    jstack 1 > thread_dump.txt
    
    Note: 1 is typically the Java process PID in containers.

  3. Exit the pod shell:

    exit
    

  4. Copy thread dump file to local machine:

    kubectl cp <pod-name>:thread_dump.txt -n <namespace> ./thread_dump.txt
    

  5. Analyze the thread dump for:

     • Deadlocks
     • Blocked threads
     • Threads with high CPU usage
     • Overall thread state distribution
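
As an alternative to steps 1-4, the dump can be captured in one non-interactive command by redirecting the output straight to a local file. This assumes jstack is on the container's PATH, i.e. the image ships a JDK rather than a bare JRE:

    kubectl exec <pod-name> -n <namespace> -- jstack 1 > thread_dump.txt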

Step 4.6: Generate Heap Dump (for Java applications)

If the pod runs a Java application and you need to analyze memory usage (a non-interactive variant is sketched after these steps):

  1. Enter the pod container:

    kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
    

  2. Generate heap dump in the container:

    jmap -dump:live,format=b,file=heapdump.hprof 1
    
    Note: 1 is typically the Java process PID in containers. This creates a binary heap dump file.

  3. Exit the pod shell:

    exit
    

  4. Copy heap dump file to local machine:

    kubectl cp <pod-name>:heapdump.hprof -n <namespace> ./heapdump.hprof
    

  5. Analyze the heap dump using tools such as:

     • Eclipse Memory Analyzer (MAT)
     • VisualVM
     • jhat (Java Heap Analysis Tool)

     Look for memory leaks, large objects, and memory usage patterns.
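
Unlike a thread dump, jmap writes its output to a file inside the container, but the same flow works non-interactively; writing to /tmp avoids working-directory surprises when copying, and kubectl cp requires tar to be present in the image (paths below are illustrative):

    kubectl exec <pod-name> -n <namespace> -- jmap -dump:live,format=b,file=/tmp/heapdump.hprof 1
    kubectl exec <pod-name> -n <namespace> -- ls -lh /tmp/heapdump.hprof
    kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof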

Step 5: Check Resource Usage (if applicable)

If resource issues are suspected:

  1. Check resource usage:

    kubectl top pods -n <namespace>
    

  2. Look for OOMKilled in describe output or high memory/CPU usage.
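
kubectl top relies on metrics-server, which AKS installs by default. To confirm an OOM kill directly from the API, the last termination reason of each container can be read with JSONPath:

    kubectl get pod <pod-name> -n <namespace> \
      -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'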

Step 6: Download All Diagnostic Information

Save all collected information to local files for analysis:

  1. Download pod description:

    kubectl describe pod <pod-name> -n <namespace> > pod_describe.txt
    

  2. Download current logs:

    kubectl logs <pod-name> -n <namespace> > pod_logs_current.txt
    

  3. Download previous logs (if applicable):

    kubectl logs <pod-name> --previous -n <namespace> > pod_logs_previous.txt
    

  4. Download events:

    kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by=.metadata.creationTimestamp -n <namespace> > pod_events.txt
    

  5. Download thread dump (for Java applications):

    kubectl cp <pod-name>:thread_dump.txt -n <namespace> ./thread_dump.txt
    

  6. Download heap dump (for Java applications):

    kubectl cp <pod-name>:heapdump.hprof -n <namespace> ./heapdump.hprof
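
The whole collection can also be scripted in one pass; a minimal sketch, assuming POD and NS are set to the pod name and namespace (the Java dump steps are omitted here):

    POD=<pod-name>
    NS=<namespace>
    kubectl describe pod "$POD" -n "$NS" > pod_describe.txt
    kubectl logs "$POD" -n "$NS" > pod_logs_current.txt
    kubectl logs "$POD" --previous -n "$NS" > pod_logs_previous.txt 2>/dev/null || true
    kubectl get events --field-selector involvedObject.name="$POD" \
      --sort-by=.metadata.creationTimestamp -n "$NS" > pod_events.txt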
    

Step 7: Analyze and Provide Solutions

Based on the collected information, identify the issue and provide solutions:

  • OOMKilled: Increase memory limits (an example change is sketched after this list) or reduce the application's memory usage
  • CrashLoopBackOff: Check logs for application errors, verify configuration
  • Pending: Check node resources, node selectors/affinity, and taints/tolerations
  • ImagePullBackOff: Verify the image name and tag, registry credentials, and network connectivity
  • Network Issues: Check Services, Endpoints, and NetworkPolicies
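
For the OOMKilled case, limits are normally changed on the owning workload rather than on the pod itself. A hedged example, assuming the pod is managed by a Deployment and using illustrative values:

    kubectl set resources deployment <deployment-name> -n <namespace> \
      --requests=memory=512Mi --limits=memory=1Gi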

Common Commands Reference

  • kubectl get pods -n <namespace>: List pods
  • kubectl describe pod <pod-name> -n <namespace>: Detailed pod info
  • kubectl logs <pod-name> -n <namespace>: Current logs
  • kubectl logs <pod-name> --previous -n <namespace>: Previous logs
  • kubectl get events -n <namespace>: Events
  • kubectl top pods -n <namespace>: Resource usage
  • kubectl exec -it <pod-name> -n <namespace> -- /bin/bash: Debug shell
  • kubectl cp <pod-name>:<remote-path> -n <namespace> <local-path>: Copy files from pod to local

Best Practices

  • Execute commands sequentially and analyze each output before proceeding
  • Always download logs for complete diagnostic information
  • Check events first for quick diagnosis
  • Use the --previous flag to retrieve logs from before a restart
  • Monitor resource usage proactively
  • Generate thread dumps for Java applications when investigating performance issues or deadlocks
  • Generate heap dumps for Java applications when investigating memory leaks or high memory usage