EKS Troubleshooting

This section collects the errors most often seen when operating EKS and how to handle them: Pods that will not run, Nodes that fail to join the cluster, networking issues, and problems with IAM, storage, and cluster upgrades.

📌 1. General Debugging Workflow

1.1. Troubleshooting mindset

Always work from the overview down to the details: the cluster as a whole first, then individual Pods and Nodes, then recent events.

1.2. Basic debug commands

# Cluster overview
kubectl cluster-info
kubectl get nodes -o wide
kubectl get pods --all-namespaces | grep -v Running

# Details of a single Pod
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous  # logs from the previous container instance, before the crash

# Details of a single Node
kubectl describe node <node-name>

# Recent events (sorted by time)
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>
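
The `grep -v Running` filter in the overview above also lets the header line and Completed Pods through. A minimal sketch of a stricter filter, assuming the default column layout of `kubectl get pods --all-namespaces` (STATUS in the fourth column); `unhealthy_pods` is a hypothetical helper name, not a kubectl feature:

```shell
# unhealthy_pods: read `kubectl get pods --all-namespaces` output on stdin and
# print only the Pods whose STATUS (4th column) is neither Running nor Completed.
# The header line (NR == 1) is skipped.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Usage against a live cluster (needs kubectl access):
#   kubectl get pods --all-namespaces | unhealthy_pods
```

A Pod that is Running but not Ready (for example `0/1`) still passes this filter, so the READY column is worth a glance as well.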

🔴 2. Pod Issues

2.1. Pod Pending

Symptom: the Pod stays in the Pending state and is never scheduled onto a Node.

Common causes:

Cause	How to check	Solution
Insufficient resources (CPU/Memory)	`kubectl describe pod` → Events: "Insufficient cpu"	Lower `requests` or add Nodes
Node taint not tolerated	`kubectl describe pod` → Events: "didn't tolerate"	Add `tolerations` to the Pod spec
nodeSelector/affinity does not match	`kubectl describe pod` → Events: "didn't match"	Fix the selector or label the Node
PVC cannot bind	`kubectl get pvc` → Pending	Check the StorageClass and CSI driver
Not enough IPs (VPC CNI)	`kubectl describe pod` → Events: "failed to assign an IP"	Add subnets with free addresses or use prefix delegation

Debug step-by-step:

# 1. View the Pod's events
kubectl describe pod <pod-name> | tail -20

# 2. Check the resources already allocated on each Node
kubectl describe nodes | grep -A 5 "Allocated resources"

# 3. Check the taints on each Node
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# 4. Check PVCs if the Pod uses a volume
kubectl get pvc -n <namespace>
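
The cause table above is essentially a lookup from event text to a likely cause, which can be sketched as a small shell function. `classify_pending` is a hypothetical helper. Most patterns are substrings taken from the table; two are assumptions about common scheduler wording ("untolerated taint", which newer Kubernetes releases print, and "unbound immediate PersistentVolumeClaims"), so adjust them to what your cluster actually reports:

```shell
# classify_pending MESSAGE: map a Pod event message to a likely cause.
classify_pending() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "resources: lower requests or add Nodes" ;;
    *"didn't tolerate"*|*"untolerated taint"*)
      echo "taint: add tolerations to the Pod spec" ;;
    *"didn't match"*)
      echo "selector: fix nodeSelector/affinity or label the Node" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "pvc: check the StorageClass and CSI driver" ;;
    *"failed to assign an IP"*)
      echo "ip: subnet or ENI addresses exhausted" ;;
    *)
      echo "unknown: read the full event text" ;;
  esac
}

# Usage against a live cluster (placeholders as in the steps above):
#   msg=$(kubectl get events -n <namespace> \
#           --field-selector involvedObject.name=<pod-name> \
#           -o jsonpath='{.items[*].message}')
#   classify_pending "$msg"
```

Only the first matching pattern is reported, so a Pod with several problems still needs its full event list read.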

2.2. CrashLoopBackOff

Symptom: the Pod restarts continuously with status CrashLoopBackOff.

Common causes:

Cause	How to check	Solution
Application error/crash	`kubectl logs <pod>`	Fix the application code
Wrong configuration (env, mount)	`kubectl logs <pod> --previous`	Check the ConfigMap and Secret
Liveness probe failing	`kubectl describe pod` → Events	Increase `initialDelaySeconds`, fix the probe
OOMKilled	`kubectl describe pod` → Last State: OOMKilled	Increase `resources.limits.memory`
Permission denied (filesystem)	`kubectl logs <pod>`	Fix `securityContext.runAsUser`

Debug step-by-step:

# 1. View the current container's logs
kubectl logs <pod-name> -n <namespace>

# 2. View the logs of the previous container instance (before the crash)
kubectl logs <pod-name> -n <namespace> --previous

# 3. Check the exit code
kubectl describe pod <pod-name> | grep -A 3 "Last State"
# Exit code 137 = SIGKILL (128 + 9), most often OOMKilled
# Exit code 1   = Application error
# Exit code 0   = Container finished (expected for Jobs/CronJobs)

# 4. Exec into the container to debug (if it stays up long enough)
kubectl exec -it <pod-name> -- /bin/sh

# 5. Use an ephemeral debug container (if the container has no shell)
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
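
The exit codes in step 3 follow the usual shell convention: codes above 128 mean the process was killed by signal (code - 128). A small sketch that decodes them; `explain_exit_code` is a hypothetical helper, and the wording of each explanation is illustrative:

```shell
# explain_exit_code CODE: turn a container exit code into a short explanation.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -eq 137 ]; then
    echo "SIGKILL (128 + 9): usually OOMKilled, confirm with the Reason field"
  elif [ "$code" -eq 143 ]; then
    echo "SIGTERM (128 + 15): terminated, for example during a rollout or drain"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit code $code)"
  fi
}

# Usage against a live cluster:
#   code=$(kubectl get pod <pod-name> -n <namespace> \
#            -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```

The jsonpath in the usage comment reads the first container only; for multi-container Pods, change the index or filter by container name.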

2.3. ImagePullBackOff

Symptom: the Pod cannot pull its container image.

# Check the details
kubectl describe pod <pod-name> | grep -A 5 "Events"

Cause	Message	Solution
Image does not exist	"manifest unknown"	Check the image name:tag
ECR authentication failing	"no basic auth credentials"	Check the node IAM role (image pulls use the node role, not IRSA)
Docker Hub rate limit	"toomanyrequests"	Use an ECR pull-through cache
Private registry, missing secret	"unauthorized"	Create `imagePullSecrets`

Fix ECR authentication:

# Check that the Node role has permission to pull from ECR
# (the managed policy name contains "ContainerRegistry", not "ECR")
aws iam list-attached-role-policies --role-name MyEKSNodeRole | grep ContainerRegistry

# The Node role needs the policy AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy \
  --role-name MyEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

2.4. OOMKilled

Symptom: the container is killed for using too much memory.

# Confirm OOMKilled
kubectl describe pod <pod-name> | grep -A 3 "Last State"
# Reason: OOMKilled, Exit Code: 137

# View current memory usage (requires metrics-server)
kubectl top pod <pod-name>

# Compare with the limits
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[0].resources}'

Solution:

resources:
  requests:
    memory: 512Mi   # Raise to match actual usage
  limits:
    memory: 1Gi     # Raise the limit, weighed against total cluster resources

Tips for handling OOMKilled

Run VPA in Off mode to see its recommendations before setting values.
Set limits ≥ 1.5x requests to leave headroom.
Check the application for memory leaks (Java heap, Node.js).
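
As a worked example of the 1.5x rule of thumb above, the arithmetic can be sketched with two hypothetical helpers. Both work in whole Mi and assume you have already read the numbers from `kubectl top pod` and the Pod spec:

```shell
# limit_from_request_mi REQUEST_MI: 1.5x the request, rounded up to a whole Mi.
limit_from_request_mi() {
  echo $(( ($1 * 3 + 1) / 2 ))
}

# usage_pct USAGE_MI LIMIT_MI: share of the limit currently in use, as a whole
# percent (integer division, so the result is rounded down).
usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Example: a 512Mi request gives a minimum limit of 768Mi, so the 1Gi limit in
# the YAML above leaves more headroom than the rule requires.
#   limit_from_request_mi 512
#   usage_pct 900 1024
```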

🖥️ 3. Node Issues

3.1. Node NotReady

Symptom: the Node shows the NotReady status.

# Check status
kubectl get nodes
kubectl describe node <node-name> | grep -A 10 "Conditions"

Cause	How to check	Solution
kubelet stopped	SSH → `systemctl status kubelet`	`systemctl restart kubelet`
Disk pressure	Conditions → DiskPressure=True	Clean up images/logs, increase disk size
Memory pressure	Conditions → MemoryPressure=True	Drain the Node, check Pods for leaks
Network unreachable	ping the Node IP	Check the Security Group and VPC
containerd/docker crash	SSH → `systemctl status containerd`	Restart containerd

Debug from the Node (SSH or SSM):

# Connect via SSM (no SSH key needed)
aws ssm start-session --target <instance-id>

# Check kubelet
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 minutes ago" | tail -50

# Check disk (the containerd directory is readable by root only, so the glob must expand as root)
df -h
sudo sh -c 'du -sh /var/lib/containerd/*'

# Check containerd
sudo systemctl status containerd
sudo crictl ps
sudo crictl images
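
When chasing DiskPressure, it helps to list only the filesystems that are nearly full. A minimal sketch that parses `df -P` output; `full_filesystems` is a hypothetical helper, and the default threshold of 85% is an arbitrary example value, not the kubelet's eviction threshold:

```shell
# full_filesystems [THRESHOLD]: read `df -P` output on stdin and print
# "<mount point> <capacity>" for every filesystem at or above THRESHOLD percent.
# In POSIX df output, column 5 is Capacity and column 6 is the mount point.
full_filesystems() {
  awk -v limit="${1:-85}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)              # "90%" -> "90"
      if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on the Node:
#   df -P | full_filesystems 85
```

Mount points that contain spaces are not handled by this sketch.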

3.2. Node does not join the cluster

Symptom: the EC2 instance has launched but does not appear in kubectl get nodes.

Debug:

# Check the Node Group health
aws eks describe-nodegroup \
  --cluster-name my-eks \
  --nodegroup-name my-nodegroup \
  --query 'nodegroup.health'

# Check the ASG instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names <asg-name> \
  --query 'AutoScalingGroups[0].Instances[*].[InstanceId,LifecycleState,HealthStatus]'

Cause	Solution
Security Group blocks traffic to the Control Plane	Open port 443 from the Node SG to the Cluster SG
Missing IAM permissions on the Node Role	Attach: AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, AmazonEC2ContainerRegistryReadOnly
Subnet has no route to the EKS endpoint	Check the route table and NAT Gateway
User data / bootstrap script error	Check `/var/log/cloud-init-output.log` on the instance
Incompatible AMI	Use the EKS Optimized AMI for the matching version

3.3. Not enough IP addresses

Symptom: Pod Pending with "failed to assign an IP address".

The VPC CNI assigns each Pod an IP directly from the subnet, so small subnets run out of addresses quickly.

# Check available IPs on each subnet
aws ec2 describe-subnets \
  --subnet-ids subnet-xxx \
  --query 'Subnets[*].[SubnetId,AvailableIpAddressCount,CidrBlock]'

# Check the Node's Pod capacity (bounded by its ENI/IP limits)
kubectl get node <node-name> -o jsonpath='{.status.capacity.pods}'

Solutions:

Option	Description
Prefix delegation	Assigns /28 prefixes instead of individual IPs: ~16x more Pod IPs per ENI (Nitro-based instances only)
Custom networking	Use a secondary CIDR for Pods (100.64.0.0/16)
Larger subnets	Add a secondary CIDR to the VPC and create new subnets (an existing subnet's CIDR cannot be resized)

Enable prefix delegation:

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1
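
To see what the "~16x" figure means for a given instance type, the Pod capacity can be estimated from the ENI limits. This is a sketch of the commonly used formula, not an official calculator: `max_pods` is a hypothetical helper, and the ENI and IP counts must be looked up for your instance type (the example values 3 and 10 are the limits usually quoted for m5.large):

```shell
# max_pods ENIS IPS_PER_ENI [prefix]
#   secondary-IP mode:  ENIs * (IPs per ENI - 1) + 2
#   prefix delegation:  ENIs * (IPs per ENI - 1) * 16 + 2
max_pods() {
  enis=$1
  ips=$2
  mode=${3:-secondary}
  per_eni=$(( ips - 1 ))          # one address per ENI is the ENI's own primary IP
  if [ "$mode" = "prefix" ]; then
    per_eni=$(( per_eni * 16 ))   # each slot holds a /28 prefix = 16 addresses
  fi
  echo $(( enis * per_eni + 2 ))  # +2 for Pods that use host networking
}

# Example with 3 ENIs and 10 IPs per ENI:
#   max_pods 3 10          # secondary-IP mode
#   max_pods 3 10 prefix   # prefix delegation
```

The prefix-delegation result is a theoretical ceiling. The kubelet's max-pods setting has to allow the higher number too, and in practice it is usually capped well below this value on small instances.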

🌐 4. Networking Issues

4.1. Pod cannot reach the outside (Internet)

# Test from inside the Pod
kubectl exec -it <pod-name> -- curl -I https://google.com
kubectl exec -it <pod-name> -- nslookup google.com

Cause	Solution
Missing NAT Gateway	Create a NAT Gateway for the private subnet
Wrong route table	Check for 0.0.0.0/0 → NAT Gateway
Security Group blocks outbound	Open an outbound rule
Network Policy denies egress	Check the NetworkPolicy
CoreDNS not working	`kubectl get pods -n kube-system -l k8s-app=kube-dns`

4.2. Service unreachable between Pods

# Test DNS resolution
kubectl exec -it <pod-name> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Test connectivity
kubectl exec -it <pod-name> -- curl <service-name>.<namespace>:80

Cause	Solution
Service selector does not match Pod labels	`kubectl describe svc` → check Endpoints
Empty Endpoints	Pod is not Ready or has the wrong labels
kube-proxy failing	`kubectl get pods -n kube-system -l k8s-app=kube-proxy`
Network Policy blocking	Check ingress/egress NetworkPolicy

Quick check Endpoints:

# Endpoints listed = the Service points at the right Pods
kubectl get endpoints <service-name> -n <namespace>

# No endpoints = the selector does not match (or no matching Pod is Ready)
kubectl describe svc <service-name> -n <namespace>
kubectl get pods -n <namespace> --show-labels
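
Comparing a selector against Pod labels by eye is error-prone when the label list is long. A minimal sketch of the comparison, assuming both values are copied as comma-separated `key=value` lists (the format shown in the `Selector:` line of `kubectl describe svc` and in the LABELS column of `--show-labels`); `selector_matches` is a hypothetical helper:

```shell
# selector_matches SELECTOR LABELS
# Prints "match" if every key=value pair in SELECTOR also appears in LABELS,
# otherwise prints the first missing pair and returns 1.
selector_matches() {
  labels=",$2,"                    # pad with commas so each pair matches whole
  old_ifs=$IFS
  IFS=,
  for pair in $1; do               # split the selector on commas
    case "$labels" in
      *",$pair,"*) ;;              # pair present, keep going
      *) IFS=$old_ifs; echo "missing: $pair"; return 1 ;;
    esac
  done
  IFS=$old_ifs
  echo "match"
}

# Example:
#   selector_matches 'app=web,tier=fe' 'app=web,pod-template-hash=abc,tier=fe'
```

On a live cluster, `kubectl get pods -n <namespace> -l <selector>` performs the same check server-side; the function is only useful when working from copied output.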

4.3. CoreDNS Issues

Symptom: DNS resolution is slow or fails.

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Test DNS from a throwaway debug Pod
kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default

Common solutions:

# Increase CoreDNS replicas for a large cluster
kubectl scale deployment coredns -n kube-system --replicas=4

# Or use NodeLocal DNSCache,
# which reduces DNS latency by caching lookups on each Node

4.4. Load Balancer not working

ALB/NLB is not created:

# Check AWS Load Balancer Controller
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=50

# Check Ingress status
kubectl describe ingress <ingress-name>

# Check target group health
aws elbv2 describe-target-health --target-group-arn <tg-arn>

Cause	Solution
Controller not installed	Install the AWS Load Balancer Controller
Missing IAM permissions	Check the IRSA policy
Subnet missing tags	Tag subnets: `kubernetes.io/role/elb=1` (public) or `kubernetes.io/role/internal-elb=1` (private)
Wrong IngressClass	Set `ingressClassName: alb`

🔑 5. IAM & Authentication Issues

5.1. Cannot access the cluster with kubectl

Symptom: error: You must be logged in to the server (Unauthorized)

# Update kubeconfig
aws eks update-kubeconfig --name my-eks --region ap-southeast-1

# Check the current identity
aws sts get-caller-identity

# Check access entries (EKS API mode)
aws eks list-access-entries --cluster-name my-eks

Cause	Solution
Wrong AWS profile/role	`export AWS_PROFILE=correct-profile`
No access entry yet	Create an access entry for the IAM user/role
Incorrect aws-auth ConfigMap	Check `kubectl get cm aws-auth -n kube-system -o yaml`
Token expired	`aws eks get-token --cluster-name my-eks`
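
One mismatch that is easy to miss when comparing identities: with an assumed role, `aws sts get-caller-identity` prints an STS assumed-role ARN, while access entries and the aws-auth ConfigMap refer to the IAM role ARN. A sketch of the conversion; `role_arn_from_sts` is a hypothetical helper and the account ID and role name in the example are made up:

```shell
# role_arn_from_sts ARN: convert
#   arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
# into
#   arn:aws:iam::<account>:role/<role-name>
# Any other input (for example an IAM user ARN) is printed unchanged.
role_arn_from_sts() {
  printf '%s\n' "$1" |
    sed -E 's#^arn:(aws[a-z-]*):sts::([0-9]+):assumed-role/([^/]+)/.*$#arn:\1:iam::\2:role/\3#'
}

# Usage:
#   caller=$(aws sts get-caller-identity --query Arn --output text)
#   role_arn_from_sts "$caller"
#   aws eks list-access-entries --cluster-name my-eks   # look for that ARN here
```

The assumed-role ARN does not carry the role's path, so for roles created under a path the result has to be checked against `aws iam get-role`.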

5.2. IRSA (IAM Roles for Service Accounts) not working

Symptom: the Pod gets "AccessDenied" when calling AWS APIs.

# Check the ServiceAccount annotation
kubectl get sa <sa-name> -n <namespace> -o yaml | grep eks.amazonaws.com

# Check env in the Pod (AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE must be set)
kubectl exec -it <pod-name> -- env | grep AWS

# Test from inside the Pod (the image must include the AWS CLI)
kubectl exec -it <pod-name> -- aws sts get-caller-identity

Cause	Solution
ServiceAccount missing the `eks.amazonaws.com/role-arn` annotation	Add the correct annotation
OIDC Provider not created	Create an OIDC provider for the cluster
Wrong trust policy	Check the Condition in the trust policy (issuer URL, service account)
Pod not using the right ServiceAccount	Set `serviceAccountName` in the Pod spec

Check the trust policy:

aws iam get-role --role-name MyRole --query 'Role.AssumeRolePolicyDocument'
# Verify: Federated = OIDC provider ARN
# Condition StringEquals: sub = system:serviceaccount:<namespace>:<sa-name>
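
Typos in the condition keys are a frequent cause of a broken trust policy, so it can help to generate the expected strings and compare them with the `get-role` output. A minimal sketch: `irsa_expected` is a hypothetical helper, the issuer URL in the example is a placeholder, and the `aud` line reflects the audience commonly used for IRSA rather than something stated in this guide:

```shell
# irsa_expected ISSUER_URL NAMESPACE SERVICE_ACCOUNT
# Prints the condition key/value pairs a typical IRSA trust policy contains.
irsa_expected() {
  issuer=${1#https://}   # condition keys use the issuer without the scheme
  echo "${issuer}:sub = system:serviceaccount:$2:$3"
  echo "${issuer}:aud = sts.amazonaws.com"
}

# Usage:
#   issuer=$(aws eks describe-cluster --name my-eks \
#              --query 'cluster.identity.oidc.issuer' --output text)
#   irsa_expected "$issuer" <namespace> <sa-name>
```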

💾 6. Storage Issues

6.1. PVC Pending

kubectl describe pvc <pvc-name> -n <namespace>

Cause	Message	Solution
CSI Driver not installed	"no persistent volumes available"	Install the EBS/EFS CSI Driver
StorageClass does not exist	"storageclass not found"	Create the StorageClass
IAM permissions missing	"could not create volume"	Check IRSA for the CSI Driver
EBS: volume in a different AZ	"volume is in AZ-a, node is in AZ-b"	Set `volumeBindingMode: WaitForFirstConsumer` on the StorageClass
EFS: SG blocks NFS	"mount timeout"	Open port 2049

6.2. Volume mount failed

kubectl describe pod <pod-name> | grep -A 10 "Events"
# "Unable to attach or mount volumes"
# "FailedMount"

Cause	Solution
Volume still attached to another Node	Wait for the force detach or delete the old Pod
Filesystem corrupted	Create a new volume from a snapshot
Permission denied	Set `fsGroup` in `securityContext`

# Fix permission denied
spec:
  securityContext:
    fsGroup: 1000
  containers:
    - name: app
      securityContext:
        runAsUser: 1000

🔄 7. Cluster Upgrade Issues

7.1. Safe upgrade procedure

7.2. Before upgrading

# Check the current version
# (--short was removed from recent kubectl releases; plain `kubectl version` prints the short form)
kubectl version
aws eks describe-cluster --name my-eks --query 'cluster.version'

# Check addon compatibility
aws eks describe-addon-versions --kubernetes-version 1.30 --addon-name vpc-cni

# Check deprecated APIs
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis

# Or use pluto to scan for deprecated APIs
# https://github.com/FairwindsOps/pluto
pluto detect-all-in-cluster

7.3. Common upgrade errors

Error	Cause	Solution
Addon incompatible	Addon version does not support the new K8s version	Move the addon to a version that supports the target K8s version
PDB blocking drain	PodDisruptionBudget prevents draining the Node	Temporarily relax the PDB or increase replicas
Deprecated API	Manifest uses an API that has been removed	Update apiVersion in the manifest
Webhook blocking	Admission webhook rejects Pods	Check ValidatingWebhookConfiguration

7.4. Upgrade Node Group (rolling update)

# Check node group version
aws eks describe-nodegroup \
  --cluster-name my-eks \
  --nodegroup-name my-workers \
  --query 'nodegroup.version'

# Upgrade node group
aws eks update-nodegroup-version \
  --cluster-name my-eks \
  --nodegroup-name my-workers \
  --kubernetes-version 1.30

# Monitor progress
aws eks describe-update \
  --name my-eks \
  --nodegroup-name my-workers \
  --update-id <update-id>

Upgrade notes

Upgrade only one minor version at a time (1.29 → 1.30, not 1.28 → 1.30).
Upgrade the Control Plane first, then Addons, and finally Node Groups.
Always test on staging before production.
Make sure every Deployment has a PodDisruptionBudget.
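
The one-minor-version rule above is easy to enforce in a script before calling any update command. A minimal sketch, assuming versions in the `major.minor` form that `cluster.version` reports (for example `1.29`); `next_minor_ok` is a hypothetical helper:

```shell
# next_minor_ok CURRENT TARGET
# Succeeds (exit status 0) only if TARGET is exactly one minor version above
# CURRENT within the same major version.
next_minor_ok() {
  cur_major=${1%%.*}
  cur_minor=${1#*.}
  tgt_major=${2%%.*}
  tgt_minor=${2#*.}
  [ "$cur_major" -eq "$tgt_major" ] && [ "$tgt_minor" -eq $(( cur_minor + 1 )) ]
}

# Usage:
#   current=$(aws eks describe-cluster --name my-eks \
#               --query 'cluster.version' --output text)
#   if next_minor_ok "$current" 1.30; then
#     echo "ok to upgrade $current -> 1.30"
#   else
#     echo "refusing: upgrade one minor version at a time"
#   fi
```

Versions with a patch component (such as `1.29.3`) are not handled by this sketch.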

🛠️ 8. Useful Debugging Tools

8.1. kubectl debug

Create an ephemeral container to debug a Pod (no need to modify the Pod spec):

# Debug a running Pod
kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container-name>

# Debug Node
kubectl debug node/<node-name> -it --image=ubuntu

8.2. kubectl-node-shell

Open a root shell on a Node without an SSH key (the plugin starts a privileged Pod on that Node rather than using SSH):

# Install
kubectl krew install node-shell

# Usage
kubectl node-shell <node-name>

8.3. Netshoot: Network debugging

nicolaka/netshoot ships with a full set of network tools (curl, dig, tcpdump, nmap, ...):

# Run a debug Pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash

# From inside the Pod:
curl -v http://my-service.default:80
dig my-service.default.svc.cluster.local
tcpdump -i eth0 port 80

8.4. Stern: Multi-pod log tailing

# Install
brew install stern

# Tail logs from all Pods matching the pattern
stern "my-app-.*" -n default

# Tail logs from a specific container
stern "my-app-.*" -n default --container app

# Tail logs from multiple namespaces
stern ".*" -n default -n staging --since 5m

8.5. k9s: Terminal UI

# Install
brew install k9s

# Run
k9s --context my-eks-context

# Useful shortcuts:
# :pods     → view pods
# :svc      → view services
# :events   → view events
# :logs     → view logs
# /keyword  → filter
# d         → describe
# l         → logs
# s         → shell exec

📋 9. Troubleshooting Checklist

Quick Reference

Pod Pending?
  → kubectl describe pod → check Events
  → kubectl describe nodes → check resources
  → kubectl get pvc → check volume binding

Pod CrashLoopBackOff?
  → kubectl logs <pod> --previous
  → kubectl describe pod → check exit code
  → kubectl top pod → check OOM

Networking fail?
  → kubectl exec → nslookup, curl
  → kubectl get endpoints → check service binding
  → kubectl get pods -n kube-system → check CoreDNS, kube-proxy

Authentication fail?
  → aws sts get-caller-identity
  → kubectl get cm aws-auth -n kube-system
  → aws eks list-access-entries

Storage fail?
  → kubectl describe pvc → check events
  → kubectl get sc → check StorageClass exists
  → kubectl get pods -n kube-system → check CSI driver pods

10. Summary

Effective EKS troubleshooting requires a firm grasp of:

Pod issues - Pending, CrashLoopBackOff, ImagePullBackOff, OOMKilled
Node issues - NotReady, not joining the cluster, IP exhaustion
Networking - DNS, Service connectivity, Load Balancer
IAM - kubectl auth, IRSA
Storage - PVC Pending, mount failed
Upgrade - Compatibility check, rolling update

Useful tools: kubectl debug, netshoot, stern, k9s, pluto.
