EKS Monitoring & Observability

This section covers monitoring and observability options for EKS: CloudWatch Container Insights (AWS-native), the Prometheus + Grafana stack, logging with Fluent Bit, and distributed tracing.

📌 1. Observability Overview in EKS

Observability rests on three pillars:

Pillar	Purpose	Common tools
Metrics	Track CPU, memory, request rate, and latency	Prometheus, CloudWatch
Logs	Collect and analyze logs from containers and nodes	Fluent Bit, CloudWatch Logs
Traces	Follow a request as it flows across microservices	X-Ray, OpenTelemetry, Jaeger

☁️ 2. CloudWatch Container Insights

2.1. What is Container Insights?

CloudWatch Container Insights is an AWS-native monitoring solution that collects metrics and logs from an EKS cluster without installing Prometheus or Grafana. It suits teams that want to stay entirely within the AWS ecosystem.

2.2. Data collected

Cluster-level: node count, Pod count, total CPU/memory
Node-level: CPU, memory, disk, and network per node
Pod-level: CPU, memory, and network per Pod
Container-level: CPU, memory, and restart count per container

2.3. Installing the CloudWatch Agent (enhanced observability)

Enable enhanced observability by installing the amazon-cloudwatch-observability add-on on the cluster:

aws eks create-addon \
  --cluster-name my-eks \
  --addon-name amazon-cloudwatch-observability \
  --configuration-values '{
    "agent": {
      "config": {
        "logs": {
          "metrics_collected": {
            "app_signals": {},
            "kubernetes": {
              "enhanced_container_insights": true
            }
          }
        }
      }
    }
  }'

Or install it with Terraform:

resource "aws_eks_addon" "cloudwatch_observability" {
  cluster_name  = var.cluster_name
  addon_name    = "amazon-cloudwatch-observability"
  addon_version = "v2.1.0-eksbuild.1"

  configuration_values = jsonencode({
    agent = {
      config = {
        logs = {
          metrics_collected = {
            kubernetes = {
              enhanced_container_insights = true
            }
          }
        }
      }
    }
  })
}

2.4. Required IAM policy

The node role needs the CloudWatchAgentServerPolicy managed policy attached:

aws iam attach-role-policy \
  --role-name MyEKSNodeRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

2.5. Viewing metrics in CloudWatch

After installation, open CloudWatch → Container Insights to see:

Performance monitoring dashboard
Map view (service topology)
Alarms that fire when CPU/memory crosses a threshold

When should you use Container Insights?

Small team that wants a quick setup without managing extra infrastructure.
Already using CloudWatch for other AWS services.
Needs integration with CloudWatch Alarms and SNS notifications.
Note: cost is based on the number of metrics and the volume of log data ingested.

📈 3. Prometheus + Grafana Stack

3.1. Overview

Prometheus and Grafana are the most widely used monitoring pair in the Kubernetes ecosystem:

Prometheus - Collects and stores metrics as time series, and evaluates alerting rules
Grafana - Dashboards and visualization
Alertmanager - Groups, routes, and delivers the alerts that Prometheus fires

3.2. Installing with kube-prometheus-stack

kube-prometheus-stack is a Helm chart that bundles everything: the Prometheus Operator, Prometheus, Grafana, Alertmanager, node-exporter, and kube-state-metrics.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=15d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
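The retention and volume size set above depend on each other: a longer retention or a higher ingestion rate needs a larger volume. The Prometheus documentation gives a rough capacity formula, needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1-2 bytes per sample. The sketch below applies that formula; the 20,000 samples/s rate is a hypothetical figure and should be replaced with the value measured on your own cluster.

```python
def estimate_prometheus_disk_bytes(retention_days: float,
                                   samples_per_second: float,
                                   bytes_per_sample: float = 2.0) -> float:
    """Rough disk estimate: retention seconds * ingest rate * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


def to_gib(num_bytes: float) -> float:
    """Convert bytes to GiB (the unit used by the 50Gi volume request)."""
    return num_bytes / (1024 ** 3)


# Hypothetical ingestion rate; the real one can be read from
# rate(prometheus_tsdb_head_samples_appended_total[5m]).
needed = estimate_prometheus_disk_bytes(retention_days=15, samples_per_second=20_000)
print(f"Estimated disk for 15d retention: {to_gib(needed):.1f} GiB")
```

Under these assumed numbers the estimate lands just under the 50Gi request, which leaves little headroom; a cluster with more series would need a larger volume or a shorter retention.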

3.3. Configuring with Terraform

resource "helm_release" "kube_prometheus" {
  name       = "kube-prometheus"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"

  create_namespace = true

  values = [
    file("${path.module}/values/prometheus-values.yaml")
  ]
}

The prometheus-values.yaml file:

grafana:
  adminPassword: your-secure-password
  persistence:
    enabled: true
    size: 10Gi
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/scheme: internal
    hosts:
      - grafana.internal.example.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

alertmanager:
  config:
    route:
      receiver: slack
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/xxx/yyy/zzz
            channel: '#eks-alerts'
            title: '{{ .CommonAnnotations.summary }}'

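3.4. ServiceMonitor - Scraping metrics from applications

For Prometheus to scrape metrics from an application, create a ServiceMonitor. With the chart's default settings, the release label must match the Helm release name so that the operator picks the object up, and port refers to a named port on the application's Service: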

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
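The ServiceMonitor above assumes the application already serves metrics in the Prometheus text exposition format at /metrics. As a minimal sketch of what that endpoint returns, the following uses only the Python standard library; a real service would more likely use a Prometheus client library. The metric name myapp_requests_total and port 8000 are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counter keyed by HTTP status label (illustrative only).
REQUEST_COUNTS = {"200": 0, "500": 0}


def render_metrics(counts: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_requests_total Total HTTP requests handled.",
        "# TYPE myapp_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'myapp_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(REQUEST_COUNTS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


def serve(port: int = 8000) -> None:
    """Blocking call; port 8000 is an assumption and must be exposed
    on the Service as a port named "metrics"."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; the Service in front of the Pod then needs a port named metrics so that the ServiceMonitor's port field resolves.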

3.5. Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: eks-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
    - name: pod-alerts
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted repeatedly within 15 minutes."

        - alert: HighCPUUsage
          expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace) / sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, namespace)) > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} CPU > 90% of requests"

        - alert: HighMemoryUsage
          expr: (sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace) / sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, namespace)) > 0.85
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} memory > 85% of limits"
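The HighCPUUsage expression divides each Pod's CPU usage rate by its CPU requests and compares the result with 0.9. The sketch below walks through that arithmetic with two counter samples. It is a simplification: Prometheus's rate() also handles counter resets and extrapolates to the window edges, and all values here are hypothetical.

```python
def simple_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last_value - first_value) / window_seconds


def cpu_usage_ratio(cpu_seconds_start: float, cpu_seconds_end: float,
                    window_seconds: float, cpu_request_cores: float) -> float:
    """CPU cores consumed on average, divided by the cores requested."""
    cores_used = simple_rate(cpu_seconds_start, cpu_seconds_end, window_seconds)
    return cores_used / cpu_request_cores


def should_alert(ratio: float, threshold: float = 0.9) -> bool:
    """Mirror the '> 0.9' comparison in the rule expression."""
    return ratio > threshold


# Hypothetical Pod: the CPU counter grew by 285 CPU-seconds over a 5m window,
# and the Pod requests 1 core (1000m).
ratio = cpu_usage_ratio(1000.0, 1285.0, 300.0, 1.0)
print(f"usage/request = {ratio:.2f}, alert = {should_alert(ratio)}")
```

In the rule itself the condition must also hold continuously for the for: duration (10m) before the alert fires, which this sketch does not model.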

3.6. Accessing Grafana

# Port-forward for local access
kubectl port-forward svc/kube-prometheus-grafana 3000:80 -n monitoring

Grafana ships with many prebuilt dashboards:

Kubernetes / Compute Resources / Cluster - Cluster-wide CPU/memory overview
Kubernetes / Compute Resources / Namespace (Pods) - Per-namespace detail
Kubernetes / Networking / Cluster - Network traffic
Node Exporter - Per-node detail for each EC2 instance

📝 4. Logging with Fluent Bit

4.1. Why centralized logging?

By default, container logs live only on the node (/var/log/containers/). When a Pod is terminated or a node is replaced, those logs are lost. Centralized logging lets you:

Retain logs long term
Search and analyze logs in one place
Correlate logs across microservices

4.2. What is Fluent Bit?

Fluent Bit is a lightweight log processor that runs as a DaemonSet, one Pod per node. It uses fewer resources than Fluentd (roughly 50 MB of RAM).

4.3. Installing Fluent Bit for CloudWatch Logs

Using an EKS add-on (add-on names and versions differ by region and cluster version; confirm them with aws eks describe-addon-versions before running this):

aws eks create-addon \
  --cluster-name my-eks \
  --addon-name aws-for-fluent-bit \
  --addon-version v2.31.12-eksbuild.1

Or with Helm:

helm repo add fluent https://fluent.github.io/helm-charts
helm repo update

helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --values fluent-bit-values.yaml

The fluent-bit-values.yaml file:

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/FluentBitRole

config:
  inputs: |
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            cri
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
        Refresh_Interval  10

  filters: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [FILTER]
        Name    grep
        Match   kube.*
        Exclude log ^$

  outputs: |
    [OUTPUT]
        Name                cloudwatch_logs
        Match               kube.*
        region              ap-southeast-1
        log_group_name      /eks/my-eks/containers
        log_stream_prefix   fluentbit-
        auto_create_group   true
        log_retention_days  30
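The tail input above uses the cri parser because containerd writes every log line as a timestamp, a stream name, a partial/full tag, and then the message. The sketch below shows that split in Python. The regular expression follows the commonly documented CRI layout and is an illustration rather than Fluent Bit's own code; the sample line is hypothetical.

```python
import re
from typing import Optional

# CRI log line layout: "<RFC3339 time> <stdout|stderr> <P|F> <message>"
# P marks a partial line, F marks the final (or only) chunk.
CRI_LINE = re.compile(
    r"^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<message>.*)$"
)


def parse_cri_line(line: str) -> Optional[dict]:
    """Return the parsed fields, or None when the line is not CRI-formatted."""
    match = CRI_LINE.match(line)
    return match.groupdict() if match else None


sample = '2024-05-01T10:15:30.123456789Z stdout F {"level":"info","msg":"started"}'
record = parse_cri_line(sample)
print(record)
```

When the message part is itself JSON, as in the sample, the kubernetes filter's Merge_Log On setting lifts its keys into the record so they can be queried as fields.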

4.4. IAM policy for Fluent Bit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "logs:PutRetentionPolicy"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/eks/*"
    }
  ]
}

Logging best practices

Use structured (JSON) logging in the application so logs are easy to query.
Set log_retention_days to avoid paying for unbounded storage.
Use a grep filter to drop empty log lines and reduce noise.
For high log volumes, send logs to S3 and query them with Athena instead of CloudWatch.
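As an illustration of the structured-logging recommendation, here is a minimal sketch that emits one JSON object per log line using only the Python standard library. The logger name my-app and the request_id field are hypothetical; production services often use a dedicated JSON logging library instead.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional field attached via logger.info(..., extra={"request_id": ...}).
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            payload["request_id"] = request_id
        return json.dumps(payload)


def build_logger(name: str = "my-app") -> logging.Logger:
    """Create a logger that writes JSON lines to stdout, where the node agent tails them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order created", extra={"request_id": "req-123"})
```

Because every line is valid JSON, fields such as level or request_id can be filtered on directly in the log backend instead of being matched with free-text patterns.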

🔍 5. Distributed Tracing

5.1. Why tracing?

In a microservices architecture, a single request can pass through 5-10 services. When an error or high latency occurs, tracing shows which service is responsible.

5.2. AWS X-Ray

X-Ray is the AWS-native tracing service and integrates with other AWS services out of the box.

On EKS, install the ADOT add-on, whose collector can export traces to X-Ray in place of a standalone X-Ray daemon:

aws eks create-addon \
  --cluster-name my-eks \
  --addon-name adot \
  --addon-version v0.92.1-eksbuild.1

5.3. OpenTelemetry Collector (ADOT)

AWS Distro for OpenTelemetry (ADOT) is the AWS-supported distribution of OpenTelemetry. It can send traces and metrics to multiple backends.

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: monitoring
spec:
  mode: deployment
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024

    exporters:
      awsxray:
        region: ap-southeast-1
      awsemf:
        region: ap-southeast-1
        namespace: EKS/MyApp

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsemf]
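The batch processor in the collector configuration above buffers telemetry and forwards it either when send_batch_size items have accumulated or when timeout has elapsed, whichever comes first. The sketch below models that rule in Python. It is a simplified illustration, not the collector's implementation: the real processor flushes from a background timer, while here the caller invokes maybe_flush, and the injectable clock exists only to make the behaviour easy to check.

```python
import time
from typing import Callable, List


class Batcher:
    """Buffer items and flush when the batch is full or the timeout has elapsed."""

    def __init__(self, flush: Callable[[List[str]], None],
                 send_batch_size: int = 1024,
                 timeout_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._flush = flush
        self._size = send_batch_size
        self._timeout = timeout_seconds
        self._clock = clock
        self._items: List[str] = []
        self._started = clock()

    def add(self, item: str) -> None:
        if not self._items:
            self._started = self._clock()  # timer starts with the first buffered item
        self._items.append(item)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        full = len(self._items) >= self._size
        expired = bool(self._items) and self._clock() - self._started >= self._timeout
        if full or expired:
            batch, self._items = self._items, []
            self._flush(batch)
```

A larger batch size means fewer export calls at the cost of added latency, and the timeout caps how long a span can wait in the buffer on a quiet service.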

5.4. Instrumenting the application

Example of auto-instrumentation for a Python application:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation
  namespace: default
spec:
  exporter:
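    # The operator names the Service <collector name>-collector; Python
    # auto-instrumentation exports OTLP over HTTP, so target port 4318.
    endpoint: http://adot-collector-collector.monitoring:4318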
  propagators:
    - tracecontext
    - baggage
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest

Add the annotation to the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-python-app
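spec:
  # selector and matching Pod labels are required for a valid Deployment
  selector:
    matchLabels:
      app: my-python-app
  template:
    metadata:
      labels:
        app: my-python-app
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"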
    spec:
      containers:
        - name: app
          image: my-python-app:latest
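The tracecontext propagator listed in the Instrumentation resource carries trace identity between services in the W3C traceparent HTTP header, formatted as version, trace ID, parent span ID, and flags. Auto-instrumentation handles this automatically; the sketch below only illustrates what is being passed along, using the Python standard library. The function names are hypothetical and not part of any OpenTelemetry API.

```python
import re
import secrets
from typing import Optional

# traceparent: "00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>"
TRACEPARENT = re.compile(
    r"^00-(?P<trace_id>[0-9a-f]{32})-(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace id and 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None when the header is malformed."""
    match = TRACEPARENT.match(header.strip())
    if not match:
        return None
    fields = match.groupdict()
    # All-zero ids are invalid per the W3C Trace Context specification.
    if set(fields["trace_id"]) == {"0"} or set(fields["parent_id"]) == {"0"}:
        return None
    return fields


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace id and mint a new span id for the outgoing call."""
    parent = parse_traceparent(incoming)
    if parent is None:
        return new_traceparent()
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"
```

Because every hop reuses the same trace ID, the backend can stitch the spans from all services into one trace, which is what makes the per-service latency breakdown possible.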

🚨 6. Alerting & Notification

6.1. CloudWatch Alarms

aws cloudwatch put-metric-alarm \
  --alarm-name eks-high-cpu \
  --namespace ContainerInsights \
  --metric-name pod_cpu_utilization \
  --dimensions Name=ClusterName,Value=my-eks \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:ap-southeast-1:123456789012:eks-alerts
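With --period 300, --evaluation-periods 2, and --threshold 80, the alarm above fires only when two consecutive 5-minute averages are both above 80. The sketch below models that evaluation rule in Python. It is a simplification that ignores CloudWatch's missing-data handling and the optional datapoints-to-alarm setting, and the datapoints are hypothetical.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float = 80.0,
                evaluation_periods: int = 2) -> str:
    """Return "ALARM" when the most recent N datapoints all exceed the threshold."""
    if len(datapoints) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = datapoints[-evaluation_periods:]
    # GreaterThanThreshold is a strict comparison: exactly 80 does not breach.
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Hypothetical 5-minute averages of pod_cpu_utilization, oldest first.
print(alarm_state([62.0, 85.5, 91.2]))
print(alarm_state([85.5, 91.2, 70.0]))
```

Requiring two periods trades about five extra minutes of detection delay for fewer notifications caused by short CPU spikes.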

6.2. Alertmanager Routes (Prometheus stack)

Configure multi-channel alerting:

alertmanager:
  config:
    route:
      receiver: default
      group_by: ['alertname', 'namespace']
      routes:
        - match:
            severity: critical
          receiver: pagerduty
        - match:
            severity: warning
          receiver: slack

    receivers:
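      - name: default
        slack_configs:
          # api_url is required here unless global.slack_api_url is set
          - api_url: https://hooks.slack.com/services/xxx
            channel: '#eks-alerts'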

      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/xxx
            channel: '#eks-warnings'
            title: '[{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'

      - name: pagerduty
        pagerduty_configs:
          - service_key: your-pagerduty-key
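The routing tree above sends each alert to the first child route whose labels match and falls back to the root receiver otherwise. The sketch below reproduces that decision in Python for the two routes shown. It is a simplified model that covers equality matching only and leaves out options such as continue, grouping, and inhibition.

```python
from typing import Dict, List

# Mirrors the two child routes in the configuration above.
ROUTES: List[dict] = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "slack"},
]


def pick_receiver(labels: Dict[str, str], routes: List[dict] = ROUTES,
                  default_receiver: str = "default") -> str:
    """Return the receiver of the first child route whose match labels all
    equal the alert's labels; otherwise return the root receiver."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default_receiver


print(pick_receiver({"alertname": "HighMemoryUsage", "severity": "critical"}))
print(pick_receiver({"alertname": "PodCrashLooping", "severity": "warning"}))
print(pick_receiver({"alertname": "Watchdog", "severity": "none"}))
```

Route order matters: a broad matcher placed first would capture alerts before the more specific routes below it are considered.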

📊 7. Comparing the options

Metrics

Solution	Managed	Cost	Features	Best for
CloudWatch Container Insights	✔	Per data ingested	Built-in dashboards, Alarms	Small teams, AWS-native
Amazon Managed Service for Prometheus (AMP)	✔	Per metrics ingested	PromQL, Prometheus-compatible	Large scale, multi-cluster
Self-hosted Prometheus	❌	EC2/EBS storage	Full control	Teams with Kubernetes experience

Logging

Solution	Managed	Cost	Features	Best for
CloudWatch Logs	✔	Per data ingested	Logs Insights queries	Simple setups, AWS integration
Amazon OpenSearch	✔	Per instance + storage	Full-text search, dashboards	High log volume, search-heavy use
S3 + Athena	✔	Very low	SQL queries	Archival, infrequent queries

Tracing

Solution	Managed	Cost	Features	Best for
AWS X-Ray	✔	Per traces recorded	Service map, analytics	AWS-native
Jaeger	❌	Self-hosted	OpenTracing compatible	Teams using open source
ADOT + X-Ray	✔	Per traces	OpenTelemetry standard	Open standard plus managed backend

🏗️ 8. Recommended monitoring architecture for production

Basic stack (cost-effective)

Advanced stack (full observability)

Production monitoring checklist

✔ Metrics: CloudWatch Container Insights or Prometheus is installed
✔ Logging: the Fluent Bit DaemonSet runs on every node
✔ Log retention: a retention policy is set (30-90 days)
✔ Alerting: alerts are configured for CPU, memory, Pod restarts, and Node Not Ready
✔ Dashboard: a Grafana or CloudWatch dashboard exists for each namespace/service
✔ Tracing: ADOT or X-Ray covers critical services
✔ On-call: alerts reach the right channel (Slack, PagerDuty, email)
✔ Runbook: every alert has a runbook describing how to respond

9. Summary

Monitoring and observability in EKS rest on three pillars:

Metrics - CloudWatch Container Insights (quick and easy) or Prometheus + Grafana (powerful and flexible)
Logs - Fluent Bit shipping to CloudWatch Logs or OpenSearch
Traces - ADOT + X-Ray for distributed tracing

Choose the option that fits your team size and budget. Start with CloudWatch Container Insights, then move to Prometheus + Grafana as the cluster grows.
