Part 13: Cluster API, FinOps, and AI/ML workloads on Kubernetes

The Complete Kubernetes series, 13 parts:
Part 1: Overview and architecture
Part 2: Cluster installation
Part 3: Workloads
Part 4: Networking
Part 5: Storage
Part 6: Configuration
Part 7: Security
Part 8: Scheduling & Autoscaling
Part 9: Helm, Operator, CRD
Part 10: Production
Part 11: Service Mesh
Part 12: Multi-tenancy & Multi-cluster
Part 13: Cluster API, FinOps, AI/ML ← you are here

Introduction

Three topics are currently hot in the Kubernetes ecosystem: managing clusters through the Kubernetes API (Cluster API), FinOps (cost control), and AI/ML workloads (LLMs, training, inference). This is the final part of the series; it gathers the advanced topics that a present-day DevOps engineer should know.

1. Cluster API (CAPI): the cluster as a Kubernetes object

The CAPI philosophy: if Kubernetes is good at managing applications, use it to manage Kubernetes itself.

1.1 Architecture

┌──────────────────────────────┐
│   Management Cluster         │
│   (controllers)              │
│                              │
│  - cluster-api               │
│  - cluster-api-provider-aws  │ ← infrastructure provider
│  - kubeadm-control-plane     │ ← control-plane provider
│  - kubeadm-bootstrap         │ ← bootstrap provider
└──────┬───────────────────────┘
       │ reconcile
       ▼
┌──────────────────────────────┐
│   Workload Cluster(s)        │
│   (target — provisioned)     │
└──────────────────────────────┘

Three kinds of provider:

Bootstrap: generates the userdata that turns a machine into a node (kubeadm is the usual choice).
Control plane: manages etcd, the API server, and the scheduler (KubeadmControlPlane, or a managed control plane).
Infrastructure: calls the cloud or hardware API (AWS, Azure, GCP, vSphere, MAAS, bare metal with Talos or Tinkerbell…).

1.2 Cluster object

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east-1
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
    services:
      cidrBlocks: ["10.96.0.0/12"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: prod-us-east-1-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: prod-us-east-1

1.3 MachineDeployment: worker pool

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: prod-us-east-1-workers
  namespace: clusters              # same namespace as the Cluster object
spec:
  clusterName: prod-us-east-1
  replicas: 5
  template:
    spec:
      clusterName: prod-us-east-1  # required on the Machine template as well
      version: v1.31.0
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: prod-us-east-1-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: prod-us-east-1-workers

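Change replicas to 10 and CAPI provisions five more EC2 instances and joins them to the cluster. Change version to v1.32.0 and the worker pool goes through a rolling upgrade that replaces machines instead of upgrading them in place. Upgrade the control plane (the KubeadmControlPlane object) before the workers, so that no kubelet runs a newer version than the API server.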
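To make the "edit the spec, let the controller converge" idea concrete, here is a minimal sketch in Python. It is a toy model, not CAPI's actual controller logic: the `Machine` class and `plan_reconcile` function are hypothetical names, and a real MachineDeployment rolls machines gradually (surge and unavailability limits) rather than in one step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    version: str


def plan_reconcile(machines, desired_replicas, desired_version):
    """Compare observed machines with the desired spec and return the work to do."""
    up_to_date = [m for m in machines if m.version == desired_version]
    outdated = [m for m in machines if m.version != desired_version]
    keep = up_to_date[:desired_replicas]       # machines that already match the spec
    surplus = up_to_date[desired_replicas:]    # matching machines beyond the replica count
    return {
        "create": desired_replicas - len(keep),             # new machines to provision
        "delete": [m.name for m in outdated + surplus],     # machines to drain and remove
    }


pool = [Machine(f"worker-{i}", "v1.31.0") for i in range(5)]
print(plan_reconcile(pool, 10, "v1.31.0"))  # scale-out: only creations
print(plan_reconcile(pool, 5, "v1.32.0"))   # upgrade: every machine is replaced
```

The same function covers both cases from the text: raising the replica count only creates machines, while changing the version replaces every machine in the pool.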

1.4 Comparison with alternatives

Tool	Model	When to use
Cluster API	Kubernetes-native CRDs, declarative	Multi-cluster, GitOps, self-service
Terraform / Pulumi	IaC outside Kubernetes	One to a few clusters, existing IaC pipeline
eksctl / az aks / gcloud	CLI wrapper	Quick start, few clusters
Crossplane	Kubernetes API for cloud resources	Managing both cloud resources and clusters through the Kubernetes API
Rancher	Platform UI with CAPI underneath	Operators who are not Kubernetes specialists

1.5 CAPI best practices

Keep the management cluster small (3 nodes), and do not have it manage itself: if it breaks, nothing is left to repair it.
Back up the management cluster's etcd every hour.
Pin provider versions so that upgrades do not break things unexpectedly.
Manage Cluster CRs with GitOps: let Argo CD apply the Cluster and MachineDeployment objects.
Keep the node-role.kubernetes.io/control-plane taint on control-plane nodes so that regular workloads are not scheduled onto them.

2. FinOps: controlling Kubernetes costs

2.1 Why Kubernetes costs more than expected

Over-provisioning: requests are often 3–5× actual usage.
Idle nodes use very little CPU but are still billed in full.
Cross-AZ data transfer inside the cluster: roughly 0.01 USD/GB, charged in each direction on AWS, adds up quickly at volume.
One LoadBalancer per Service: each cloud load balancer costs roughly 15–25 USD/month.
EBS volumes that are not deleted after the PVC is deleted (Retain reclaim policy).
Backup snapshots that pile up.

2.2 OpenCost / Kubecost

OpenCost (a CNCF project) allocates cloud spend to namespaces, pods, and teams. Kubecost is the commercial product built on top of it.

helm upgrade --install opencost opencost \
  --repo https://opencost.github.io/opencost-helm-chart \
  --namespace opencost --create-namespace

After installation, you can query cost by:

Namespace, deployment, container.
Compute (CPU/RAM/GPU), storage (PV), network egress.
On-demand vs. spot.

Mapping resource requests to dollars helps you:

Charge back to teams by namespace.
Detect over-requesting workloads (paying for capacity that is never used).
Show costs back to developers so they can improve on their own.
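The request-to-dollar mapping can be sketched in a few lines of Python. This is an illustration of the idea, not OpenCost's implementation: the function names, the pod dictionaries, and the per-core and per-GiB hourly prices are all assumed values chosen for the example.

```python
def pod_hourly_cost(cpu_cores, mem_gib, cpu_price, mem_price):
    """Price a pod by what it requests, not by what it uses."""
    return cpu_cores * cpu_price + mem_gib * mem_price


def cost_by_namespace(pods, cpu_price=0.03, mem_price=0.004):
    """Sum request-based hourly cost per namespace (prices are assumptions)."""
    totals = {}
    for pod in pods:
        cost = pod_hourly_cost(pod["cpu"], pod["mem_gib"], cpu_price, mem_price)
        totals[pod["namespace"]] = totals.get(pod["namespace"], 0.0) + cost
    return totals


def idle_fraction(requested, used):
    """Share of the requested resource that is paid for but never used."""
    return (requested - used) / requested


pods = [
    {"namespace": "checkout", "cpu": 2, "mem_gib": 4},
    {"namespace": "checkout", "cpu": 1, "mem_gib": 2},
    {"namespace": "search", "cpu": 4, "mem_gib": 8},
]
print(cost_by_namespace(pods))
print(idle_fraction(requested=4, used=1))
```

Pricing by request rather than usage is the design choice that makes over-requesting visible: a pod that requests 4 cores and uses 1 is billed for 4, and the idle fraction shows how much of that is waste.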

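2.3 Karpenter consolidation

Covered in Part 8. When cluster utilization is low, Karpenter moves pods onto fewer nodes and terminates the surplus ones. It is one of the easiest ways to cut compute cost; savings of 30–50% are plausible on an under-utilized cluster, though the actual figure depends on the workload.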
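The intuition behind consolidation is bin packing. The sketch below uses first-fit decreasing on a single dimension (CPU) to estimate how few nodes a set of pods needs. It is a simplification for illustration only: Karpenter itself weighs several resources, instance prices, and disruption constraints, and `nodes_needed` is a hypothetical helper.

```python
def nodes_needed(pod_cpu_requests, node_cpu_capacity):
    """First-fit decreasing bin packing: place the largest pods first."""
    free = []  # remaining CPU on each node opened so far
    for request in sorted(pod_cpu_requests, reverse=True):
        if request > node_cpu_capacity:
            raise ValueError(f"pod requesting {request} CPU fits on no node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request  # reuse an existing node
                break
        else:
            free.append(node_cpu_capacity - request)  # open a new node
    return len(free)


# Six pods spread one per node would occupy six 4-CPU nodes; packed, they need two.
print(nodes_needed([2, 2, 1, 1, 1, 1], node_cpu_capacity=4))
```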

2.4 Spot/preemptible

EKS with Karpenter can mix on-demand and spot capacity. A common pattern:

Stateless apps, batch jobs, CI runners → spot (up to 70–90% cheaper than on-demand).
Databases, control plane, ingress → on-demand.
PodDisruptionBudgets protect important workloads.

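apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      nodeClassRef:                  # required in karpenter.sh/v1
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # hypothetical EC2NodeClass name
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]   # allows Graviton, roughly 20% cheaper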

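2.5 Right-sizing: the biggest lever

VPA in recommendation-only mode suggests appropriate requests:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: my-app }           # hypothetical name
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: my-app }  # hypothetical target
  updatePolicy:
    updateMode: "Off"    # recommend only, never apply; quoted because a bare Off parses as a YAML boolean
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources: ["cpu", "memory"]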

kubectl get vpa -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: cpu={.status.recommendation.containerRecommendations[0].target.cpu} mem={.status.recommendation.containerRecommendations[0].target.memory}{"\n"}{end}'
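To see roughly where such a recommendation comes from, the following sketch derives a CPU request from usage samples: take a high percentile and add headroom. This is a deliberately simplified stand-in, not the VPA recommender's actual algorithm; the 90th percentile, the 15% headroom, and both function names are assumptions for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list, using integer arithmetic."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def recommend_request(usage_millicores, pct=90, headroom_pct=15):
    """Suggested CPU request in millicores: percentile of usage plus headroom."""
    base = percentile(usage_millicores, pct)
    return (base * (100 + headroom_pct) + 99) // 100  # round up to a whole millicore


usage = [100, 120, 110, 130, 500, 125, 115, 105, 135, 140]  # millicores, one spike
print(recommend_request(usage))
```

Using a percentile instead of the maximum keeps a single spike (the 500m sample) from inflating the request, which is the whole point of right-sizing.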

2.6 What belongs on a cost dashboard

Total cluster cost for the month, compared with the previous month.
Cost by namespace / team, as a ranking.
Top 10 most expensive deployments.
Idle %: (requests − usage) / requests.
Egress cost by destination AZ/region.

2.7 FinOps alerts worth having

Daily cost > 110% of the trailing 7-day baseline.
PVC not attached to any pod for ≥ 30 days: orphaned.
LoadBalancer with zero traffic for ≥ 7 days.
Actual node utilization < 30% over 24 h: under-utilized.
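The first alert in the list above reduces to a small comparison. A minimal sketch, assuming the baseline is the plain average of the last seven daily totals (the function name and that choice of baseline are assumptions; a production rule would more likely live in a Prometheus alert):

```python
def cost_alert(daily_costs, today_cost, threshold=1.10):
    """Return True when today's cost exceeds threshold × the 7-day average."""
    if len(daily_costs) < 7:
        raise ValueError("need at least 7 days of history")
    baseline = sum(daily_costs[-7:]) / 7
    return today_cost > threshold * baseline


history = [100.0] * 7
print(cost_alert(history, 111.0))  # above 110% of baseline
print(cost_alert(history, 109.0))  # within the allowed band
```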

3. AI/ML workloads: Kubernetes as the new ML platform

3.1 GPU scheduling

Kubernetes has no built-in knowledge of GPUs. The NVIDIA Device Plugin DaemonSet exposes them as the nvidia.com/gpu resource:

resources:
  requests:
    cpu: "8"
    memory: "32Gi"
    nvidia.com/gpu: 1
  limits:
    nvidia.com/gpu: 1

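Documentation: NVIDIA/k8s-device-plugin. By default each nvidia.com/gpu unit is a whole GPU, used exclusively by one pod. To share one GPU among several pods:

MIG (Multi-Instance GPU, A100/H100): hardware partitioning.
Time slicing: context switching in the driver, without memory isolation between pods.
vGPU (NVIDIA vGPU, enterprise): sharing through virtualization, requires a license.

3.2 A separate GPU node pool

A GPU node costs roughly 3–10× a CPU node. Use a taint plus a toleration so that only ML workloads run there: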

kubectl taint node gpu-1 nvidia.com/gpu=true:NoSchedule

tolerations:
- key: nvidia.com/gpu
  operator: Exists
nodeSelector:
  node.kubernetes.io/instance-type: g5.xlarge

3.3 Kubeflow: a full ML platform

Kubeflow bundles components for the whole ML lifecycle on Kubernetes:

Component	Role
Notebooks	Jupyter / VS Code per user
Pipelines	Workflow DAGs (Argo Workflows underneath)
Training Operator	PyTorchJob, TFJob, MPIJob for distributed training
Katib	Hyperparameter tuning
KServe	Model serving (autoscaling, canary); now a standalone project, commonly deployed with Kubeflow
Feast	Feature store; a separate project that integrates with Kubeflow

3.4 KServe: model serving with canary and scale-to-zero

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 0           # scale to zero when there are no requests
    maxReplicas: 5
    scaleTarget: 60
    scaleMetric: concurrency
    sklearn:
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"

In its serverless deployment mode KServe runs on Knative: with no requests the pods scale to 0, and the next request pays a cold start while the model loads again.
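The effect of scaleMetric: concurrency with scaleTarget: 60 can be approximated as "in-flight requests divided by the per-pod target, rounded up and clamped". The sketch below is a simplified model and not the Knative autoscaler itself, which also averages over time windows and has a panic mode; `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(in_flight, target_per_pod, min_replicas, max_replicas):
    """Replica count needed to keep per-pod concurrency at or below the target."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(in_flight / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


# Values from the manifest above: target 60, min 0, max 5.
for load in (0, 60, 61, 1000):
    print(load, desired_replicas(load, 60, 0, 5))
```

With minReplicas: 0 the zero-load case returns 0, which is exactly the scale-to-zero behavior; raising minReplicas to 1 trades idle cost for no cold start.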

3.5 LLM inference: vLLM, TGI, Ollama, NIM

To serve LLMs (Llama, Mistral, Qwen, GPT-OSS):

vLLM: high-throughput inference engine built around PagedAttention.
TGI (Text Generation Inference, from Hugging Face): production-grade serving.
Ollama: local development.
NVIDIA NIM: NVIDIA's official, pre-optimized inference containers.

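apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 2
  selector:
    matchLabels: { app: vllm }
  template:
    metadata:
      labels: { app: vllm }
    spec:
      nodeSelector:
        nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        args:
        - --model=meta-llama/Meta-Llama-3-70B-Instruct   # gated model: needs a Hugging Face token
        - --tensor-parallel-size=2                       # must match the GPU count below
        - --max-model-len=8192
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef: { name: hf-token, key: token }  # hypothetical Secret
        resources:
          limits:
            nvidia.com/gpu: 2
            memory: 200Gi
        ports:
        - containerPort: 8000
        volumeMounts:
        - { name: shm, mountPath: /dev/shm }   # shared memory for the tensor-parallel workers
      volumes:
      - name: shm
        emptyDir: { medium: Memory }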
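A quick back-of-the-envelope calculation shows why this manifest asks for two 80 GB GPUs. The sketch assumes BF16 weights (2 bytes per parameter), roughly 70 billion parameters, and a model shape of 80 layers, 8 KV heads, and head dimension 128; those numbers are assumptions taken to be the published Llama 3 70B configuration, and the helper names are hypothetical.

```python
def weights_gb(params_billion, bytes_per_param=2):
    """Approximate weight size in GB (1e9 bytes); 2 bytes per parameter for BF16."""
    return params_billion * bytes_per_param


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache for one token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


total_weights = weights_gb(70)                    # whole model in BF16
per_gpu = total_weights / 2                       # split by --tensor-parallel-size=2
per_token = kv_cache_bytes_per_token(80, 8, 128)  # assumed model shape
per_sequence_gib = per_token * 8192 / 2**30       # one full 8192-token sequence
print(per_gpu, per_token, per_sequence_gib)
```

Under these assumptions each GPU holds about 70 GB of weights, which leaves only a thin margin on an 80 GB card for KV cache and activations. That is why --max-model-len is capped, and why a single GPU cannot host this model without quantization.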

3.6 Batch and training

Argo Workflows: DAG workflow engine, one of the most widely used in the Kubernetes ecosystem.
Volcano: batch scheduler optimized for gang scheduling (all pods start together) and priority queues.
Kueue: job queueing project from Kubernetes SIG Scheduling.
Ray: distributed Python, run through the KubeRay operator.
JobSet: groups multiple Jobs into one unit, used for distributed training.

3.7 Gang scheduling

Distributed ML training needs all-or-nothing placement: if only 7 of 8 worker pods can be scheduled, those 7 hold their GPUs while doing no useful work. The default Kubernetes scheduler does not provide this out of the box, so you need Volcano (gang scheduling in the scheduler) or Kueue (all-or-nothing admission of the whole job).
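The all-or-nothing rule is easy to state in code. The sketch below is a toy admission check and not how Volcano or Kueue are implemented: `gang_admit` and the node map are hypothetical, and it only considers GPU counts.

```python
def gang_admit(free_gpus_per_node, workers, gpus_per_worker):
    """Return {node: worker_count} if every worker fits, otherwise None."""
    placement = {}
    remaining = workers
    for node, free in free_gpus_per_node.items():
        fit = min(free // gpus_per_worker, remaining)  # workers this node can host
        if fit > 0:
            placement[node] = fit
            remaining -= fit
    # Admit the job only when the whole gang is placed; never start a partial set.
    return placement if remaining == 0 else None


print(gang_admit({"gpu-a": 4, "gpu-b": 3}, workers=8, gpus_per_worker=1))  # 7 of 8 fit
print(gang_admit({"gpu-a": 4, "gpu-b": 4}, workers=8, gpus_per_worker=1))
```

In the first call seven workers would fit, but the function returns None, so no GPU is reserved until the eighth can be placed too.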

3.8 Storage for ML

Large datasets → object storage (S3, GCS) plus a local NVMe cache.
Model checkpoints → ReadWriteMany PVC (Filestore, EFS, or another CSI driver).
Cache shared between pods on the same node → a DaemonSet that populates a hostPath directory which the pods mount.
JuiceFS, Alluxio: cache layers in front of object storage.

3.9 Networking for ML

Distributed training: AllReduce traffic between workers is very heavy. You need:

Nodes in the same placement group (AWS) or under the same top-of-rack switch.
RDMA / EFA for NCCL, at ≥ 100 Gbps.
Pods on the same node if the job needs ≤ 8 GPUs, in the same rack if it needs more than 8.
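To put a number on "very heavy": with ring all-reduce, each of N workers sends 2(N−1)/N times the gradient size per synchronization step. The sketch below applies that formula; the 8 GB gradient size and the link speeds are assumed example values, and it ignores latency and overlap with computation.

```python
def ring_allreduce_bytes_per_worker(gradient_bytes, workers):
    """Bytes each worker sends in one ring all-reduce: 2 * (N - 1) / N * size."""
    if workers < 2:
        return 0.0  # a single worker has nothing to exchange
    return 2 * (workers - 1) * gradient_bytes / workers


def transfer_seconds(bytes_sent, link_gbps):
    """Time to push the bytes through a link of the given speed, ignoring latency."""
    return bytes_sent * 8 / (link_gbps * 1e9)


sent = ring_allreduce_bytes_per_worker(8_000_000_000, workers=8)  # assumed 8 GB of gradients
print(sent, transfer_seconds(sent, 100), transfer_seconds(sent, 10))
```

Under these assumptions every step moves 14 GB per worker: about 1.1 s at 100 Gbps but over 11 s at 10 Gbps, which is why placement and fast interconnects dominate training throughput.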

4. Edge Kubernetes

When you need to run Kubernetes at the edge (factory, retail, IoT):

k3s: single binary < 100 MB, SQLite as the default datastore.
KubeEdge: extends Kubernetes to the edge, with CloudCore and EdgeCore components.
OpenYurt: from Alibaba, edge autonomy when connectivity is lost.
MicroK8s: from Canonical, shipped as a snap package.

The choice depends on the number of edge sites, connectivity, OS, and the RAM/CPU budget of the edge devices.

5. WebAssembly on Kubernetes

WASI workloads are spreading: WebAssembly modules run as pods through containerd, using runwasi and its containerd-shim-wasm shims in place of a Linux container runtime. Advantages:

Start-up typically < 10 ms (vs. roughly 500 ms for a container).
Images < 10 MB.
Strict sandboxing built into the runtime model.

Use cases: edge functions, plugin systems (Envoy filters), per-request isolation. It is still too early for mainstream use.

6. AI/ML patterns in practice

6.1 Low-latency online inference

Client → Ingress (TLS, rate limit)
       → KServe / vLLM Deployment (HPA on concurrency)
       → Model loaded in GPU memory
       → Response in milliseconds
Cache: Redis in front of the backend, semantic cache using embeddings.

6.2 Batch offline processing

Trigger (Argo CronWorkflow / EventBridge)
  → Argo Workflow DAG
      ├ Step 1: download data S3
      ├ Step 2: preprocess (Spark/Beam on K8s)
      ├ Step 3: feature engineering
      └ Step 4: training (PyTorchJob) → save model S3
  → Trigger a KServe canary deploy of the new version

6.3 RAG (Retrieval-Augmented Generation)

Doc store (S3) → Embedding pipeline (Argo) → Vector DB (Qdrant/Weaviate/pgvector)
                                              ↑
User query → Embed → ANN search → context → vLLM → answer

Every component is a Kubernetes workload: the vector DB (StatefulSet with NVMe-backed PVCs), the pipeline (Job), LLM serving (Deployment on the GPU node pool), and the gateway (Ingress).
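The query path of the diagram can be sketched without any infrastructure. In the example below, brute-force cosine similarity stands in for the ANN search a vector DB would perform, the two-dimensional vectors stand in for real embeddings, and all names (`top_k`, `build_prompt`, the document IDs) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(docs.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def build_prompt(question, passages):
    """Assemble retrieved passages and the question into one prompt for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = {"gpu": [1.0, 0.0], "cost": [0.0, 1.0], "mix": [0.7, 0.7]}  # toy embeddings
hits = top_k([1.0, 0.1], docs)
print(hits)
print(build_prompt("How do I schedule GPUs?", hits))
```

In a real deployment the `docs` dictionary is the vector DB, `top_k` is its ANN query, and the string returned by `build_prompt` is what gets sent to the vLLM endpoint.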

7. Securing AI workloads

Models can contain sensitive data: use encryption at rest for PVCs that hold models.
Inference endpoints need rate limiting, authentication (JWT), and an audit log of every request (input/output).
Prompt injection: validate input, sanitize output, and never trust model output as the basis for an action.
Prevent leaks of system prompts and model weights: use NetworkPolicy to block arbitrary egress from inference pods.
GPU memory is not guaranteed to be zeroed between pods, so do not share a GPU across tenants in a multi-tenant cluster.

8. Summary

Cluster API brings the Kubernetes declarative pattern to cluster creation itself: real multi-cluster GitOps.
FinOps: OpenCost for visibility, Karpenter for consolidation, spot for stateless workloads, right-sizing with VPA recommendations.
AI/ML: GPU device plugin, separate node pools, Kubeflow/KServe, vLLM for LLMs, gang scheduling for training.
Edge: k3s, KubeEdge, OpenYurt.
WASM workloads are emerging.

9. Closing the series: the road ahead

The 13 parts covered:

Concepts + architecture
Cluster installation
Workloads
Networking
Storage
Configuration
Security
Scheduling + Autoscaling
Helm + Operator + CRD
Production
Service Mesh
Multi-tenancy + Multi-cluster
Cluster API + FinOps + AI/ML

You now have a solid foundation for running Kubernetes in production. From here, every branch has its own community: join the Kubernetes Slack at kubernetes.slack.com, read KEPs (Kubernetes Enhancement Proposals) to see where the project is heading, and follow the CNCF graduated/incubating list to pick tools worth investing in.

Build it, break it, fix it, rinse and repeat.
