[k8s] Troubleshooting: From Pending to CrashLoopBackOff

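The most frustrating part of K8s for newcomers is not learning the concepts. It is running apply, watching the Pod refuse to come up, and having no idea why. Pending, CrashLoopBackOff, ImagePullBackOff, CreateContainerConfigError: each of these states has specific causes, and once you know where to look, about 90% of problems can be located within 5 minutes.

The conclusion first

The three core K8s troubleshooting moves: kubectl describe pod (Events and status), kubectl logs (app logs), and kubectl get events --sort-by='.lastTimestamp' (cluster events). Look at events first to classify the problem, use describe for the details, and finish with logs for app-level errors. Build this habit and you can solve most problems yourself.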
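As a rough sketch of that order, the three commands can be wrapped in one small shell function. The name triage, the default namespace, and the --tail=50 cutoff are assumptions made for illustration; they are not part of kubectl.

```shell
# Hypothetical helper: runs the three triage commands in the order described above.
# Usage: triage <pod-name> [namespace]
triage() {
  pod=$1
  ns=${2:-default}
  # 1. Recent events in the namespace, oldest first, to classify the problem
  kubectl get events -n "$ns" --sort-by='.lastTimestamp'
  # 2. Pod details; the Events block is at the bottom of the output
  kubectl describe pod "$pod" -n "$ns"
  # 3. Application logs (the last 50 lines is an arbitrary cutoff)
  kubectl logs "$pod" -n "$ns" --tail=50
}
```

Calling triage api-server-xxx production would print events, the Pod description, and the log tail in one go.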

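Pod state machine

                ┌──────────┐
                │ Pending  │ ← waiting for scheduling / resources / PVC
                └────┬─────┘
                     │ scheduled + container started
                     ▼
              ┌──────────────┐
              │   Running    │ ← running normally
              └───┬──────┬───┘
                  │      │
         exits OK │      │ fails / killed
                  ▼      ▼
           ┌──────────┐ ┌──────────┐
           │Succeeded │ │  Failed  │ ← OOMKilled / exit code != 0
           └──────────┘ └────┬─────┘
                             │ restartPolicy: Always
                             ▼
                      ┌──────────────────┐
                      │ CrashLoopBackOff │ ← repeated restarts keep failing
                      └──────────────────┘

Strictly speaking, only Pending, Running, Succeeded, and Failed are Pod phases. CrashLoopBackOff is the reason kubectl shows in the STATUS column while the kubelet waits before restarting a crashed container, and the Pod's phase usually stays Running during that time. Treat the diagram as a simplified view.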

Pending: the Pod cannot be scheduled

Symptom: the Pod stays stuck in Pending and never reaches Running.

Cause 1: insufficient resources

$ kubectl describe pod api-server-xxx
Events:
  Warning  FailedScheduling  0/3 nodes are available:
    3 Insufficient cpu, 2 Insufficient memory.

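Fixes:

Lower the Pod's resources.requests
Add Nodes (scale up the cluster)
Clean up Pods that hold resources without using them

# Resource usage of each Node (requires metrics-server)
kubectl top nodes
 
# Pods using the most memory
kubectl top pods --all-namespaces --sort-by=memory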
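One caveat worth spelling out: kubectl top reports actual usage, while the scheduler decides based on the sum of resources.requests already placed on each Node. A minimal sketch for looking at requests instead, assuming the usual layout of kubectl describe nodes output (the function name and the 7-line window are guesses you may need to adjust):

```shell
# Show each Node's "Allocated resources" section, which lists the total
# requests and limits already claimed by the Pods on that Node.
node_allocations() {
  kubectl describe nodes | grep -A 7 "Allocated resources"
}
```

If the requests column is close to 100% while kubectl top shows the Node mostly idle, the requests are oversized rather than the cluster being too small.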

Cause 2: nodeSelector / affinity mismatch

Events:
  Warning  FailedScheduling  0/3 nodes are available:
    3 node(s) didn't match Pod's node affinity/selector.

Fix: make sure a Node carries the label the Pod asks for.

# Show Node labels
kubectl get nodes --show-labels
 
# Add a label
kubectl label nodes node-1 disktype=ssd
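To see the mismatch from both sides, here is a small sketch that prints what the Pod requires and which Nodes offer it. selector_check is a hypothetical name, and it only covers nodeSelector, not the more complex affinity rules.

```shell
# Hypothetical helper: compare what a Pod asks for with what the Nodes offer.
# Usage: selector_check <pod-name> <label>   e.g. selector_check api-server-xxx disktype=ssd
selector_check() {
  pod=$1
  label=$2
  # nodeSelector the Pod requires (empty output means none is set)
  kubectl get pod "$pod" -o jsonpath='{.spec.nodeSelector}'
  echo
  # Nodes that actually carry that label; "No resources found" means a mismatch
  kubectl get nodes -l "$label"
}
```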

Cause 3: the PVC is missing or cannot bind to a PV

Events:
  Warning  FailedScheduling  0/3 nodes are available:
    persistentvolumeclaim "db-data" not found.

Fix:

# Check the PVC status
kubectl get pvc
 
# If the PVC is Pending, check whether its StorageClass exists
kubectl get storageclass
kubectl describe pvc db-data

CrashLoopBackOff: the container keeps starting and dying

Symptom: the Pod status flips back and forth between Running and CrashLoopBackOff.
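The flipping happens because the kubelet waits longer before each restart. By default the documented back-off starts at 10 seconds, doubles after each crash, and is capped at 5 minutes. This small sketch (backoff_delays is a hypothetical name) just prints that schedule:

```shell
# Print the first N restart delays: 10s, doubling each time, capped at 300s.
backoff_delays() {
  n=$1
  delay=10
  out=""
  i=0
  while [ "$i" -lt "$n" ]; do
    out="$out${out:+ }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_delays 7   # prints: 10 20 40 80 160 300 300
```

So after a handful of crashes, the Pod spends most of its time waiting in CrashLoopBackOff and only briefly shows Running.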

Cause 1: the app itself crashes

$ kubectl logs api-server-xxx
Traceback (most recent call last):
  File "app.py", line 15, in <module>
    db.connect(os.environ["DATABASE_URL"])
KeyError: 'DATABASE_URL'

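Fix: read the log and fix the bug. If an environment variable is missing, supply it through a ConfigMap or Secret.

# Log of the container instance that crashed last time
kubectl logs api-server-xxx --previous
 
# If the Pod has multiple containers
kubectl logs api-server-xxx -c sidecar --previous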
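For the missing DATABASE_URL case, one possible fix is to put the value in a Secret and inject it into the Deployment. This is only a sketch: it assumes the Pod belongs to a Deployment named api-server, reuses the db-secret name from elsewhere in this article, and the function name is made up.

```shell
# Hypothetical helper: store DATABASE_URL in a Secret and expose it as an env var.
# Usage: add_database_url <connection-string>
add_database_url() {
  url=$1
  # Create the Secret holding the connection string
  kubectl create secret generic db-secret --from-literal=DATABASE_URL="$url"
  # Import every key of the Secret as an environment variable on the Deployment
  kubectl set env deployment/api-server --from=secret/db-secret
}
```

Changing the env on a Deployment triggers a rollout, so the next Pod should start with the variable set.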

Cause 2: OOMKilled

$ kubectl describe pod api-server-xxx
    Last State:  Terminated
      Reason:    OOMKilled
      Exit Code: 137

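Fix: the container used more memory than resources.limits.memory allows, so the kernel killed it. Either raise the limit or reduce the app's memory use.

# Check per-container memory usage
kubectl top pod api-server-xxx --containers
 
# Raise the memory limit
# or fix the app's memory leak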
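The exit code 137 is not random: exit codes above 128 mean the process was killed by a signal, and 137 - 128 = 9, which is SIGKILL. A short sketch of that arithmetic, plus a way to read the last termination reason without scrolling through describe (both function names are hypothetical):

```shell
# Decode a container exit code: values above 128 mean "killed by signal (code - 128)".
decode_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

# Read the last termination reason of each container in a Pod.
last_termination_reason() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

decode_exit_code 137   # prints: killed by signal 9
```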

Cause 3: probe failures

Events:
  Warning  Unhealthy  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing    Container api failed liveness probe, will be restarted

Fixes:

App starts too slowly → add a startupProbe or raise initialDelaySeconds
Liveness probe hits an expensive endpoint → point it at a lightweight healthz
Timeout too short → raise timeoutSeconds
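The three fixes above can be sketched as a single patch. Everything specific here is an assumption for illustration: the container name api, port 8080, the /healthz path, the timing values, and the Deployment name api-server.

```shell
# Write a strategic-merge patch that adds a startupProbe and a lighter livenessProbe.
# Assumed values: container "api", port 8080, path /healthz, and all timings.
write_probe_patch() {
  cat > "$1" <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: api
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            timeoutSeconds: 3
EOF
}

# Usage (not run here):
#   write_probe_patch probe-patch.yaml
#   kubectl patch deployment api-server --patch-file probe-patch.yaml
```

With these numbers the app gets up to periodSeconds × failureThreshold = 5 × 30 = 150 seconds to start, and the liveness probe only begins once the startup probe has succeeded.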

ImagePullBackOff: the image cannot be pulled

Symptom: the Pod is stuck in ImagePullBackOff or ErrImagePull.

$ kubectl describe pod api-server-xxx
Events:
  Warning  Failed  Failed to pull image "myapp/api:1.2.0":
    rpc error: code = Unknown desc = failed to pull and unpack image:
    failed to resolve reference: pull access denied

Common causes and fixes

Cause	Fix
Wrong image name or tag	Confirm the image exists: `docker pull myapp/api:1.2.0`
Private registry without credentials	Create a `kubernetes.io/dockerconfigjson` Secret and reference it in `imagePullSecrets`
Registry unreachable	From the Node, test with `curl https://ghcr.io/v2/`
Tag overwritten or deleted	Pin a SHA digest instead of a tag: `myapp/api@sha256:abc123...`

# Make sure imagePullSecrets is set
spec:
  imagePullSecrets:
    - name: registry-cred
  containers:
    - name: api
      image: ghcr.io/myorg/api:1.2.0
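The registry-cred Secret referenced above has to exist first. A minimal sketch of creating it, assuming ghcr.io as the registry; the function name and the username/token arguments are placeholders.

```shell
# Hypothetical helper: create the dockerconfigjson Secret that imagePullSecrets points to.
# Usage: create_registry_cred <username> <token>
create_registry_cred() {
  kubectl create secret docker-registry registry-cred \
    --docker-server=ghcr.io \
    --docker-username="$1" \
    --docker-password="$2"
}
```

The Secret must live in the same namespace as the Pod that references it.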

CreateContainerConfigError: the configuration is broken

Symptom: the Pod is stuck in CreateContainerConfigError.

Events:
  Warning  Failed  Error: configmap "app-config" not found
  Warning  Failed  Error: secret "db-secret" not found

Fix: the ConfigMap or Secret does not exist. Create the ConfigMap/Secret first, then create the Deployment.

# Confirm the ConfigMap/Secret exists
kubectl get configmap -n production
kubectl get secret -n production
 
# Confirm the name and namespace match what the Pod references
kubectl describe pod api-server-xxx | grep -A5 "Environment"
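Since the usual culprit is a name or namespace that does not line up, a tiny check per referenced object can save some squinting. A sketch, with require_object as a hypothetical name:

```shell
# Report whether a ConfigMap or Secret exists in the given namespace.
# Usage: require_object <kind> <name> <namespace>
require_object() {
  if kubectl get "$1" "$2" -n "$3" >/dev/null 2>&1; then
    echo "ok: $1/$2 exists in $3"
  else
    echo "missing: $1/$2 in namespace $3"
  fi
}

# Usage (not run here):
#   require_object configmap app-config production
#   require_object secret db-secret production
```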

kubectl debugging toolbox

1. describe: full Pod status and events

kubectl describe pod <pod-name>
# Focus on the Events block at the very bottom

2. logs: app logs

# Current log
kubectl logs <pod-name>
 
# Log of the previous container instance (from before the crash)
kubectl logs <pod-name> --previous
 
# Follow the log
kubectl logs <pod-name> -f
 
# Pick a container in a multi-container Pod
kubectl logs <pod-name> -c <container-name>
 
# Logs via a Deployment (kubectl picks one of its Pods, not all of them)
kubectl logs deployment/api-server --all-containers

3. exec: look around inside the container

# Open a shell
kubectl exec -it <pod-name> -- /bin/sh
 
# Run a command directly
kubectl exec <pod-name> -- env
kubectl exec <pod-name> -- cat /etc/resolv.conf
kubectl exec <pod-name> -- wget -qO- http://api-service/healthz

4. port-forward: reach a Pod/Service directly from your machine

# Forward local port 8080 to port 8080 of the Pod
kubectl port-forward pod/<pod-name> 8080:8080
 
# Or forward straight to a Service
kubectl port-forward svc/api-service 8080:80

5. top: resource usage

# Node usage (kubectl top requires metrics-server)
kubectl top nodes
 
# Pod usage
kubectl top pods -n production
kubectl top pods -n production --sort-by=memory
kubectl top pods -n production --containers

6. get events: cluster events

# All events, sorted by time
kubectl get events --sort-by='.lastTimestamp'
 
# A single namespace only
kubectl get events -n production --sort-by='.lastTimestamp'
 
# Warnings only
kubectl get events --field-selector type=Warning

A systematic debug flow

When something breaks, do not panic. Walk through this flow:

1. kubectl get pods -n <namespace>
   → What is the Pod status? (Pending? CrashLoopBackOff? ImagePullBackOff?)

2. kubectl get events -n <namespace> --sort-by='.lastTimestamp'
   → See what happened recently

3. kubectl describe pod <pod-name>
   → Read the Events block, which usually states the cause directly

4. kubectl logs <pod-name> --previous
   → For CrashLoopBackOff, read the log from before the crash

5. kubectl exec -it <pod-name> -- /bin/sh
   → If the Pod is alive but misbehaving, go in and check env vars, files, and network

6. kubectl top pods / kubectl top nodes
   → If you suspect a resource problem
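The branching in that flow can be sketched as a lookup from the STATUS column to the next command worth running. This is a simplification under my own assumptions (next_step is a made-up name, and real incidents often need more than one step):

```shell
# Map a Pod status (STATUS column of kubectl get pods) to the next command to try.
next_step() {
  status=$1
  pod=$2
  case "$status" in
    Pending|ImagePullBackOff|ErrImagePull|CreateContainerConfigError)
      echo "kubectl describe pod $pod" ;;
    CrashLoopBackOff)
      echo "kubectl logs $pod --previous" ;;
    Running)
      echo "kubectl exec -it $pod -- /bin/sh" ;;
    *)
      echo "kubectl get events --sort-by='.lastTimestamp'" ;;
  esac
}

next_step CrashLoopBackOff api-server-xxx   # prints: kubectl logs api-server-xxx --previous
```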

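Quick checklist

# One line for every Pod whose phase is not Running or Succeeded
# (CrashLoopBackOff Pods usually still report phase Running, so check restarts too)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded
 
# Pods that have restarted more than 3 times (RESTARTS is column 5 with --all-namespaces)
kubectl get pods --all-namespaces -o wide | awk 'NR == 1 || $5 + 0 > 3'
 
# Nodes that have problems
kubectl get nodes -o wide
kubectl describe node <node-name> | grep -A5 "Conditions"
 
# Check that DNS works
kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -- nslookup api-service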
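The restart filter is easy to get subtly wrong, so here is the same idea as a reusable function with the threshold as a parameter. It is a sketch that assumes the default column layout of kubectl get pods --all-namespaces; restarting_pods is a hypothetical name.

```shell
# Keep the header plus rows whose RESTARTS value exceeds a threshold.
# Expects `kubectl get pods --all-namespaces` output on stdin, where
# RESTARTS is column 5 (it would be column 4 without the NAMESPACE column).
restarting_pods() {
  awk -v min="${1:-3}" 'NR == 1 || $5 + 0 > min'
}

# Usage (not run here):
#   kubectl get pods --all-namespaces | restarting_pods 3
```

The `+ 0` forces a numeric comparison, and `NR == 1` keeps the header row so the output stays readable.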

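Real-world case: pitfalls from load testing on k3s

In the horizontal scaling load test, we ran the test on k3s and hit several typical problems:

Pod Pending: the single k3s Node ran out of resources and could not schedule 4 replicas → switched to 2 replicas and lowered requests
Connection pool exhaustion: the DB connection pool blew up after scaling out → not a K8s problem, an app configuration problem
coredns went down: load-test traffic was too heavy and coredns ran out of memory → raised the coredns resources

Lesson: 80% of K8s problems are not K8s problems. They are app problems that K8s amplifies.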

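Common pitfalls

kubectl logs reports "no previous log": the container died on its very first start, so there is no previous instance for --previous to show. Use describe and read the Events.

exec fails because the container keeps crashing: add command: ["sleep", "infinity"] to the Pod spec so it skips the original entrypoint, get in and debug, then change it back once things are fixed.

port-forward is slow or drops connections: this is a known issue, and it is not suitable as a production access path. Use it for debugging only.

Events are kept for only 1 hour: by default K8s retains events for 1 hour (the kube-apiserver --event-ttl default). For long-term retention, ship events to Loki or Elasticsearch.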
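For the sleep infinity trick, one way to apply it without editing the manifest by hand is a JSON patch on the Deployment. This is a sketch under assumptions: the function name is made up, it targets the first container only, and the image must ship a sleep binary that accepts infinity.

```shell
# Hypothetical helper: replace the first container's command with "sleep infinity"
# so the container stays up long enough to exec into it.
# Usage: hold_container <deployment-name>
hold_container() {
  kubectl patch deployment "$1" --type=json \
    -p '[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["sleep","infinity"]}]'
}

# Usage (not run here):
#   hold_container api-server
#   kubectl exec -it <pod-name> -- /bin/sh
#   kubectl rollout undo deployment/api-server   # restore the previous spec
```

Keep in mind that a liveness probe will still fail while the real app is not running, so the kubelet may keep restarting the container until the probe is removed or relaxed for the debugging session.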

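K8s error messages are actually clearer than those of most systems. The Events block of kubectl describe almost always tells you the cause directly. The catch is that most people do not know to look at Events. Now you do.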


Terry Yao's Blog
