[infra] Multi-Host Monitoring and Service Discovery: Observability Across Nodes


Multi-Host Monitoring and Service Discovery: When Single-Machine Monitoring Is No Longer Enough

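When your infrastructure is a single machine, running Prometheus + Grafana with one node_exporter on it is enough to track CPU, memory, disk, and network. In practice, as services grow, the Web Server, API Server, Database, Background Worker, and Cache each end up on their own machine. One machine becomes three, three become ten, ten become fifty. That raises several questions: How does Prometheus know which machines to scrape? When you add a machine, do you have to edit prometheus.yml by hand and reload? If node_exporter dies on one machine, who finds out? How do you combine Prometheus servers across regions?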

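This article covers the full architecture for multi-host monitoring: deploying node_exporter on every machine, using Service Discovery so Prometheus picks up new targets automatically, running external health checks with Blackbox Exporter, and applying Prometheus Federation in multi-cluster setups.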

Architecture Overview

flowchart TD
  A[Node A\nnode_exporter] -->|scrape| Prom[Prometheus\ncentral collector]
  B[Node B\nnode_exporter] -->|scrape| Prom
  C[Node C\nnode_exporter] -->|scrape| Prom
  SD[Service Discovery\ntargets.json] -->|auto-discovery| Prom
  BB[Blackbox Exporter\nexternal probes] -->|health check results| Prom
  Prom -->|query| Grafana[Grafana\nmulti-node dashboards]
  Prom -->|alerts| AM[Alertmanager\nnotifications]

Detailed Architecture

flowchart TD
  subgraph HostA["Host A (Web Server)"]
    NE_A[node_exporter\n:9100]
    App_A[nginx_exporter\n:9113]
  end

  subgraph HostB["Host B (API Server)"]
    NE_B[node_exporter\n:9100]
    App_B[app /metrics\n:8000]
  end

  subgraph HostC["Host C (Database)"]
    NE_C[node_exporter\n:9100]
    PG_C[postgres_exporter\n:9187]
  end

  subgraph HostD["Host D (Worker)"]
    NE_D[node_exporter\n:9100]
    Worker_D[worker /metrics\n:8001]
  end

  subgraph Monitoring["Monitoring Stack"]
    Prom[Prometheus\n:9090]
    Grafana[Grafana\n:3000]
    AM[Alertmanager\n:9093]
    BB[Blackbox Exporter\n:9115]
    SD[targets.json\nFile SD]
  end

  SD -->|file_sd_configs| Prom
  NE_A -->|scrape| Prom
  NE_B -->|scrape| Prom
  NE_C -->|scrape| Prom
  NE_D -->|scrape| Prom
  App_A -->|scrape| Prom
  App_B -->|scrape| Prom
  PG_C -->|scrape| Prom
  Worker_D -->|scrape| Prom

  BB -->|probe HTTP/TCP/ICMP| HostA
  BB -->|probe HTTP/TCP/ICMP| HostB
  Prom -->|scrape probe results| BB

  Prom -->|query| Grafana
  Prom -->|evaluate rules| AM
  AM -->|notify| Slack[Slack / Discord]
  Grafana --> Dashboard[Dashboard\nNode Overview\nID 1860]

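Prometheus reads targets.json through file_sd_configs and discovers every target it needs to scrape. Each host runs a node_exporter for OS-level metrics, and each application has its own exporter or /metrics endpoint. Blackbox Exporter sends HTTP/TCP/ICMP probes from the outside to confirm that services are reachable. Alertmanager handles alert routing and notification.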

Core Concepts

node_exporter: Baseline Metrics for Every Machine

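node_exporter is the exporter maintained by the Prometheus project for host-level metrics on Linux/Unix. It exposes a large set of system metrics at :9100/metrics, including:
- CPU: node_cpu_seconds_total (time each CPU core spends in each mode; use rate() to derive utilization)
- Memory: node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes (used to compute utilization)
- Disk: node_filesystem_avail_bytes, node_filesystem_size_bytes (usage per mountpoint), node_disk_io_time_seconds_total (disk I/O load)
- Network: node_network_receive_bytes_total, node_network_transmit_bytes_total (traffic per network interface)
- System load: node_load1, node_load5, node_load15 (1/5/15-minute load average)

Key principle: install node_exporter on every machine you manage. Whether the machine runs a Web Server, a Database, or a Background Worker, node_exporter collects OS-level metrics that are independent of the application on top. If you have 10 machines, you should have 10 node_exporters running.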
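To make the utilization formulas concrete, here is a minimal Python sketch of the arithmetic behind the CPU and memory metrics listed above. It is a simplification under stated assumptions: the function names are hypothetical, and the CPU part uses a plain counter delta over a window, whereas PromQL's rate() also handles counter resets and extrapolation.

```python
def cpu_utilization_percent(idle_start, idle_end, elapsed_seconds):
    """Approximate 100 - avg(rate(node_cpu_seconds_total{mode="idle"})) * 100.

    idle_start / idle_end map a CPU core label to its idle-seconds counter
    at the start and at the end of the window.
    """
    # Per-core idle rate: idle seconds gained per wall-clock second (0..1).
    idle_rates = [
        (idle_end[core] - idle_start[core]) / elapsed_seconds
        for core in idle_start
    ]
    average_idle = sum(idle_rates) / len(idle_rates)
    return 100.0 * (1.0 - average_idle)


def memory_utilization_percent(mem_available_bytes, mem_total_bytes):
    """Compute (1 - MemAvailable / MemTotal) * 100."""
    return 100.0 * (1.0 - mem_available_bytes / mem_total_bytes)


# Two cores sampled 60 seconds apart (made-up counter values).
start = {"0": 1000.0, "1": 2000.0}
end = {"0": 1030.0, "1": 2015.0}
print(cpu_utilization_percent(start, end, 60.0))              # 62.5
print(memory_utilization_percent(2 * 1024**3, 8 * 1024**3))   # 75.0
```

Core 0 was idle for half the window and core 1 for a quarter of it, so the average idle fraction is 0.375 and utilization is 62.5%.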

Service Discovery: Discovering Monitoring Targets Automatically

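With only a few machines (1 to 3), listing each target's IP and port under static_configs in prometheus.yml is enough. Once the count grows, maintaining the target list by hand becomes a burden: adding a machine means editing the config and reloading Prometheus, a forgotten entry means no monitoring, and a machine that goes offline can go unnoticed.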

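Prometheus supports several Service Discovery mechanisms:
- static_configs: the simplest option; the target list is hard-coded in prometheus.yml. Suited to a small number of machines that rarely change.
- file_sd_configs: Prometheus watches a JSON or YAML file (for example targets.json) and picks up changes automatically, with no Prometheus restart. This is the most practical option: you can generate targets.json with Ansible, Terraform, or a script, and Prometheus discovers the new targets on its own.
- consul_sd_configs: Prometheus queries Consul for registered services and adds them to the scrape list. Suited to environments that already use Consul for service registration.
- dns_sd_configs: discovers targets through DNS records such as SRV. Suited to Kubernetes or other environments with DNS-based service discovery.
- kubernetes_sd_configs: built for Kubernetes; discovers Pods, Services, Nodes, and Endpoints automatically.

For environments with 10 or more machines, use at least file_sd_configs. Editing prometheus.yml by hand is error-prone and requires a Prometheus reload every time. With file_sd_configs you only update the JSON file, and Prometheus detects the change and reloads the targets.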

Prometheus Federation: Aggregating Metrics Across Clusters

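When your infrastructure spans several regions or clusters, having a single Prometheus scrape every target is impractical, because network latency and bandwidth both become problems. This is where Prometheus Federation helps:
- Each region or cluster has its own Prometheus (a leaf Prometheus) that scrapes all targets in that region
- A Global Prometheus pulls aggregated metrics from each leaf Prometheus through the /federate endpoint
- The Global Prometheus pulls only aggregated metrics (for example avg, sum) rather than all raw data, which saves bandwidth and storage
- Grafana connects to the Global Prometheus to show an overview of all regions in one dashboard

Alternatives to Federation are Thanos and Cortex, which offer a more complete multi-cluster metrics solution at the cost of higher architectural complexity. With only 2 or 3 regions, Federation is usually enough.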
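As an illustration of what the Global Prometheus actually requests, the sketch below builds a /federate URL with one match[] parameter per selector. The helper name federate_url and the host prometheus-a.internal:9090 are assumptions for the example; only the /federate path and the match[] parameter come from Prometheus itself.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def federate_url(host, selectors, scheme="http"):
    """Build the URL a federating Prometheus would scrape on a leaf."""
    # doseq=True emits a separate match[] parameter for each selector.
    query = urlencode({"match[]": selectors}, doseq=True)
    return urlunsplit((scheme, host, "/federate", query, ""))


selectors = ['{__name__=~"job:.*"}', '{job="node"}']
url = federate_url("prometheus-a.internal:9090", selectors)
print(url)

# Decoding the query string gives back the original selectors.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Fetching such a URL with curl against a leaf Prometheus is a quick way to see how many series a given set of selectors would pull before wiring it into a scrape job.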

Uptime Monitoring: External Reachability Checks

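node_exporter and application exporters collect internal metrics: the service's own view that it is alive. Seen from outside, the service can still be unreachable because of a firewall, DNS, a reverse proxy, or a load balancer. You therefore also need to probe from the outside to confirm the service is really available.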

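Blackbox Exporter is the external probing tool provided by the Prometheus project. It supports the following probes:
- HTTP probe: sends an HTTP/HTTPS request and checks the response status code (2xx), response time, and TLS certificate expiry
- TCP probe: checks whether a TCP port accepts connections (for example PostgreSQL on port 5432)
- ICMP probe: pings the host to check reachability
- DNS probe: checks that DNS resolution works

Uptime Kuma is another common choice. It is a lightweight uptime monitoring tool with its own Web UI, supports HTTP/TCP/DNS/ping and other monitor types, and can notify Slack/Discord/Telegram. If you do not want to integrate with the Prometheus ecosystem, Uptime Kuma is a solid standalone alternative.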
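To show what a TCP probe boils down to, here is a minimal Python sketch: attempt a connection within a timeout and report success plus elapsed time. This is an illustration only, not Blackbox Exporter's implementation; the function name tcp_probe and its return shape are assumptions.

```python
import socket
import time


def tcp_probe(host, port, timeout=5.0):
    """Return (success, seconds_elapsed) for a plain TCP connect attempt."""
    started = time.monotonic()
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            success = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        success = False
    return success, time.monotonic() - started
```

Calling tcp_probe("192.168.1.12", 5432) from the monitoring host would check the PostgreSQL port in the same spirit as the tcp_connect module, with the result corresponding roughly to probe_success and probe_duration_seconds.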

Health Check Patterns: Designing the /health Endpoint

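Modern services should expose a health check endpoint so that monitoring systems and load balancers know the state of the service. Two patterns are common:
- Liveness Probe: answers "is this process still alive?". A failed liveness check means the process is stuck and should be restarted. It is usually a trivial GET /healthz that returns 200 OK.
- Readiness Probe: answers "can this service handle requests?". Even when the process is alive, readiness should fail if the database is unreachable, the cache has not finished warming up, or a graceful shutdown is in progress, so that the load balancer stops sending traffic for a while.

A good /health endpoint checks its critical dependencies (database connection, cache connection) and returns a structured JSON response: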
```
{
  "status": "healthy",
  "checks": {
    "database": { "status": "up", "latency_ms": 2 },
    "redis": { "status": "up", "latency_ms": 1 },
    "disk": { "status": "up", "free_gb": 42.3 }
  },
  "version": "1.5.2",
  "uptime_seconds": 86400
}
```
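A response like this can be assembled from individual dependency checks. The sketch below is one possible way to do it, not a prescribed design: the function name, the "unhealthy" status string, and the choice of HTTP 503 for a failed check are assumptions made for illustration.

```python
import json


def build_health_response(checks, version, uptime_seconds):
    """Return (http_status, json_body) for a /health endpoint.

    checks maps a dependency name to a dict with at least a "status" key
    ("up" or "down"), matching the shape of the example response.
    """
    healthy = all(check.get("status") == "up" for check in checks.values())
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "checks": checks,
        "version": version,
        "uptime_seconds": uptime_seconds,
    }
    # A non-2xx status lets load balancers and HTTP probes detect the failure
    # without parsing the body.
    return (200 if healthy else 503), json.dumps(body)


status, body = build_health_response(
    {
        "database": {"status": "up", "latency_ms": 2},
        "redis": {"status": "down", "error": "connection refused"},
    },
    version="1.5.2",
    uptime_seconds=86400,
)
print(status, body)
```

Returning a non-2xx status matters for the Blackbox http_2xx module, which judges the endpoint by status code rather than by the JSON content.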

Grafana Node Dashboard: Host Dashboards at a Glance

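The Grafana community has a classic dashboard: Node Exporter Full (Dashboard ID: 1860). After importing it, a single screen shows each host's CPU usage, memory usage, disk I/O, network traffic, system load, and many detailed hardware metrics.

To import it: Grafana UI → Dashboards → Import → enter ID 1860 → select the Prometheus data source → Import.

With multiple machines, the instance dropdown at the top of the dashboard switches between hosts. It pairs well with an Overview dashboard that uses a table panel to list the key metrics of every host (CPU%, RAM%, Disk%, Uptime), so you can see at a glance which machine needs attention.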

Implementation and Configuration Examples

Multi-Host Monitoring Deployment with docker-compose

Deploy Prometheus, Grafana, Alertmanager, and Blackbox Exporter on your monitoring host. Each monitored machine only needs to run node_exporter.

# docker-compose.monitoring.yml
version: "3.8"
 
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      - ./targets/:/etc/prometheus/targets/
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=15GB'
      - '--web.enable-lifecycle'
 
  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
 
  alertmanager:
    image: prom/alertmanager:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
 
  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:9115:9115"
    volumes:
      - ./blackbox.yml:/config/blackbox.yml
    command:
      - '--config.file=/config/blackbox.yml'
 
  # node_exporter for the monitoring host itself
  node-exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
 
volumes:
  prometheus-data:
  grafana-data:

On the other monitored machines, you only need to deploy node_exporter:

# docker-compose.node-exporter.yml (deploy on every monitored machine)
version: "3.8"
 
services:
  node-exporter:
    image: prom/node-exporter:latest
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'

prometheus.yml: Discovering Targets Dynamically with file_sd_configs

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
 
rule_files:
  - "alert-rules.yml"
 
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
 
scrape_configs:
  # Prometheus's own metrics
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # node_exporter on every host (via file_sd_configs)
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/nodes.json'
        refresh_interval: 30s

  # Application services (via file_sd_configs)
  - job_name: 'applications'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/applications.json'
        refresh_interval: 30s
 
  # PostgreSQL exporter
  - job_name: 'postgresql'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/postgresql.json'
        refresh_interval: 30s
 
  # Blackbox HTTP probe
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/blackbox-http.json'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
 
  # Blackbox TCP probe
  - job_name: 'blackbox-tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/blackbox-tcp.json'
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
 
  # Prometheus Federation (pull aggregated metrics from the Prometheus in another region)
  # - job_name: 'federation-region-b'
  #   honor_labels: true
  #   metrics_path: '/federate'
  #   params:
  #     'match[]':
  #       - '{job="node"}'
  #       - '{job="applications"}'
  #       - '{__name__=~"job:.*"}'
  #   static_configs:
  #     - targets: ['prometheus-region-b.internal:9090']
  #       labels:
  #         region: 'region-b'

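The three relabel_configs rules in the blackbox jobs are easy to misread, so here is a rough plain-Python simulation of what they do to one discovered target. It is an illustration under simplifying assumptions, not how Prometheus implements relabeling; the function names are hypothetical.

```python
from urllib.parse import urlencode


def apply_blackbox_relabeling(discovered_address,
                              exporter_address="blackbox-exporter:9115"):
    """Simulate the three relabel rules on a single discovered target."""
    labels = {"__address__": discovered_address}
    # Rule 1: copy the discovered address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed URL as the instance label shown in queries.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the actual scrape to the Blackbox Exporter instead.
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, module="http_2xx"):
    """URL Prometheus ends up requesting, given metrics_path /probe."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


labels = apply_blackbox_relabeling("https://example.com")
print(labels["instance"])   # the probed URL stays visible as the instance
print(scrape_url(labels))   # the request itself goes to the exporter
```

The net effect is that the entries in blackbox-http.json are treated as probe arguments rather than as scrape addresses, while alerts and dashboards still identify each series by the probed URL.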
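targets.json: File-based Service Discovery Files

These JSON files live under ./targets/. Prometheus watches them for changes and also re-reads them every refresh_interval. Adding a machine only requires editing a JSON file, with no Prometheus restart.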

// targets/nodes.json
[
  {
    "targets": ["192.168.1.10:9100"],
    "labels": {
      "env": "production",
      "role": "web",
      "hostname": "web-01"
    }
  },
  {
    "targets": ["192.168.1.11:9100"],
    "labels": {
      "env": "production",
      "role": "api",
      "hostname": "api-01"
    }
  },
  {
    "targets": ["192.168.1.12:9100"],
    "labels": {
      "env": "production",
      "role": "database",
      "hostname": "db-01"
    }
  },
  {
    "targets": ["192.168.1.13:9100"],
    "labels": {
      "env": "production",
      "role": "worker",
      "hostname": "worker-01"
    }
  },
  {
    "targets": ["192.168.1.20:9100"],
    "labels": {
      "env": "staging",
      "role": "all-in-one",
      "hostname": "staging-01"
    }
  }
]

// targets/blackbox-http.json
[
  {
    "targets": [
      "https://example.com",
      "https://api.example.com/health",
      "https://staging.example.com"
    ],
    "labels": {
      "env": "production",
      "probe_type": "http"
    }
  }
]

// targets/applications.json
[
  {
    "targets": ["192.168.1.11:8000"],
    "labels": {
      "env": "production",
      "service": "api",
      "hostname": "api-01"
    }
  },
  {
    "targets": ["192.168.1.13:8001"],
    "labels": {
      "env": "production",
      "service": "worker",
      "hostname": "worker-01"
    }
  }
]

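Because a malformed target file can silently cost you targets, it can help to check a file before publishing it. The following is a minimal sketch of such a check, assuming the file_sd JSON shape used above (a list of groups, each with a targets list of strings and an optional labels object with string values). The function name is hypothetical, and it does not replicate every validation Prometheus performs. Note that the `// targets/...` lines above are filename labels for this article, not part of the files, since JSON has no comment syntax.

```python
import json


def validate_file_sd(text):
    """Return a list of problems found in a file_sd JSON document."""
    try:
        groups = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(groups, list):
        return ["top level must be a list of target groups"]

    problems = []
    for index, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"group {index}: must be an object")
            continue
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(
            isinstance(target, str) for target in targets
        ):
            problems.append(f"group {index}: 'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(
            isinstance(value, str) for value in labels.values()
        ):
            problems.append(f"group {index}: 'labels' must map names to strings")
    return problems


good = '[{"targets": ["192.168.1.10:9100"], "labels": {"env": "production"}}]'
print(validate_file_sd(good))                          # []
print(validate_file_sd('// targets/nodes.json\n[]'))   # one "invalid JSON" problem
```

Running a check like this in the script or CI job that generates the files catches mistakes such as a numeric label value or a stray comment line before Prometheus ever reads them.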
Blackbox Exporter Configuration

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200, 201, 204]
      method: GET
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      tls_config:
        insecure_skip_verify: false
 
  http_2xx_with_tls_check:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      method: GET
      fail_if_not_ssl: true
      tls_config:
        insecure_skip_verify: false
 
  tcp_connect:
    prober: tcp
    timeout: 5s
 
  icmp_ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"

Alert Rules for Multi-Host Monitoring

# alert-rules.yml
groups:
  - name: node-monitoring
    rules:
      # Host unreachable (node_exporter scrape failing)
      - alert: HostDown
        expr: up{job="node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.hostname }} ({{ $labels.instance }}) is down"
          description: "node_exporter has not been scrapable for more than 2 minutes; the host may be offline or the node_exporter service may have stopped."
          runbook_url: "https://wiki.internal/runbooks/host-down"

      # Disk space low (> 85% used)
      - alert: DiskSpaceHigh
        expr: >
          (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}
          / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"})) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage > 85% on {{ $labels.hostname }} ({{ $labels.instance }})"
          description: "Disk usage is above 85%, currently {{ $value | printf \"%.1f\" }}%. Clean up unneeded files or plan a capacity increase."

      # Disk space critical (> 95% used)
      - alert: DiskSpaceCritical
        expr: >
          (1 - (node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}
          / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"})) * 100 > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage > 95% on {{ $labels.hostname }} ({{ $labels.instance }})"
          description: "Disk usage is above 95% and the disk is about to fill up. Act immediately!"

      # Sustained high CPU (> 90%)
      - alert: HighCPU
        expr: >
          100 - (avg by(instance, hostname)
          (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage > 90% on {{ $labels.hostname }} for 10 minutes"
          description: "CPU usage has stayed above 90% for 10 minutes, currently {{ $value | printf \"%.1f\" }}%."

      # High memory usage (> 90%)
      - alert: HighMemory
        expr: >
          (1 - (node_memory_MemAvailable_bytes
          / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage > 90% on {{ $labels.hostname }} ({{ $labels.instance }})"
          description: "Memory usage is above 90%, currently {{ $value | printf \"%.1f\" }}%. Consider adding memory or investigating a memory leak."

      # High system load
      - alert: HighLoadAverage
        expr: node_load15 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Load average (15m) is very high on {{ $labels.hostname }}"
          description: "The 15-minute load average is more than 2x the number of CPU cores; the system may be overloaded."
 
  - name: blackbox-monitoring
    rules:
      # HTTP endpoint unreachable
      - alert: EndpointDown
        expr: probe_success{job="blackbox-http"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "HTTP endpoint {{ $labels.instance }} is down"
          description: "Blackbox Exporter cannot probe {{ $labels.instance }} successfully; the service may be unreachable."

      # TLS certificate expiring soon (within 14 days)
      - alert: TLSCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in {{ $value | printf \"%.0f\" }} days"
          description: "The TLS certificate expires within 14 days; renew it soon. See [[02-reverse-proxy-tls|TLS certificate management]]."
 
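      # Slow HTTP response (total probe duration, not only the transfer phase).
      # Probes are cut off at the module timeout (5s in blackbox.yml), so choose
      # a threshold below that timeout or raise the timeout.
      - alert: SlowHTTPResponse
        expr: probe_duration_seconds{job="blackbox-http"} > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow HTTP response from {{ $labels.instance }}"
          description: "The HTTP response time exceeds 5 seconds; there may be a performance problem."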

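Automating Target Management

Editing targets.json by hand is acceptable with a few machines, but beyond 10 machines it is better to generate the target files with automation. Common approaches:

- Ansible: while deploying node_exporter, use an Ansible template to generate targets.json and sync it to the Prometheus host. Every playbook run that adds or removes machines updates the target file.
- Terraform: if you manage cloud machines with Terraform, a local_file resource can generate targets.json from the Terraform state.
- Consul: if you have a Consul cluster, each service registers itself with Consul on startup and Prometheus discovers it through consul_sd_configs. This is the most immediate approach, but Consul is one more system to maintain.
- Simple script: a cron job that periodically scans a subnet for port 9100 (the node_exporter default) and writes the reachable targets to targets.json. Crude but effective.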
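Whichever tool produces the file, the core step is the same: turn an inventory into target groups and replace the file in one step so Prometheus never reads a half-written version. The sketch below shows that step in Python. The inventory shape (ip, env, role, hostname keys) and both function names are assumptions for illustration.

```python
import json
import os
import tempfile


def build_target_groups(inventory, port=9100):
    """Turn inventory records into file_sd target groups."""
    return [
        {
            "targets": [f"{host['ip']}:{port}"],
            "labels": {
                "env": host["env"],
                "role": host["role"],
                "hostname": host["hostname"],
            },
        }
        for host in inventory
    ]


def write_atomically(path, groups):
    """Write the JSON to a temp file, then rename it over the real file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem; its .tmp suffix keeps it out of the configured file paths.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
        handle.write("\n")
    # mkstemp creates the file as 0600; relax it so a Prometheus process
    # running as a different user can read the result.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

A cron job or deploy hook could call write_atomically("targets/nodes.json", build_target_groups(inventory)) after loading the inventory from wherever it is kept.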

Prometheus Federation in Practice

When you have several regions or several Kubernetes clusters, each with its own Prometheus, Federation can aggregate their metrics into one Global Prometheus.

# global-prometheus.yml (configuration for the Global Prometheus)
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  # Pull metrics from the Region A Prometheus
  - job_name: 'federation-region-a'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'             # note: matches every raw series of this job
        - '{job="applications"}'
        - '{__name__=~"job:.*"}'     # series produced by recording rules (aggregates)
    static_configs:
      - targets: ['prometheus-a.internal:9090']
        labels:
          region: 'asia-east'

  # Pull metrics from the Region B Prometheus
  - job_name: 'federation-region-b'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
        - '{job="applications"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ['prometheus-b.internal:9090']
        labels:
          region: 'us-west'

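Important notes on Federation:

- honor_labels: true is required. Without it, labels from the leaf Prometheus that conflict with the Global Prometheus's own (such as job and instance) are renamed to exported_job and exported_instance instead of being kept as-is
- match[] should name exactly the metrics you need. Do not use {__name__=~".+"} to pull everything, or the data volume will explode
- The Global Prometheus's scrape_interval is usually longer than the leaves' (for example 30s to 60s), because it only looks at aggregated metrics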

Common Problems and Risks

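- Too many targets causing scrape timeouts: with 100 machines, each node_exporter exposing a few thousand series, Prometheus has to finish every scrape within scrape_timeout (10s in the configuration above) on each scrape_interval (15s in the configuration above). On a slow network or with too many targets, scrapes start timing out and samples go missing. Remedies: raise scrape_timeout (it cannot exceed scrape_interval), lengthen scrape_interval (for example to 30s), or split the load across several Prometheus servers in a Federation setup. Watch scrape_duration_seconds and up to spot targets that are close to or past the timeout.
- node_exporter is missing and nobody knows: a new machine was created without node_exporter, or the node_exporter container died and did not restart. Prometheus sees up == 0, but without an alert rule nobody notices. Worse, if the target was never added to targets.json, Prometheus does not know the machine exists at all. To avoid this: (1) set a critical alert on up{job="node"} == 0; (2) update the target list as part of the Ansible deployment so every machine is monitored; (3) periodically compare the CMDB/inventory against the Prometheus target list to find gaps.
- Prometheus TSDB filling the disk: the more machines, the more metrics, and the longer the retention, the more the Prometheus time series database (TSDB) grows. Without a retention limit it eventually fills the disk. Ironically, once Prometheus itself is down there is no monitoring left. To avoid this: (1) set both --storage.tsdb.retention.time=30d and --storage.tsdb.retention.size=15GB as a double limit; (2) monitor TSDB size with the prometheus_tsdb_storage_blocks_bytes metric; (3) drop metrics you do not need with metric_relabel_configs.
- Network problems causing scrape failures: Prometheus must be able to reach the node_exporter port (9100 by default) on every machine. If the firewall rule is missing, the security group is not configured, or the machine sits in a different VPC/VLAN, the scrape fails. To avoid this: (1) plan the monitoring network up front so Prometheus can reach every target; (2) use node_exporter's --web.listen-address to listen only on a specific internal IP; (3) across network boundaries, consider a separate Prometheus per network zone, aggregated with Federation.
- Inconsistent labels making dashboards messy: when different people maintain targets.json for different machines, label names drift (some use hostname, some host, some node), and the Grafana dashboard variables can no longer filter correctly. To avoid this: define a label naming convention (for example always hostname, env, role) and use relabel_configs on the Prometheus side to normalize label names.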
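The inventory comparison suggested for catching machines without node_exporter can be as small as a set difference. The sketch below assumes the inventory is available as a list of IP addresses and that the target groups have already been loaded from a file such as nodes.json; the function name is hypothetical.

```python
def find_unmonitored(inventory_hosts, target_groups):
    """Return inventory hosts that appear in no file_sd target group."""
    monitored = {
        # Strip the ":port" suffix so "192.168.1.10:9100" matches the host.
        target.rsplit(":", 1)[0]
        for group in target_groups
        for target in group["targets"]
    }
    return sorted(set(inventory_hosts) - monitored)


inventory = ["192.168.1.10", "192.168.1.11", "192.168.1.14"]
groups = [
    {"targets": ["192.168.1.10:9100"], "labels": {"hostname": "web-01"}},
    {"targets": ["192.168.1.11:9100"], "labels": {"hostname": "api-01"}},
]
print(find_unmonitored(inventory, groups))   # ['192.168.1.14']
```

Run on a schedule, a non-empty result is a list of machines that exist in the inventory but that Prometheus has never been told about.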

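Advantages

- file_sd_configs lets you add or remove targets without restarting Prometheus, which keeps management flexible
- Running node_exporter on each machine gives complete OS-level metrics without any additional agent
- Blackbox Exporter adds an outside-in health check that covers the blind spots of internal metrics
- A Federation architecture scales monitoring horizontally, beyond the limits of a single Prometheus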

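Disadvantages / Limitations

- With many machines, targets.json maintenance has to be automated; manual management is error-prone
- Federation is suited only to pulling aggregated metrics; raw-data queries still have to go to each leaf Prometheus
- Blackbox Exporter only performs simple probes and cannot simulate a complete user flow
- node_exporter mainly supports Linux; Windows Server needs windows_exporter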

Terry Yao's Blog
