I05 · Observability Platform Detailed ROADMAP

Planning document; not rendered by Quartz. Back to the main roadmap → infra/ROADMAP.md

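Chapter goals

Observability is a cross-cutting concern that spans the whole request flow: every layer from I01 Edge to I04 Data has to be observable. This chapter covers deploying and operating the observability platform itself: Prometheus / Loki / Tempo / Grafana / AlertManager / OpenTelemetry Collector.

Division of labor with backend: backend B17 covers "how to instrument metrics and write structured logs in the app"; this infra chapter covers "how to stand up Prometheus / Loki, plus scaling, HA, cost control, and multi-tenancy".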

🌱 Basic introduction

#	Topic	Slug	Stage	Outline
01	Observability platform landscape	`01-observability-landscape`	🌱	The three pillars (Logs / Metrics / Traces) plus the newer Events / Profiles; stack comparison (Prom+Grafana / Datadog / New Relic)

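❓ Why it is needed

#	Topic	Slug	Stage	Outline
02	Why Prometheus instead of Nagios	`02-why-prometheus-over-nagios`	🌱	Pull-based model suits cloud-native environments; time-series data format; K8s integration
03	Why log aggregation instead of ssh + grep	`03-why-log-aggregation`	🌱	The pain of hunting for logs across nodes in a distributed system
04	Why tracing became necessary in the microservices era	`04-why-distributed-trace`	🌱	A request crosses N services; without the full path you cannot debug it; ties in with B17 #19
05	Why observability costs blow up	`05-why-observability-cost`	🌱	Cardinality explosion, no log retention configured, unfiltered metrics, every team instrumenting the same things separately; needs centralized governance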
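
As a rough illustration of the cardinality item in topic 05: the number of time series one metric can produce is bounded by the product of the distinct values of each label. The sketch below only does that arithmetic; the label names and counts are hypothetical, not measurements from any real system.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each distinct combination of label values is its own series, so the
    bound is the product of the number of distinct values per label.
    """
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single HTTP request counter.
bounded = {"service": 20, "endpoint": 50, "status": 5}
with_user_id = {**bounded, "user_id": 100_000}

print(series_upper_bound(bounded))       # 5000
print(series_upper_bound(with_user_id))  # 500000000
```

Adding a single unbounded label such as a per-user ID multiplies the bound by the number of users, which is why it shows up again in the anti-pattern list.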

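🕰️ Evolution

#	Topic	Slug	Stage	Outline
06	Metrics evolution	⛔️ `infra/observability/15-metrics-monitoring`	🌿	Cross-series
07	Log management evolution	⛔️ `infra/observability/16-log-management`	🌿	Cross-series
08	Drivers of observability evolution	`08-observability-evolution-drivers`	🌱	Nagios / Zabbix polling of statically configured hosts hit a wall with dynamic cloud-native workloads → Prometheus pull with service discovery (push via Pushgateway for short-lived jobs); ELK hit a cost wall → Loki + LogQL; Jaeger / Zipkin hit a wall of incompatible formats → OpenTelemetry; sampling hit a wall of never seeing the full picture → eBPF / Continuous Profiling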

🧠 Knowledge topics

F05-A Metrics platform

#	Topic	Slug	Stage	Outline
09	Prometheus deployment	⛔️ `infra/observability/15-metrics-monitoring`	🌿	Cross-series
10	Prometheus HA (Thanos / Mimir / Cortex)	`10-prometheus-ha`	🌱	Remote storage; long-term retention; cross-region queries
11	Grafana deployment and governance	`11-grafana-ops`	🌱	Dashboards as code, team / folder permissions, plugin management, the Grafana Cloud option
12	K8s Monitoring (absorbs k8s/06)	⛔️ `infra/k8s/06-k8s-monitoring`	🌿	Cross-series
13	Multi-node monitoring	⛔️ `infra/observability/19-multi-node-monitoring`	🌿	Cross-series

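F05-B Log platform

#	Topic	Slug	Stage	Outline
14	Log management (Loki / ELK)	⛔️ `infra/observability/16-log-management`	🌿	Cross-series
15	Loki vs ELK selection	`15-loki-vs-elk`	🌱	Differences in indexing strategy, cost model, and query capability; how to choose given the team's current situation
16	Log cost control	`16-log-cost-control`	🌱	Tiered retention, sampling, log-level filtering, CloudWatch-to-S3 tiering
17	Microservice logging in practice (absorbs micro-service)	⛔️ `backend/micro-service/44-observability-logging`	🌿	Cross-series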

F05-C Trace platform

#	Topic	Slug	Stage	Outline
18	OpenTelemetry Collector deployment	`18-otel-collector-ops`	🌱	Agent / gateway deployment modes; receiver / processor / exporter configuration
19	Tempo / Jaeger selection	`19-tempo-vs-jaeger`	🌱	Long-term retention, sampling strategy, Grafana integration
20	Trace sampling strategy	`20-trace-sampling`	🌱	Head vs tail sampling; the trade-off between cost and completeness
21	Microservice tracing in practice	⛔️ `backend/micro-service/43-observability-tracing`	🌿	Cross-series
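
To make the head vs tail distinction in topic 20 concrete, here is a minimal sketch of the two decision points. The `Trace` type, function names, and thresholds are hypothetical and exist only for illustration; in a real deployment the tail decision is usually made in a collector tier that buffers complete traces, not in application code.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    has_error: bool


def head_sample(trace_id: int, keep_per_10k: int) -> bool:
    """Head sampling: decide when the request starts, before the outcome is known.

    The decision depends only on trace_id, so every service that sees the
    same trace_id reaches the same verdict.
    """
    return trace_id % 10_000 < keep_per_10k


def tail_sample(trace: Trace, slow_ms: float, baseline_ratio: float,
                rng: random.Random) -> bool:
    """Tail sampling: decide after the trace has completed.

    Keep every error and every slow trace, plus a random baseline of the
    remaining traces.
    """
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return rng.random() < baseline_ratio
```

Head sampling is cheap because nothing is buffered, but it drops errors and slow requests at the same rate as everything else. Tail sampling keeps the interesting traces at the price of holding every span until the trace finishes.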

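F05-D Alert / On-call / Ticket

Monitoring → alerting → ticket is a three-stage relay: observability spots the anomaly, the alert draws attention, and the ticket tracks the handling. This section explains how the three stages fit together; it is not a tutorial on any single tool.

#	Topic	Slug	Stage	Outline
22	Alert design (absorbs the original infra material)	⛔️ `infra/observability/17-alerts-chatops`	🌿	Cross-series
23	Alert webhook integration	⛔️ `infra/observability/18-alert-webhook-integration`	🌿	Cross-series
24	AlertManager deep dive	`24-alertmanager-deep`	🌱	Routing tree / inhibition / silences / grouping; integration with PagerDuty / Opsgenie
25	SLO / error budget implementation	`25-slo-error-budget`	🌱	Google SRE book; choosing SLIs; cross-team consensus
26	Monitoring → alerting → ticket integration chain	`26-monitor-alert-ticket-chain`	🌱	Handoff design across the three stages: Prometheus → AlertManager → PagerDuty → Jira / Linear / GitHub Issues; auto-ticketing rules; for issue tracker selection see the Issue Tracker selection article under `management/project-mgmt/`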
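
As a small worked example for topic 25, an error budget is the complement of the SLO target applied to a time window or a request count. The function names and the sample numbers below are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based SLO tolerates within the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% or zero traffic leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))

# With 1,000,000 requests at 99.9%, about 1,000 failures are allowed;
# 250 failures so far leaves roughly three quarters of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A number like this gives teams a shared figure to negotiate around, which is the "cross-team consensus" part of the outline.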

F05-E Dashboard and APM

#	Topic	Slug	Stage	Outline
27	Dashboard design principles	`27-dashboard-design`	🌱	Service overview / investigation / executive; aligning with the RED / USE methods
28	Grafana Unified Observability	`28-grafana-unified`	🌱	Grafana as a single pane of glass: Prom (metrics) + Loki (logs) + Tempo (traces) + Pyroscope (profiles) + alerts all in one UI; Explore view, correlation, unified query language strategy
29	Microservice dashboards in practice	⛔️ `backend/micro-service/45-observability-dashboard`	🌿	Cross-series
30	APM selection (absorbs backend B17 #25)	`30-apm-selection`	🌱	Datadog / New Relic / Elastic APM / Pinpoint (open source from Naver, common in Taiwanese and Japanese enterprises) / Pyroscope / SkyWalking; self-hosted vs SaaS; compare with `infra/cloud/aws/07 CloudWatch`
31	Continuous Profiling (absorbs backend B17 #26)	`31-continuous-profiling-ops`	🌱	Pyroscope / Parca deployment; profile data storage; cost
32	eBPF observability (absorbs backend B17 S01 + B04 #26)	`32-ebpf-observability`	🌱	Pixie / Cilium Hubble / Beyla; automatic instrumentation; the bar for production use

🔧 Small hands-on notes

#	Topic	Slug	Stage	Outline
33	Full observability stack on a local machine	`33-local-observability-stack`	🌱	Prom + Grafana + Loki + Tempo + OTel Collector + Pyroscope in a single compose setup; an instrumented sample app
34	Grafana-as-code in practice	`34-grafana-as-code`	🌱	Dashboard JSON under version control; Grafonnet / Terraform provider
35	Alert → ticket automation demo	`35-alert-to-ticket-demo`	🌱	Wiring up AlertManager → PagerDuty → Linear API / GitHub Issue

💣 Anti-pattern

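#	Topic	Slug	Stage	Outline
36	Observability platform anti-patterns	`36-observability-antipatterns`	🌱	Watching only CPU/memory and no app metrics; metric cardinality explosion (one series per user); unlimited log retention; too many alerts (everything drowns); no runbook; piles of dashboards nobody looks at; OTel Collector as a single point of failure; alerts firing with no ticket to track them; tickets opened without a runbook link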

🧰 Corresponding tooling

#	Topic	Slug	Stage	Outline
37	Observability platform tooling	`37-observability-platform-tooling`	🌱	Metrics: Prometheus / Grafana / VictoriaMetrics; Logs: Loki / Elasticsearch; Traces: Tempo / Jaeger; APM: Datadog / New Relic / Pinpoint / SkyWalking / Elastic APM; Profiling: Pyroscope / Parca; Alert: AlertManager / PagerDuty / Opsgenie; SaaS: Datadog / Grafana Cloud / Honeycomb

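📎 Supplementary

#	Topic	Slug	Stage	Outline
S01	Multi-tenant observability platform	`s01-multi-tenant-observability`	🌱	Prom / Loki shared across teams; namespace isolation; cost allocation
S02	A good monitoring system	⛔️ `common/quality/standards/07-good-monitoring-system`	🌿	Cross-series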

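Chapter progress statistics

Knowledge topics: 37 + 2 supplementary = 39 items (3 topics added in 2026-04: monitoring → alerting → ticket chain, Grafana Unified Observability, alert → ticket automation demo)
🌿 growing: 10 (existing infra/ + k8s + pointers)
🌱 seed: 29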

Cross-series links

→ infra/observability/15-19 (original metrics / log / alert material)
→ infra/k8s/06-k8s-monitoring
→ backend/observability/ B17 25-26 / S01 (already pointing to #30-32 in this chapter)
→ backend/architecture/ B08 #41 Monitoring Per Service (already a pointer)
→ backend/os/ B04 #26 eBPF (already pointing to #32 in this chapter)
→ infra/disaster-recovery/ I09 #26 DR event closed-loop workflow (alert → ticket → runbook → case → postmortem)
→ management/project-mgmt/ Issue Tracker selection (alert → ticket flow)
→ backend/micro-service/43-45 (tracing / logging / dashboard in practice)
→ common/quality/standards/07-good-monitoring-system
→ I01-I04 are each observed → I05 is the cross-cutting concern
→ I09 DR (DR for the observability platform itself)

Terry Yao's Blog
