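B17 · Observability Detailed ROADMAP

Planning document; not rendered by Quartz. Back to the main roadmap → backend/ROADMAP.md

Chapter goals

"You can only govern what you can see." Backend observability rests on three pillars: Logs, Metrics, and Traces. This chapter covers structured logging, Prometheus, OpenTelemetry, APM, and alert and dashboard design. Proto runs the full structlog + Prometheus + Grafana + Loki + Alertmanager stack, and the infra series has a complete hands-on set in chapters 15–19.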

🌱 Introduction

#	Topic	Slug	Stage	Outline
01	What is observability	`01-what-is-observability`	🌱	The shift from monitoring to observability; the three pillars: Logs / Metrics / Traces

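❓ Why it is needed

#	Topic	Slug	Stage	Outline
02	Why `print` logging breaks down in production	`02-why-structured-log`	🌱	Free-text formatted output is hard to query; CloudWatch / Loki parsing fails; sensitive data leaks into logs; why structured logs are necessary
03	Why logs alone are not enough	`03-why-not-just-log`	🌱	Logs only show a single request; distributed systems need traces; performance problems need metrics; the three pillars complement each other
04	Why too many alerts equals no alerts	`04-why-alert-fatigue-matters`	🌱	PagerDuty fatigue; a 3 a.m. page that turns out to be a false positive; alert tiers (page vs notify)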
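
Rows 02 and 03 argue that free-text `print` output cannot be queried or filtered. A minimal sketch of the difference, using only the Python standard library rather than structlog (which Proto uses). `JsonFormatter`, `build_logger`, the `context` attribute, and the `orders` logger name are illustrative assumptions, not names from the project:

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # Timezone-aware timestamp so logs from several hosts can be aligned.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become attributes on the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, ensure_ascii=False)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    # Free text: a log backend has to guess where the order id and user id are.
    print("order 42 failed for user 7")
    # Structured: order_id and user_id are fields a query can filter on.
    build_logger().error("order failed", extra={"context": {"order_id": 42, "user_id": 7}})
```

With the second form, a log backend can filter on `order_id` or `level` directly instead of pattern-matching the message text.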

🕰️ Evolution

#	Topic	Slug	Stage	Outline
05	Evolution of observability tooling	`05-observability-evolution`	🌱	syslog → ELK → Prometheus (2012) → OpenTelemetry (2019, unified API) → eBPF-based observability (2022+)

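🧠 Knowledge

F17-A Logs

#	Topic	Slug	Stage	Outline
06	Structured log design	⛔️ `backend/express/generic-log` / `backend/conventions/` B07 #8	🌿	Cross-series
06-2	Microservice logging in practice	⛔️ `micro-service/44-observability-logging`	🌿	Cross-series
06-3	Why print / console.log is not enough (beginner basics)	`06-3-why-not-print`	🌱	No structure (cannot query or filter); no timestamp or level; pollutes stdout and mixes with framework logs; cannot be aligned across multiple instances; no log aggregation in production; why `logger.info()` is not a drop-in replacement for `print()` but a fundamentally different thing
06-4	Checklist: elements of a good log	`06-4-good-log-elements`	🌱	Required fields: timestamp (with time zone) / level / service / correlation_id / trace_id / user_id / action / context payload; nice-to-have fields: caller location / duration / db_query_count; counter-examples (fields never to log): password / token / full email
07	Using log levels correctly	`07-log-levels`	🌱	What DEBUG / INFO / WARN / ERROR / FATAL actually mean and when to use each
08	Sensitive data redaction	`08-log-redaction`	🌱	Redacting PII / passwords / tokens; do it at the middleware layer; regex-based redaction is unreliable
09	Log aggregation	⛔️ `infra/16-log-management`	🌿	Cross-series
10	Log cost control	`10-log-cost-control`	🌱	Over-logging makes the CloudWatch bill painful; sampling, TTL, tiered storage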
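
Rows 06-4 and 08 say that passwords, tokens, and full emails should never reach the log, and that the masking belongs in one layer rather than in scattered regexes. A minimal sketch of key-based redaction applied to a context payload before it is logged. The `SENSITIVE_KEYS` deny-list and the `redact` helper are hypothetical, and a real deny-list would need to match the project's own field names:

```python
from typing import Any

# Hypothetical deny-list; compared case-insensitively against dict keys.
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive keys masked.

    Recurses into dicts and lists so nested payloads are covered too.
    """
    if isinstance(value, dict):
        return {
            key: "***" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


if __name__ == "__main__":
    payload = {"user_id": 7, "password": "hunter2", "headers": [{"Token": "abc"}]}
    print(redact(payload))
```

Matching on field names only works when the log is structured, which is one more reason to settle the structured-log design first. It does not catch secrets embedded inside free-text values.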

F17-B Metrics

#	Topic	Slug	Stage	Outline
11	Metrics basics	⛔️ `infra/15-metrics-monitoring`	🌿	Cross-series
12	The four metric types	`12-four-metric-types`	🌱	How to use Counter / Gauge / Histogram / Summary
13	RED / USE / Golden Signals	`13-metric-methodologies`	🌱	RED (Rate / Errors / Duration) for services; USE (Utilization / Saturation / Errors) for resources; Google's Golden Signals (latency / traffic / errors / saturation)
14	Advanced Prometheus	`14-prometheus-advanced`	🌱	PromQL, recording rules, federation, high-cardinality pitfalls
15	Application-level metrics	`15-app-metrics`	🌱	prometheus-fastapi-instrumentator for FastAPI, Micrometer for Spring, custom metric design
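
Row 12 lists the metric types. The sketch below illustrates the semantics of three of them in plain Python. It is not the `prometheus_client` API, and the class and method names are chosen for illustration only. Summary is left out because it needs a quantile estimator:

```python
import bisect
import math


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """Value that can go up and down, e.g. in-flight requests."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets and exposes them cumulatively."""

    def __init__(self, bounds: list) -> None:
        # A final +Inf bucket catches everything above the largest bound.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self) -> dict:
        """Map each upper bound to the number of observations <= that bound."""
        running = 0
        result = {}
        for bound, count in zip(self.bounds, self.counts):
            running += count
            result[bound] = running
        return result


if __name__ == "__main__":
    latency = Histogram([0.1, 0.5, 1.0])
    for seconds in (0.05, 0.1, 0.3, 2.0):
        latency.observe(seconds)
    print(latency.cumulative())
```

The cumulative bucket layout is what lets a query estimate percentiles after the fact, at the cost of fixing the bucket bounds up front.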

F17-C Traces

#	Topic	Slug	Stage	Outline
16	Distributed tracing basics	`16-distributed-tracing-basics`	🌱	Trace / Span / Context, propagation, parent-child spans
16-2	Microservice tracing in practice	⛔️ `micro-service/43-observability-tracing`	🌿	Cross-series
17	OpenTelemetry in depth	`17-opentelemetry`	🌱	SDK integration, Exporter, Collector; one API across languages
18	Context propagation	`18-context-propagation`	🌱	W3C Trace Context, B3 headers, Baggage; passing context across services and message queues
19	Multi-node monitoring	⛔️ `infra/19-multi-node-monitoring`	🌿	Cross-series
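
Rows 16 and 18 describe how a trace stays connected across services: the trace id is carried in a header and each hop creates a new span under it. A minimal sketch of reading and re-emitting a W3C Trace Context `traceparent` header with the standard library. `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical helper names, and the parser is deliberately strict (it treats every version like version `00`), so it is a simplification of the specification. In practice the OpenTelemetry SDK does this for you:

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the header for an outgoing call: same trace id, new span id."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    context = parse_traceparent(incoming)
    if context is not None:
        print(child_traceparent(context))
```

The same header value can be placed in message metadata when the hop is a queue rather than an HTTP call, which is the "across services and message queues" case in row 18.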

F17-D APM

#	Topic	Slug	Stage	Outline
20	APM (Application Performance Monitoring)	`20-apm-tools`	🌱	Datadog / New Relic / Elastic APM / Pyroscope, the strengths of each, self-hosted vs SaaS
21	Continuous profiling	`21-continuous-profiling`	🌱	Pyroscope / Parca / Polar Signals; continuous collection of CPU / heap / lock profiles

F17-E Alert

#	Topic	Slug	Stage	Outline
22	Alert design	⛔️ `infra/17-alerts-chatops` / `18-alert-webhook-integration`	🌿	Cross-series
23	Alert severity levels	`23-alert-severity`	🌱	Page (wake someone up at night) vs Notify (read during work hours) vs Info (not actively watched); SLO-based alerts
24	SLO / SLA / SLI	`24-slo-sla-sli`	🌱	Concepts from Google SRE, error budget, how to derive alert rules from SLIs
25	Writing runbooks	`25-runbook-writing`	🌱	Link the runbook from the alert; on-call follows it after taking the page; avoid relying on tribal knowledge every time
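
Rows 23 and 24 mention error budgets and deriving alert rules from SLIs. The arithmetic can be sketched as below. The function names are hypothetical, and the default threshold of 14.4 is the burn rate commonly cited from the Google SRE Workbook for a fast-burn page (2% of a 30-day budget spent in one hour); treat it as an assumption to tune rather than a recommendation from this roadmap:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    error_ratio is the SLI: failed requests / total requests in some window.
    """
    return error_ratio / (1.0 - slo)


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn faster than the threshold.

    Requiring both windows reduces pages for brief spikes that have already ended.
    """
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))  # 99.9% over 30 days
    print(should_page(0.02, 0.02, 0.999))
```

A 99.9% SLO over 30 days leaves roughly 43 minutes of budget, so an alert that pages at a burn rate of 1 would fire constantly, while one that waits for the budget to be gone fires too late.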

F17-F Dashboard

#	Topic	Slug	Stage	Outline
26	Dashboard design principles	`26-dashboard-design`	🌱	Service dashboards vs investigation dashboards, hierarchy
26-2	Microservice dashboards in practice	⛔️ `micro-service/45-observability-dashboard`	🌿	Cross-series

🔧 Hands-on notes

#	Topic	Slug	Stage	Outline
27	Running Prom + Grafana + Loki locally	`27-local-observability-stack`	🌱	One Docker Compose setup, compared against the proto stack
28	Threading Correlation ID / Trace ID through logs	`28-trace-id-in-log`	🌱	Case study: proto's request_id middleware; querying logs by trace_id
29	Load testing with observability to find bottlenecks	`29-stress-test-with-observability`	🌱	k6 + Prom + Grafana; confirm metrics are in place before load testing
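
Row 28 is about carrying one request id through every log line a request produces. A minimal sketch of that idea with `contextvars` and a `logging.Filter`. This is not proto's actual middleware; `request_id_var`, `RequestIdFilter`, and `handle_request` are hypothetical names, and `handle_request` only stands in for whatever a web framework's middleware hook would do:

```python
import contextvars
import logging
import sys
import uuid

# Holds the id of the request currently being handled; "-" outside any request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request id onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def handle_request(logger: logging.Logger, incoming_id: str = "") -> str:
    """Stand-in for a middleware: bind the id, run the handler, then reset."""
    # Reuse the caller's id if one was sent, otherwise mint a new one.
    request_id = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request handled")  # any log call in here carries the id
    finally:
        request_id_var.reset(token)
    return request_id


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stdout)
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s"))
    demo = logging.getLogger("demo")
    demo.setLevel(logging.INFO)
    demo.addHandler(handler)
    handle_request(demo, "abc123")
```

Because the id lives in a context variable, code deep inside the handler does not need the id passed down as an argument, and concurrent requests in the same process do not see each other's ids.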

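💣 Anti-pattern

#	Topic	Slug	Stage	Outline
30	Observability anti-patterns	`30-observability-antipatterns`	🌱	Only CPU / memory with no app metrics; every log at INFO; PII stuffed into logs; high-cardinality metrics blowing up Prom; alerts without runbooks; debugging distributed problems without a trace ID; dashboards on the wall that nobody looks at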

🧰 Related tooling

#	Topic	Slug	Stage	Outline
31	Observability tooling ecosystem	`31-observability-tooling`	🌱	Prometheus / Grafana / Loki / Tempo / OpenTelemetry Collector; Datadog / New Relic; eBPF tools (Pixie / Parca)

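📎 Supplementary

#	Topic	Slug	Stage	Outline
S01	Intro to eBPF observability	`s01-ebpf-observability`	🌱	Pixie / Parca, automatic instrumentation, observing production without code changes
S02	What a good monitoring system looks like	⛔️ `standards/07-good-monitoring-system`	🌿	Cross-series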

Chapter progress

Knowledge topics: 31 + 2 supplementary = 33
🌿 growing: 7 (cross-series)
🌱 seed: 26

Cross-series links

→ infra/15–19 (metrics / log / alert / multi-node)
→ micro-service/43–45 (tracing / logging / dashboard in microservice practice)
→ backend/express/generic-log, backend/conventions/ B07 #8
→ backend/stress-testing/ B19 (load testing with observability)
→ standards/07-good-monitoring-system
→ frontend/observability/ CH15 (RUM, front-end observability)

Terry Yao's Blog
