[aws] CloudWatch / X-Ray / Container Insights: The Three Pillars of Observability on AWS

CloudWatch / X-Ray / Container Insights: Understanding Your System

Your service went down. Do you know why? If your answer is "check the logs," you're using only a third of what observability offers.

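The Bottom Line

Observability rests on three pillars: metrics, logs, and traces. AWS covers them with three services: CloudWatch handles metrics and logs, X-Ray handles distributed tracing, and Container Insights focuses on container metrics for ECS/EKS. Together they answer everything from "where is the system slow?" to "which segment of the request is slow?"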

CloudWatch Metrics: Your Dashboard

CloudWatch is the core of AWS monitoring. Most AWS services publish metrics to CloudWatch automatically.

Metrics You Should Always Watch

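Service	Metric	Alert threshold	Meaning
EC2	CPUUtilization	>80%	Time to scale out
RDS	FreeableMemory	<500MB	The DB is about to run out of memory
RDS	ReadLatency	>5ms	Queries are too slow or IOPS are insufficient
ALB	TargetResponseTime	>1s	Backend responses are too slow
ALB	HTTPCode_Target_5XX_Count	>0	The app is returning 5xx errors
ECS	CPUUtilization	>70%	Give the task more CPU or scale out
NAT GW	BytesOutToDestination	Watch the bill	NAT traffic is expensive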

Creating a CloudWatch Alarm

# Notify when average CPU stays above 80% for 10 minutes (2 consecutive 5-minute periods)
aws cloudwatch put-metric-alarm \
  --alarm-name "ec2-high-cpu" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:ops-alerts \
  --dimensions Name=InstanceId,Value=i-xxx
 
# Alarm on the RDS connection count
aws cloudwatch put-metric-alarm \
  --alarm-name "rds-high-connections" \
  --metric-name DatabaseConnections \
  --namespace AWS/RDS \
  --statistic Average \
  --period 300 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:ap-northeast-1:123456789012:ops-alerts \
  --dimensions Name=DBInstanceIdentifier,Value=prod-postgres
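
The CLI flags above map directly onto the keyword arguments of boto3's `put_metric_alarm`. The sketch below builds those arguments in plain Python so the evaluation window (`Period` × `EvaluationPeriods`) is explicit. The helper names `build_alarm_params`, `alarm_window_seconds`, and `create_alarm` are made up for illustration; the instance ID and SNS topic ARN are the placeholders from the commands above.

```python
def build_alarm_params(name, namespace, metric, dimensions, threshold,
                       period=300, evaluation_periods=2, sns_topic_arn=None):
    """Return keyword arguments for boto3's cloudwatch.put_metric_alarm()."""
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


def alarm_window_seconds(params):
    """Roughly how long the metric must stay in breach before the alarm fires."""
    return params["Period"] * params["EvaluationPeriods"]


def create_alarm(params, region="ap-northeast-1"):
    """Send the alarm to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it
    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)


cpu_alarm = build_alarm_params(
    name="ec2-high-cpu",
    namespace="AWS/EC2",
    metric="CPUUtilization",
    dimensions={"InstanceId": "i-xxx"},
    threshold=80,
    sns_topic_arn="arn:aws:sns:ap-northeast-1:123456789012:ops-alerts",
)
print(alarm_window_seconds(cpu_alarm))  # 300 s x 2 periods = 600
```

Keeping the parameter building separate from the API call makes it easy to review or unit-test alarm definitions before anything touches the account.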

Custom Metrics

Built-in AWS metrics not enough? You can publish your own:

# Publish a custom metric (for example, order count)
aws cloudwatch put-metric-data --namespace "MyApp" \
  --metric-name OrderCount \
  --value 42 \
  --unit Count \
  --dimensions Environment=production,Service=checkout

# Python SDK (publish a metric from inside your app)
import boto3
 
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'OrderProcessingTime',
        'Value': 1.23,
        'Unit': 'Seconds',
        'Dimensions': [
            {'Name': 'Environment', 'Value': 'production'},
            {'Name': 'Service', 'Value': 'checkout'}
        ]
    }]
)
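
Calling `put_metric_data` once per data point gets chatty, and API requests are billed. A common pattern is to buffer data points and send them in chunks. This is a minimal sketch, assuming a deliberately conservative chunk size of 20 (check the current PutMetricData quota before raising it). The helper names `make_datum`, `chunked`, and `flush` are hypothetical.

```python
def make_datum(name, value, unit, **dimensions):
    """Build one MetricData entry in the shape put_metric_data expects."""
    return {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        "Dimensions": [{"Name": k, "Value": v}
                       for k, v in sorted(dimensions.items())],
    }


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def flush(buffer, namespace="MyApp", chunk_size=20, client=None):
    """Send buffered data points; returns the number of API calls made."""
    if client is None:
        import boto3  # lazy import: only needed when talking to AWS
        client = boto3.client("cloudwatch")
    calls = 0
    for chunk in chunked(buffer, chunk_size):
        client.put_metric_data(Namespace=namespace, MetricData=chunk)
        calls += 1
    buffer.clear()  # everything was sent; start a fresh buffer
    return calls


# Collect a few timings, then send them in one go with flush(buffer).
buffer = [
    make_datum("OrderProcessingTime", seconds, "Seconds",
               Environment="production", Service="checkout")
    for seconds in (1.23, 0.87, 2.10)
]
```

Passing the client in as a parameter is a design choice for testability: a stub object with a `put_metric_data` method is enough to exercise the batching logic offline.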

CloudWatch Logs: Centralized Logging

ECS, Lambda, and API Gateway can all send logs to CloudWatch Logs.

Log Group Management

# Create a log group and set its retention period
aws logs create-log-group --log-group-name /ecs/web-app
aws logs put-retention-policy --log-group-name /ecs/web-app --retention-in-days 30
 
# Tail recent logs
aws logs tail /ecs/web-app --since 1h --follow
 
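# Query with Logs Insights (SQL-like syntax)
# Note: `date -d` is GNU date; on macOS use `date -v-1H +%s` for the start time.
aws logs start-query \
  --log-group-name /ecs/web-app \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message
    | filter @message like /ERROR/
    | sort @timestamp desc
    | limit 50'
# start-query only returns a queryId; fetch the rows with:
# aws logs get-query-results --query-id <queryId>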
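
The same query can be driven from Python. Because the query runs asynchronously, the caller has to poll for results. This is a minimal polling sketch using boto3's `start_query` and `get_query_results`. The function name `run_insights_query`, the poll interval, and the timeout are illustrative choices. The client is passed in so the function can be exercised without AWS access.

```python
import time


def run_insights_query(logs_client, log_group, query, start_ts, end_ts,
                       poll_seconds=1.0, timeout_seconds=60.0):
    """Start a Logs Insights query and block until it finishes.

    Returns a list of rows, each row a {field_name: value} dict.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start_ts),
        endTime=int(end_ts),
        queryString=query,
    )["queryId"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row arrives as a list of {"field": ..., "value": ...} cells.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"query {query_id} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

With real credentials, a call would look like `run_insights_query(boto3.client("logs"), "/ecs/web-app", "fields @timestamp, @message | filter @message like /ERROR/ | limit 50", time.time() - 3600, time.time())`.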

Logs Insights: Query Logs Like SQL

CloudWatch Logs Insights is one of the best-value features in AWS. The syntax is simple and queries run fast.

# Find the slowest API endpoints
# (assumes log lines shaped like: GET /api/orders 200 1523ms)
fields @timestamp, @message
| parse @message "* * * *ms" as method, path, status, duration
| filter duration > 1000
| stats count() as slowCount, avg(duration) as avgDuration by path
| sort slowCount desc
| limit 10
 
# Trend of 5XX errors
fields @timestamp
| filter @message like /HTTP\/\d\.\d" 5\d\d/
| stats count() as errorCount by bin(5m)
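
To make explicit what the slow-endpoint query computes, here is a rough local equivalent in Python. It assumes access-log lines shaped like `GET /api/orders 200 1523ms`. That format, the sample lines, and the function name `slowest_endpoints` are assumptions for illustration, not something this post specifies.

```python
import re
from collections import defaultdict

# Mirrors the parse step: method, path, status, duration in ms.
LINE = re.compile(r"^(\S+) (\S+) (\d{3}) (\d+(?:\.\d+)?)ms$")


def slowest_endpoints(lines, threshold_ms=1000, limit=10):
    """Return (path, slow_count, avg_duration_ms), sorted by slow_count desc."""
    durations = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # lines that do not fit the pattern are skipped
        _method, path, _status, duration = match.groups()
        duration = float(duration)
        if duration > threshold_ms:  # the "filter duration > 1000" step
            durations[path].append(duration)
    rows = [(path, len(d), sum(d) / len(d)) for path, d in durations.items()]
    rows.sort(key=lambda row: row[1], reverse=True)  # sort slowCount desc
    return rows[:limit]


sample = [
    "GET /api/orders 200 1523ms",
    "GET /api/orders 200 2477ms",
    "POST /api/checkout 500 1200ms",
    "GET /health 200 3ms",
]
print(slowest_endpoints(sample))
# [('/api/orders', 2, 2000.0), ('/api/checkout', 1, 1200.0)]
```

Note that the average is taken over slow requests only, because the filter runs before the aggregation, just as in the query.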

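X-Ray: Distributed Tracing

Your users say "the page is slow," but your app has five microservices. Which hop is the slow one? X-Ray draws the full path of every request for you.

Enabling X-Ray

# Add the X-Ray sidecar to the ECS task definition
# Add this to containerDefinitions: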
{
  "name": "xray-daemon",
  "image": "amazon/aws-xray-daemon:3.3.7",
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [
    { "containerPort": 2000, "protocol": "udp" }
  ]
}

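# Add the X-Ray SDK to a Python app
import requests
from flask import Flask
from aws_xray_sdk.core import xray_recorder, patch_all
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
 
app = Flask(__name__)
xray_recorder.configure(service='checkout-service')
XRayMiddleware(app, xray_recorder)
# Instrument supported libraries (requests, boto3, ...) so outbound calls appear as subsegments
patch_all()
 
# Trace an outbound call
@xray_recorder.capture('process_payment')
def process_payment(order_id):
    # The latency of this function is recorded as a subsegment
    response = requests.post('https://payment-api/charge', ...)
    return response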

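What X-Ray Tells You

Which services each request passed through
How long each segment took
Which service is the bottleneck
Where in the path an error occurred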
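
As an illustration of the bottleneck question, the sketch below walks a simplified trace tree and reports the node with the largest self time (its own duration minus its children's). The `name`, `start_time`, `end_time`, and `subsegments` fields follow the X-Ray segment document format, but the sample trace and the function names `self_times` and `find_bottleneck` are made up.

```python
def self_times(segment):
    """Yield (name, self_time_seconds) for a segment and all its subsegments."""
    children = segment.get("subsegments", [])
    total = segment["end_time"] - segment["start_time"]
    child_total = sum(c["end_time"] - c["start_time"] for c in children)
    # Assumes children run sequentially; overlapping children would be
    # double-counted, so the result is clamped at zero.
    yield segment["name"], max(total - child_total, 0.0)
    for child in children:
        yield from self_times(child)


def find_bottleneck(segment):
    """Return the (name, self_time) pair with the largest self time."""
    return max(self_times(segment), key=lambda pair: pair[1])


trace = {
    "name": "checkout-service", "start_time": 0.0, "end_time": 1.5,
    "subsegments": [
        {"name": "load_cart", "start_time": 0.25, "end_time": 0.5},
        {"name": "process_payment", "start_time": 0.5, "end_time": 1.5,
         "subsegments": [
             {"name": "payment-api", "start_time": 0.5, "end_time": 1.25},
         ]},
    ],
}
print(find_bottleneck(trace))  # ('payment-api', 0.75)
```

Here `process_payment` looks slow at one second overall, but most of that time is spent waiting on the downstream `payment-api` call, which is the kind of distinction a trace view makes visible.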

Container Insights: A Dedicated Dashboard for ECS/EKS

Container Insights automatically collects container-level metrics: CPU, memory, network, and storage.

# Enable ECS Container Insights
aws ecs update-cluster-settings --cluster prod-cluster \
  --settings name=containerInsights,value=enabled
 
# The CloudWatch console then shows:
# - CPU/memory utilization for each service
# - Task count trends
# - Network traffic
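
Container Insights data lands in CloudWatch as ordinary metrics, so it can also be pulled programmatically. The sketch below builds a `get_metric_data` request that computes a service's CPU utilization as `CpuUtilized / CpuReserved` with metric math. It assumes the `ECS/ContainerInsights` namespace with `ClusterName` and `ServiceName` dimensions. The service name `web-app` and the function names are placeholders.

```python
def cpu_utilization_queries(cluster, service, period=60):
    """Build MetricDataQueries for percent CPU used by one ECS service."""
    dimensions = [
        {"Name": "ClusterName", "Value": cluster},
        {"Name": "ServiceName", "Value": service},
    ]

    def stat(query_id, metric_name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "ECS/ContainerInsights",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": "Average",
            },
            "ReturnData": False,  # inputs only; the expression is returned
        }

    return [
        stat("used", "CpuUtilized"),
        stat("reserved", "CpuReserved"),
        {"Id": "pct", "Expression": "100 * used / reserved",
         "Label": "CPU %", "ReturnData": True},
    ]


def fetch_cpu_utilization(cluster, service, start, end):
    """Run the query (needs boto3 and AWS credentials)."""
    import boto3  # lazy import so the builder above works without it
    return boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=cpu_utilization_queries(cluster, service),
        StartTime=start,
        EndTime=end,
    )


queries = cpu_utilization_queries("prod-cluster", "web-app")
```

`fetch_cpu_utilization` takes `datetime` objects for `start` and `end`; the returned `MetricDataResults` would contain a single series labelled `CPU %`.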

Container Insights on EKS

# Install the CloudWatch agent (with Helm)
helm repo add aws-observability https://aws-observability.github.io/helm-charts
helm install cloudwatch-agent aws-observability/amazon-cloudwatch-observability \
  --namespace amazon-cloudwatch --create-namespace \
  --set clusterName=prod-cluster \
  --set region=ap-northeast-1

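Putting the Monitoring Trio Together

Question	What to use
"Is the system healthy overall?"	CloudWatch Dashboard
"One API is slow"	X-Ray Service Map
"An ECS task keeps restarting"	Container Insights + CloudWatch Logs
"The bill suddenly went up"	CloudWatch Billing Alarm
"What happened in the early hours yesterday?"	Logs Insights retrospective query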
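
A billing alarm differs from the other alarms in a few ways: billing data lives in the `AWS/Billing` namespace as `EstimatedCharges`, it is published only in `us-east-1`, and billing alerts must first be enabled in the account's billing preferences. Below is a sketch of the `put_metric_alarm` arguments. The 100 USD threshold, the alarm name, the SNS topic ARN, and the function names are placeholder assumptions.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Keyword arguments for put_metric_alarm on estimated monthly charges."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # 6 hours; billing data is refreshed only a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_billing_alarm(params):
    """Create the alarm (needs boto3 and AWS credentials)."""
    import boto3  # lazy import so the builder above works without it
    # Billing metrics are only available in us-east-1.
    boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)


params = billing_alarm_params(
    100, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
```

The SNS topic in this sketch sits in `us-east-1` as well, so the alarm and its notification target are in the same region.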

Self-Hosted vs. AWS

Aspect	Self-hosted	AWS
Metrics	Prometheus + Grafana	CloudWatch Metrics + Dashboard
Logs	EFK (Elasticsearch + Fluentd + Kibana)	CloudWatch Logs + Logs Insights
Traces	Jaeger / Zipkin	X-Ray
Container monitoring	cAdvisor + kube-state-metrics	Container Insights
Alerting	Alertmanager	CloudWatch Alarms + SNS
Cost	Machine costs (EFK is very resource-hungry)	Billed by log volume / metric count
Maintenance	High (Elasticsearch OOMs are routine)	Low (managed)

If you've read Metrics & Monitoring and Log Management, CloudWatch is essentially Prometheus + EFK rolled into one service: you lose the flexibility of self-hosting, but you skip the pain of maintaining Elasticsearch.

K8s Mapping

AWS monitoring	K8s equivalent
CloudWatch Metrics	Prometheus + kube-state-metrics
CloudWatch Alarms	Alertmanager
CloudWatch Dashboard	Grafana
CloudWatch Logs	Loki / EFK
Logs Insights	LogQL / Kibana
X-Ray	Jaeger / Tempo
Container Insights	Prometheus Operator + cAdvisor

For detailed K8s monitoring setup, see K8s Monitoring.

Series Navigation

Previous	Next
DynamoDB	IRSA

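The point of monitoring isn't to collect data. It's to spot trouble before it happens, or at least to know where to look for answers when it does.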

Terry Yao's Blog
